You cannot demand that an engineer “not worry about the name or standard execution of each move – it’s all feeling”. LWYMMD. I systematised it.
In the last 2 months, I’ve begun taking Cuban salsa classes at the local designated club. I’ve gotten so hooked that I genuinely can’t remember what occupied my thoughts and free time in the 23.75 years prior to this, and of course, like all excellent students, I’ve been taking notes.
Here’s a fun fact about me: I graduated with the joint title of engineer (ir.) and master of science (MSc) in computer science, without anyone ever teaching me about sorting algorithms. It was only a couple of months after I was hired as a PhD student that I taught myself sorting algorithms, and that was because I had to teach them to bachelor students in turn the week after. Inevitably, while studying other data structures, I also encountered the concept of amortisation. I vaguely remembered coming across its Wikipedia page once and taking away that it meant something like “proving that something is fast via black magic”. And yet, it really is not that deep.
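To give a taste of how un-deep it is, here is a minimal sketch of the textbook example of amortised analysis: appending to a dynamically doubling array. The charge of 3 credits per append and all variable names are illustrative choices of mine, not taken from any particular source.

```python
# Amortised analysis by the accounting method, sketched on a dynamically doubling array.
def simulate_appends(n: int, credits_per_append: int = 3) -> int:
    capacity, size, bank = 1, 0, 0
    for _ in range(n):
        bank += credits_per_append      # the amortised cost we *charge* for this append
        cost = 1                        # actual cost: writing the new element
        if size == capacity:            # array is full: copy everything into a bigger one
            cost += capacity
            capacity *= 2
        bank -= cost                    # pay the actual cost out of the saved credits
        assert bank >= 0, "savings never go negative, so total actual cost <= 3n appends"
        size += 1
    return bank

simulate_appends(1_000)  # runs without tripping the assertion
```

The balance of saved credits never dips below zero, so the occasional expensive resize is fully prepaid and each append is O(1) on average: that bookkeeping is the whole trick.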
A great example of why you need double-entry bookkeeping and balance sheets.
There are (at least) two claims that have never made sense to me since I first heard them. One is that double-entry bookkeeping, where you track the capital inside a company by keeping two lists of money amounts that add up to the same total, is actually useful and not a formality that is 50% redundant. The other is that fractional-reserve banking (FRB), whereby a bank is not forced to have at least as much cash on hand as the combined total of all account balances open at that bank, allows banks to create cycles of infinite money. The former is true but hard to believe at first. The latter is false but hard to disbelieve at first. Both become apparent when considering the typical claim about FRB through the lens of double-entry bookkeeping.
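For a flavour of that lens, here is a minimal sketch of double-entry bookkeeping (the account names and amounts are made up for illustration, not a real accounting system): every transaction is a set of signed entries that must sum to zero, so the two sides of the books can never drift apart.

```python
from collections import defaultdict

class Ledger:
    """Toy double-entry ledger: assets are positive, liabilities negative."""
    def __init__(self):
        self.balances = defaultdict(float)

    def post(self, entries):
        """entries: list of (account, signed amount) that must sum to zero."""
        assert abs(sum(amount for _, amount in entries)) < 1e-9, "transaction does not balance"
        for account, amount in entries:
            self.balances[account] += amount

bank = Ledger()
# A customer deposits 100: the bank gains cash (an asset) and now owes the customer 100 (a liability).
bank.post([("cash", +100.0), ("customer deposits", -100.0)])
# The bank lends 90 of that cash out: one asset is swapped for another, and the books still balance.
bank.post([("cash", -90.0), ("loans outstanding", +90.0)])
# The trial balance is always exactly zero: the "redundant" second list is what makes errors visible.
assert abs(sum(bank.balances.values())) < 1e-9
```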
Disentangling the strangeness of relative distance and more.
The promise of DeBERTa is that it does away with the absolute positional embeddings added at the start of the transformer encoder, and replaces them with an attention mechanism that takes the relative distance between tokens into account, in every head of every layer. Since absolute positional embeddings are the only part of a transformer whose dimensionality is tied to the length of the input, would taking them out mean DeBERTa can process an infinite number of tokens at once? Let’s analyse.
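As a taste of the mechanism in question, here is a simplified sketch of attention with a relative-position term clipped to a maximum distance. This is not DeBERTa’s exact disentangled formulation (which has additional terms and per-head projections); all sizes and names below are placeholders.

```python
import torch

def relative_attention_scores(q, k, rel_emb, max_dist):
    """q, k: (seq_len, d) queries/keys for one head; rel_emb: (2*max_dist, d) learned
    embeddings, one per clipped relative distance. Returns unnormalised scores."""
    seq_len, d = q.shape
    scores = q @ k.T                                                    # content-to-content term
    pos = torch.arange(seq_len)
    # Relative distance j - i, clipped and shifted into [0, 2*max_dist - 1] to index the table.
    delta = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist - 1) + max_dist
    scores = scores + torch.einsum("id,ijd->ij", q, rel_emb[delta])     # content-to-position term
    return scores / d**0.5

# Placeholder sizes just to run the sketch.
seq_len, d, max_dist = 6, 8, 4
probs = torch.softmax(
    relative_attention_scores(torch.randn(seq_len, d), torch.randn(seq_len, d),
                              torch.randn(2 * max_dist, d), max_dist),
    dim=-1,
)
```

Notice that the sketch clips the relative distance to a fixed window, which already hints at where the catch in the “infinite context” question might lie.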
The famous roberta-base HuggingFace checkpoint is a serialised version of a RobertaForMaskedLM model, consisting of a roberta field and an lm_head field. Despite this, you can still call .from_pretrained("roberta-base") on RobertaForTokenClassification and get an object that has a roberta field with exactly the checkpoint’s roberta weights, but a head with a different architecture and randomly initialised weights. Even more strikingly, you can call .from_pretrained("roberta-base") on RobertaModel, which is what normally sits inside that roberta field and consists of the fields embeddings and encoder, and somehow it can still match all the relevant checkpoint weights to the correct fields. Ever wondered how that’s done? Here’s how.
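To see it in action, the snippet below loads the same checkpoint into the three classes mentioned above; the printed key prefixes are the clue to how the matching works (num_labels=3 is an arbitrary choice of mine).

```python
from transformers import RobertaForMaskedLM, RobertaForTokenClassification, RobertaModel

mlm  = RobertaForMaskedLM.from_pretrained("roberta-base")             # matches every weight in the checkpoint
clf  = RobertaForTokenClassification.from_pretrained("roberta-base",  # body loaded; transformers prints a warning
                                                     num_labels=3)    # listing the freshly initialised head weights
body = RobertaModel.from_pretrained("roberta-base")                   # bare body, matched despite the prefix mismatch

print(list(mlm.state_dict())[:1])   # keys look like "roberta.embeddings. ..."
print(list(body.state_dict())[:1])  # keys look like "embeddings. ..."
```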
Cross-platform app development is surprisingly easy in 2024. Learnt it in a weekend.
Knowing how to build mobile apps is a skill that will likely stay relevant for the foreseeable future. I’ve had some ideas for mobile apps in the past, and I wanted to make sure that if I was ever going to learn to develop them, it would be in a framework that lets one code base run on both iOS and Android. We’re in luck: Google’s Flutter framework, built on top of the Dart language, does exactly that. Here are my notes from learning it.
A short tutorial on how to lay out a repo, declare metadata, install code in editable mode, and do it all recursively.
At the time of writing, I’ve publicly released a handful of Python packages (five, to be precise: Fiject, BPE-knockout, TkTkT, LaMoTO, and MoDeST) covering various parts of the NLP research workflow, with more on the way. It took me a bit to learn how to set up Python code as a package (also inaccurately called a “library”, or more accurately a “module”), and as I later discovered, it’s not so trivial to have one custom package be installed automatically upon installing another custom package, especially when you are the author of both and are already using a working version. Let’s dive straight in!
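As a teaser, here is roughly what the recursive part can look like in a plain setup.py; the package names and repository URL are hypothetical, and the same metadata could equally live in a pyproject.toml.

```python
# setup.py of a hypothetical package "mypackage" that depends on another
# self-authored package "myhelper" which is not on PyPI.
from setuptools import setup, find_packages

setup(
    name="mypackage",
    version="0.1.0",
    packages=find_packages(),   # every folder with an __init__.py becomes part of the package
    install_requires=[
        "numpy",                                                 # ordinary PyPI dependency
        "myhelper @ git+https://github.com/yourname/myhelper",   # PEP 508 direct reference; URL is made up
    ],
)
```

Running pip install -e . in the repo then installs mypackage in editable mode and pulls in myhelper automatically.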
A short formalisation of an obscure metric.
I was recently reading a paper on how to train an adapter between a given model and any tokeniser, and noticed that they measured their causal language modelling performance in bits-per-character (BPC) rather than the more standard perplexity (PPL), with no citation to show for it. Let’s dive into that!
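As a preview of the formalisation, here is one common way to relate the two metrics, assuming you have the summed negative log-likelihood (in nats) that a causal LM assigns to a corpus. Conventions differ on what counts as a character, so treat this as a sketch rather than that paper’s definition.

```python
import math

def bpc_and_ppl(total_nll_nats: float, n_tokens: int, n_chars: int):
    ppl = math.exp(total_nll_nats / n_tokens)        # perplexity: exponentiated mean NLL per token
    bpc = total_nll_nats / (math.log(2) * n_chars)   # bits-per-character: total NLL in bits, per character
    # The two are linked through the tokens-per-character ratio:
    assert math.isclose(bpc, (n_tokens / n_chars) * math.log2(ppl))
    return bpc, ppl
```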