Here is my take on it. A summary thread. 1/n

We have a text string, split into tokens (<=512). Each token gets a 768-dim vector. So we have a 2D matrix X: 768 columns, one row per token, so its height varies with the text. We want to set up a feed-forward layer that somehow transforms X while keeping its shape.

Only acting on the embedding dimension would process each token separately, which is clearly not sufficient: tokens never get to exchange information.
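A minimal numpy sketch of that token-only path (toy sizes; names are mine):

import numpy as np

n_tokens, d = 6, 768                 # toy sequence length, 768-dim embeddings
X = np.random.randn(n_tokens, d)     # stand-in for the token matrix
W = np.random.randn(d, d)            # acts on the embedding dimension only

Y = X @ W                            # same shape as X
# row i of Y depends only on row i of X -- tokens never see each other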

X @ X.T @ X has the same shape as X, and all tokens get to interact with all other tokens (via scalar products in X.T @ X). Neat!

f(X) = X @ X.T @ X is _almost_ a self-attention layer, but this f(X) has no trainable parameters.
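A quick sanity check of the shapes (numpy, toy sizes):

import numpy as np

n_tokens, d = 6, 768
X = np.random.randn(n_tokens, d)

A = X @ X.T            # (n_tokens, n_tokens): all pairwise dot products
Y = A @ X              # (n_tokens, d): each output row mixes all rows of X
assert Y.shape == X.shape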

Real self-attention fixes that with trainable query, key, and value matrices. But conceptually, values can be folded into the MLP later on, and keys are redundant given queries. We basically need only 1 matrix (KQ) that defines the inner product between tokens. 5/n

Then the token vectors are added up with attention weights (a softmax over those inner products).

The interpretation is that this lets each token affect every other token based on the affinity between them. 6/n
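Putting 5/n and 6/n together, here's a toy sketch of that single-matrix view of attention (my own code, not the standard Q/K/V parameterization; I also add the usual sqrt(d) scaling):

import numpy as np

def simple_self_attention(X, W_kq):
    # affinity between tokens i and j: x_i @ W_kq @ x_j
    scores = X @ W_kq @ X.T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over tokens
    return weights @ X                               # attention-weighted sum of tokens

n_tokens, d = 6, 768
X = np.random.randn(n_tokens, d)
W_kq = np.random.randn(d, d) * 0.01
assert simple_self_attention(X, W_kq).shape == X.shape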

An MLP layer (with 1 hidden layer) then processes each token separately. Every token goes through the same weights, so it's like a 1×1 convolution along the sequence. 7/n
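A sketch of that per-token MLP (toy numpy; real BERT uses GELU and adds residuals/LayerNorm, which I skip here):

import numpy as np

def token_mlp(X, W1, b1, W2, b2):
    # identical weights applied to every row (token) of X
    H = np.maximum(0, X @ W1 + b1)     # hidden layer (ReLU here, GELU in BERT)
    return H @ W2 + b2                 # project back to d dims

d, d_hidden = 768, 3072
W1, b1 = 0.01 * np.random.randn(d, d_hidden), np.zeros(d_hidden)
W2, b2 = 0.01 * np.random.randn(d_hidden, d), np.zeros(d)

X = np.random.randn(6, d)
assert token_mlp(X, W1, b1, W2, b2).shape == X.shape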

One such block has ~7M params.

BERT-base stacks 12 blocks, giving ~85M params. (Plus another 25M for initial token encoding.) 8/n
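Back-of-the-envelope for those counts (ignoring biases and LayerNorm; exact figures differ a bit):

d, d_hidden, n_layers = 768, 3072, 12

attn = 4 * d * d               # W_Q, W_K, W_V, W_out: ~2.4M
mlp = 2 * d * d_hidden         # two dense layers: ~4.7M
per_block = attn + mlp         # ~7M

print(per_block, n_layers * per_block)    # ~7M per block, ~85M for 12 blocks
# token + position embeddings (~30K vocab x 768, plus 512 x 768) add the rest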

BERT masks random tokens.

GPT always masks *the last* token only.

This is what makes GPT good at generating texts: it’s trained to always append a token. 9/n
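A toy sketch of the two masking schemes as framed above (toy ids, my own simplification):

import numpy as np

tokens = np.array([17, 42, 5, 99, 3, 61])     # toy token ids
MASK = -1                                      # stand-in for the [MASK] id

# BERT-style: hide a random subset, train to reconstruct it
bert_input = tokens.copy()
hidden = np.random.rand(len(tokens)) < 0.15    # ~15% of positions
bert_input[hidden] = MASK

# GPT-style, as framed above: see the prefix, predict the token that follows
gpt_input, gpt_target = tokens[:-1], tokens[-1]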

It’s the same thing!

Images are split into 16×16 patches. Patches play the role of tokens. A 224×224 image gives 196 patches of 768 dimensions each (16×16×3 = 768). This can go directly into a BERT-like architecture + classifier softmax on top. 10/n
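The patch step is just a reshape; a minimal numpy sketch:

import numpy as np

img = np.random.rand(224, 224, 3)              # toy RGB image
P = 16                                         # patch size

patches = img.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)
print(patches.shape)                           # (196, 768): 196 "tokens", 768 dims each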

Amazing that it works!

But with a big caveat: it requires a fixed input size 🙁 11/n

1. Intro lecture on Transformers by @giffmana:

2. Building nanoGPT from scratch by @karpathy:

3. Analyzing attention-only circuits by @ch402 and team: transformer-circuits.pub/2021/framework…

I highly recommend all three! 12/12