Here is my take on it. A summary thread. 1/n
We have a text string, split into tokens (≤512). Each token gets a 768-dim vector; stacking these as columns gives a 2D matrix X (768 × n_tokens) of arbitrary width. We want to set up a feed-forward layer that would somehow transform X, keeping its shape.
Only acting on the embedding dimension would process each token separately, which is clearly not sufficient.
X @ X.T @ X has the same shape as X, and all tokens get to interact with all other tokens (via scalar products in X.T @ X). Neat!
f(X) = X @ X.T @ X is _almost_ a self-attention layer, but this f(X) has no trainable parameters.
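Here is a minimal numpy sketch of that parameter-free version, assuming toy sizes and the tokens-as-columns layout of X used above:

```python
import numpy as np

d, n_tokens = 768, 7               # toy sizes; n_tokens is arbitrary (<=512 in BERT)
X = np.random.randn(d, n_tokens)   # one 768-dim column per token

affinity = X.T @ X                 # (n_tokens, n_tokens): scalar products between tokens
out = X @ affinity                 # same as X @ X.T @ X

assert out.shape == X.shape        # shape is preserved, but nothing here is trainable
```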
A real self-attention layer does have trainable parameters: query, key, and value matrices. But conceptually, values can be folded into the MLP later on, and keys are redundant given queries. We basically need only 1 matrix (a combined KQ matrix) that defines the inner product between tokens. 5/n
These affinities go through a softmax to give attention weights, and tokens are then added up with those weights.
The interpretation is that this allows each token to affect every other token, based on the affinity between them. 6/n
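A hedged sketch of this single-matrix view (the name W_KQ, the toy sizes, and the omitted 1/sqrt(d) scaling and multi-head split are simplifications of my own):

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

d, n_tokens = 768, 7
X = np.random.randn(d, n_tokens)      # one 768-dim column per token
W_KQ = np.random.randn(d, d) * 0.01   # the single trainable matrix

affinity = X.T @ W_KQ @ X             # (n_tokens, n_tokens) affinities
weights = softmax(affinity, axis=0)   # each output token's weights sum to 1
out = X @ weights                     # weighted sum of token vectors

assert out.shape == X.shape
```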
An MLP layer (with 1 hidden layer) then processes each token separately. Every token is processed with the same weights, so it’s like a 1×1 convolution. 7/n
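A sketch of that per-token MLP (BERT-base uses a 3072-dim hidden layer with GELU; ReLU here to keep it short):

```python
import numpy as np

d, d_hidden, n_tokens = 768, 3072, 7
X = np.random.randn(d, n_tokens)

W1 = np.random.randn(d_hidden, d) * 0.01   # shared across all tokens
W2 = np.random.randn(d, d_hidden) * 0.01

hidden = np.maximum(W1 @ X, 0)   # applied column by column, i.e. token by token
out = W2 @ hidden                # shape (768, n_tokens), same as X

assert out.shape == X.shape
```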
One such block has ~7M params.
BERT-base stacks 12 blocks, giving ~85M params. (Plus another 25M for initial token encoding.) 8/n
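A rough back-of-the-envelope count behind these numbers (biases, layer norms and positional embeddings ignored; vocab size 30522 as in bert-base-uncased):

```python
d, d_hidden, n_layers, vocab = 768, 3072, 12, 30522

attention = 4 * d * d               # Q, K, V and output projections
mlp = d * d_hidden + d_hidden * d   # two dense layers
block = attention + mlp

print(f"per block: {block / 1e6:.1f}M")              # ~7.1M
print(f"12 blocks: {n_layers * block / 1e6:.1f}M")   # ~84.9M
print(f"token embeddings: {vocab * d / 1e6:.1f}M")   # ~23.4M (~25M with positions etc.)
```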
BERT masks random tokens and is trained to predict them.
GPT always masks *what comes next*: each token only sees the tokens before it, and the model predicts the next one.
This is what makes GPT good at generating texts: it’s trained to always append a token. 9/n
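A toy illustration of the two masking schemes (15% random masking for BERT, a lower-triangular causal mask for GPT; the exact numbers are just for show):

```python
import numpy as np

n_tokens = 6
rng = np.random.default_rng(0)

bert_mask = rng.random(n_tokens) < 0.15                  # True = token hidden, to be predicted
gpt_mask = np.tril(np.ones((n_tokens, n_tokens), int))   # position i attends only to j <= i

print(bert_mask)
print(gpt_mask)
```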
Vision Transformers (ViT)? It’s the same thing!
Images are split into 16×16 patches. Patches play the role of tokens. 224×224 image gives 196 patches of 768 dimensions. This can directly go into a BERT-like architecture + classifier softmax on top. 10/n
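A sketch of that patching step (just numpy reshapes on a random image, to check the 196 × 768 arithmetic):

```python
import numpy as np

image = np.random.rand(224, 224, 3)   # H x W x RGB
p = 16

patches = image.reshape(224 // p, p, 224 // p, p, 3)              # cut both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)

print(patches.shape)   # (196, 768): 196 "tokens" of 768 dimensions each
```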
Amazing that it works!
But with a big caveat: it requires a fixed input size 🙁 11/n
1. Intro lecture on Transformers by @giffmana:
2. Building nanoGPT from scratch by @karpathy:
3. Analyzing attention-only circuits by @ch402 and team: transformer-circuits.pub/2021/framework…
I highly recommend all three! 12/12