Here is my take on it. A summary thread. 1/n
We have a text string, split into tokens (<=512). Each token gets a 768-dim vector. So we have a 2D matrix X with one 768-dim row per token, where the number of rows depends on the input length. We want to set up a feed-forward layer that would somehow transform X, keeping its shape.
Only acting on the embedding dimension would process each token separately, which is clearly not sufficient.
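A minimal shape sketch of that setup (numpy, with a made-up token count; the per-token linear layer here is just an illustration of "acting on the embedding dimension only"):

```python
import numpy as np

n_tokens, d = 7, 768                # up to 512 tokens in BERT; 7 is just an example
X = np.random.randn(n_tokens, d)    # one 768-dim embedding per token

# A layer acting only on the embedding dimension: the same weight matrix
# is applied to every row, so token i never sees token j.
W = np.random.randn(d, d) / np.sqrt(d)
Y = X @ W                           # shape (7, 768) -- preserved, but no token mixing
print(X.shape, Y.shape)
```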
X @ X.T @ X has the same shape as X, and all tokens get to interact with all other tokens (via scalar products in X.T @ X). Neat!
f(X) = X @ X.T @ X is _almost_ a self-attention layer, but this f(X) has no trainable parameters.
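A quick check of the shape and interaction claim, as a numpy sketch with toy sizes:

```python
import numpy as np

n_tokens, d = 7, 768
X = np.random.randn(n_tokens, d)

A = X @ X.T            # (n_tokens, n_tokens): dot product between every pair of tokens
Y = A @ X              # (n_tokens, 768): same shape as X, every row mixes all tokens
print(A.shape, Y.shape)   # (7, 7) (7, 768)
```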
To give it trainable parameters, X is projected three ways with learned matrices: queries Q = X @ W_Q, keys K = X @ W_K, values V = X @ W_V. The attention weights are a softmax over the scaled dot products Q @ K.T, and the (value-projected) tokens are then added up with those weights.
The interpretation is that this allows each token to affect every other token based on the affinity they feel for each other. 6/n
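Putting the last few tweets together, a minimal single-head self-attention sketch (numpy; single head and random weights are simplifications, real layers use multiple heads plus an output projection):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_tokens, d = 7, 768
X = np.random.randn(n_tokens, d)

# Trainable projections (randomly initialized here)
W_q = np.random.randn(d, d) / np.sqrt(d)
W_k = np.random.randn(d, d) / np.sqrt(d)
W_v = np.random.randn(d, d) / np.sqrt(d)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d))   # (n_tokens, n_tokens) affinities, rows sum to 1
out = weights @ V                         # weighted sum of projected tokens, shape (7, 768)
print(out.shape)
```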
One such Transformer block (self-attention + feed-forward MLP) has ~7M params.
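A rough count behind that number, assuming BERT-base sizes (d = 768, FFN hidden width 4·d = 3072, biases and LayerNorm ignored):

```python
d, d_ff = 768, 3072            # BERT-base: FFN hidden size is 4 * d

attn = 4 * d * d               # W_Q, W_K, W_V and the output projection
ffn  = 2 * d * d_ff            # two linear layers in the MLP
print(attn, ffn, attn + ffn)   # 2359296 4718592 7077888  ~= 7M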
BERT masks random tokens and learns to predict them.
GPT only ever masks *the last* token: predict the next word from everything before it.
It's the same objective, predicting a masked token; the difference is just which positions get hidden.
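A toy way to see the "same thing" claim (illustrative only, with a [MASK] placeholder; real tokenizers and losses differ):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# BERT-style: hide a random position, predict it from everything else.
bert_input = ["the", "cat", "[MASK]", "on", "the", "mat"]   # target: "sat"

# GPT-style: hide the last position, predict it from everything before it.
gpt_input = ["the", "cat", "sat", "on", "the", "[MASK]"]    # target: "mat"

# Both are "fill in the masked token"; GPT just always masks the end,
# which is what lets it generate text left to right.
print(bert_input, gpt_input)
```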
Images are split into 16×16 patches. Patches play the role of tokens. A 224×224 image gives (224/16)² = 196 patches, each flattened to a 768-dim vector (16·16·3). This can go directly into a BERT-like architecture + classifier softmax on top. 10/n
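A shape-only sketch of that patching step (numpy; the reshape/transpose details are mine, an actual ViT uses a learned patch-embedding layer on top of this):

```python
import numpy as np

img = np.random.rand(224, 224, 3)   # H, W, C
p = 16                              # patch size

# Cut into a 14x14 grid of 16x16x3 patches, then flatten each patch.
patches = img.reshape(14, p, 14, p, 3).transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)
print(patches.shape)                # (196, 768) -- 196 "tokens" of dimension 768
```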
Amazing that it works!
1. Intro lecture on Transformers by @giffmana:
2. Building nanoGPT from scratch by @karpathy:
I highly recommend all three! 12/12