Architectures & Scaling
The Attention Mechanism
Attention lets every token decide which other tokens to look at. It's the core operation that made transformers replace RNNs.
intermediate · 8 min read
The intuition
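When you read "the cat sat on the mat", interpreting mat depends on the words that came before it: cat, sat, on. Older RNN architectures processed tokens one at a time and carried context forward through a single hidden state, so information about distant tokens tended to fade and long-range references suffered.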
Attention lets every token look at every other token directly, weighting them by relevance.
The math, briefly
For each token, the model produces three vectors:
- Query (Q) — what am I looking for?
- Key (K) — what do I match against?
- Value (V) — what do I contribute if matched?
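For a token at position i, its output is a weighted sum of the value vectors, where K and V stack the keys and values of every token as rows: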
attention(i) = softmax(Q_i · K^T / sqrt(d)) · V
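The softmax turns the dot-product scores into a probability distribution over all tokens in the sequence, including token i itself. Here d is the dimension of the query and key vectors. Dividing by sqrt(d) stops the scores from growing with d, which would otherwise saturate the softmax and shrink its gradients.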
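The formula above can be sketched in a few lines of plain Python. This is a minimal illustration for a single head, not a production implementation: it uses lists instead of a tensor library so the loop over positions stays visible, and the function names and toy Q, K, V values are made up for the example.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q and K are lists of n vectors of length d; V is a list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Score this query against every key, including its own.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: 3 tokens with 2-dimensional vectors (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution and sums to 1. In a real model Q, K, and V would come from learned projections of the token embeddings rather than being written by hand.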
Multi-head attention
A single attention head tends to capture one type of relationship. Multi-head attention runs several heads in parallel, each with its own learned Q, K, and V projections, and then combines their outputs. One head might learn syntax, another long-range coreference, another positional adjacency.
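As a rough sketch of how the heads fit together, the example below follows the formulation in "Attention Is All You Need": each head projects the input into a smaller space, runs scaled dot-product attention there, and the per-head outputs are concatenated and mixed by an output matrix. The random matrices are stand-ins for learned weights, and the names (heads, W_o, random_matrix) and toy sizes are assumptions for illustration only.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    # (n x m) times (m x p) -> (n x p), using plain lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    # Single-head scaled dot-product attention.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    # Stand-in for learned weights; a trained model would learn these.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head has its own (W_q, W_k, W_v) projections into a smaller space.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the heads' outputs per token, then mix them with W_o.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2  # hypothetical toy sizes
d_head = d_model // n_heads
X = random_matrix(n_tokens, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_o = random_matrix(n_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # one d_model-sized output vector per token
```

Because each head works in a d_model / n_heads slice, running several heads costs about the same as one full-width head while letting each head form its own attention pattern.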
Why this scales worse than you'd like
Attention is O(n²) in sequence length because every token attends to every token. A 100k-token context needs 10 billion attention scores per head, per layer. That's why long-context models are an engineering frontier, with approaches such as FlashAttention (which computes exact attention without materializing the full score matrix), sliding windows, sparse attention, and Mamba-style state-space alternatives.
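To make the quadratic growth concrete, here is a small back-of-the-envelope calculation. It counts scores for one head in one layer and assumes 2 bytes per score (16-bit floats), which is an illustrative choice rather than a property of any particular model. The function names are hypothetical.

```python
def attention_score_count(n_tokens):
    # Every token scores every token, so one head in one layer needs n * n scores.
    return n_tokens * n_tokens

def score_matrix_gib(n_tokens, bytes_per_score=2):
    # Memory to hold the full score matrix at once; 2 bytes assumes 16-bit floats.
    return attention_score_count(n_tokens) * bytes_per_score / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attention_score_count(n):>14,} scores, "
          f"{score_matrix_gib(n):8.3f} GiB per head per layer")
```

Multiplying the context length by 10 multiplies the score count by 100, which is why naively storing the full matrix stops being practical well before 100k tokens.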
Read the source
The paper "Attention Is All You Need" (Vaswani et al., 2017) is short and worth reading once you have the intuition above.