NLP Foundations
Attention Mechanism
How attention lets a model focus on the relevant parts of a sequence by computing weighted dependencies between every pair of positions.
intermediate · 7 min read
Attention answers a simple question: when generating a token, which prior tokens matter? Before attention, recurrent networks compressed everything into a fixed hidden state and forgot most of it. Attention sidesteps the bottleneck by letting every output position look at every input position directly.
The core math
Given queries Q, keys K, values V, attention is:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
The sqrt(d_k) scaling stops the dot products from growing too large as dimension increases. Softmax normalises the alignment scores into a probability distribution that you can then use to weight the value vectors.
Self-attention vs cross-attention
- Self-attention has
Q,K,Vall derived from the same sequence. Used inside both encoder and decoder blocks. - Cross-attention has
Qfrom one sequence andK,Vfrom another. The bridge between encoder and decoder in encoder-decoder models.
Why it works
A single attention head learns one type of dependency: nearest-neighbour, syntactic, coreference. Multiple heads in parallel each specialise. Stacking many transformer layers composes those dependencies into hierarchical structure.
When attention gets expensive
Self-attention is O(n^2) in sequence length. For long contexts (100k+ tokens) that quadratic cost is the dominant bottleneck. Sparse, linear, and recurrent variants (FlashAttention, Longformer, Mamba) all attack this from different angles.