Attention Mechanism

Attention answers a simple question: when generating a token, which prior tokens matter? Before attention, recurrent encoder-decoder networks compressed the entire input into a single fixed-size hidden state, so information from long inputs was easily lost. Attention sidesteps that bottleneck by letting every output position look at every input position directly.

The core math

Given queries Q, keys K, values V, attention is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

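The sqrt(d_k) scaling keeps the dot products from growing with dimension. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, and large scores push softmax into saturated regions where gradients become very small. Dividing by sqrt(d_k) brings the variance back to 1. Softmax then normalises each query's alignment scores into a probability distribution over the keys, which is used to weight the value vectors.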
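
The formula can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration rather than a production implementation: the function names, the shapes (3 queries, 5 keys, d_k = 8, d_v = 4), and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; softmax is invariant to this shift.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) alignment scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # output: (n_q, d_v)

# Illustrative random inputs (assumed shapes, not from any real model).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of weights sums to 1, so every output row is a weighted average of the rows of V.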

Self-attention vs cross-attention

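In self-attention, Q, K, and V are all derived from the same sequence. It is used inside both encoder and decoder blocks.
In cross-attention, Q comes from one sequence and K and V come from another. It is the bridge between encoder and decoder in encoder-decoder models: decoder positions supply the queries, and the encoder output supplies the keys and values.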
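
The only difference between the two is where the inputs to the projections come from, which a short NumPy sketch makes concrete. The projection matrices W_q, W_k, W_v, the sizes, and the names x and memory are hypothetical. For brevity the same random weights are reused for both calls, whereas a real model would learn separate weights for each attention layer.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention with a numerically stable softmax.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_model, d_k = 16, 8

# Hypothetical random projections; in a trained model these are learned.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

x = rng.normal(size=(6, d_model))        # e.g. decoder-side sequence, 6 tokens
memory = rng.normal(size=(10, d_model))  # e.g. encoder output, 10 tokens

# Self-attention: Q, K, V all come from x.
self_out = attention(x @ W_q, x @ W_k, x @ W_v)

# Cross-attention: Q comes from x, K and V come from memory.
cross_out = attention(x @ W_q, memory @ W_k, memory @ W_v)

print(self_out.shape, cross_out.shape)  # (6, 8) (6, 8)
```

In both cases the output has one row per query position. The length of the key/value sequence only changes how many positions each query averages over.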

Why it works

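A single attention head tends to pick up one type of dependency, for example nearest-neighbour, syntactic, or coreference relations. Multiple heads run in parallel, each with its own projections, so each can specialise in a different one, although in practice heads are not always cleanly interpretable. Stacking many transformer layers composes those dependencies into hierarchical structure.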
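
Running several heads in parallel amounts to splitting the model dimension into slices, attending within each slice independently, and concatenating the results. The sketch below is one common way to write this in NumPy. The function name, the output projection W_o, and all sizes are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    # Each head computes its own (n, n) attention pattern.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (n_heads, n, d_head)
    # Concatenate the heads back to (n, d_model) and mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

# Illustrative sizes and random weights (assumed, not from a real model).
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The weights array holds one attention pattern per head, which is what one would inspect to see whether different heads attend to different kinds of relationships.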

When attention gets expensive

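Self-attention is O(n^2) in sequence length: in compute and, in a naive implementation, in memory for the score matrix. For long contexts (100k+ tokens) that quadratic cost becomes the dominant bottleneck. Several lines of work attack it from different angles. Sparse attention (e.g. Longformer) restricts which pairs of positions are scored. Linear attention approximates softmax attention at linear cost. Recurrent-style state-space models (e.g. Mamba) replace attention altogether. FlashAttention is different in kind: it computes exact attention and remains quadratic in compute, but it avoids materialising the full n x n score matrix in slow GPU memory.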
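
A back-of-the-envelope calculation shows the scale of the problem. The sketch below counts the bytes needed to store one full score matrix, and compares the number of scored position pairs under full attention with a simple sliding-window mask. The window mask is a simplified illustration of local sparse attention, not the actual Longformer implementation, and the function names and sizes are assumptions.

```python
import numpy as np

def full_score_bytes(n, bytes_per_elem=4):
    # Size of one n x n score matrix (one head, float32 by default).
    return n * n * bytes_per_elem

def sliding_window_mask(n, w):
    # True where position i may attend to position j, i.e. |i - j| <= w.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

# 100k tokens: 10^10 scores, 4 bytes each.
print(full_score_bytes(100_000))  # 40000000000 bytes, about 40 GB per head

n, w = 1024, 64
mask = sliding_window_mask(n, w)
full_pairs = n * n
local_pairs = int(mask.sum())
print(full_pairs, local_pairs)  # 1048576 127936
```

With a fixed window the number of scored pairs grows roughly as n * (2w + 1), which is linear in n, whereas full attention grows as n^2.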