Positional Encodings

Self-attention treats its input as a set, not a sequence. Permute the tokens and the raw attention computation returns the same values in permuted order; nothing in softmax(QK^T/sqrt(d_k))V knows that token 5 came after token 4. Strip positional information out of a transformer with unmasked attention and "the dog bit the man" becomes indistinguishable from "the man bit the dog". Everything a language model knows about order, it knows because position was injected somewhere, whether by an explicit encoding or by the shape of the attention mask. How you inject it decides, more than almost any other design choice, how far past its training length the model can still read.
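
The permutation claim can be checked numerically. The following is a minimal sketch in plain Python, assuming a single unmasked head with no learned projections (the toy vectors in X double as queries, keys, and values); the helper names are illustrative and not taken from any library.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Unmasked single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

# Toy token vectors, used as Q, K and V at once (assumption: no projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
perm = [2, 0, 3, 1]
X_perm = [X[p] for p in perm]

out = attention(X, X, X)
out_perm = attention(X_perm, X_perm, X_perm)

# Row r of the permuted output should equal row perm[r] of the original output.
max_diff = max(
    abs(a - b)
    for r, p in enumerate(perm)
    for a, b in zip(out_perm[r], out[p])
)
print(max_diff)  # tiny: floating-point rounding only
```

Each token gets the same output vector regardless of where it sits in the list, which is exactly why order has to come from somewhere else.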

Absolute encodings: add a position signal

The original transformer added a fixed sinusoidal signal to each token embedding: position p, dimension i, gets sin or cos of p / 10000^(2i/d). Different dimensions oscillate at different wavelengths, so the network can in principle read both fine and coarse position. BERT and GPT-2 instead learned an embedding per absolute position, one row of a table indexed by slot 0, 1, 2, and so on.
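
As a rough sketch of the sinusoidal scheme, the function below builds the encoding for one position. It assumes an even model width and the usual layout in which pair i contributes a sine on dimension 2i and a cosine on dimension 2i+1; the function name and the toy width of 8 are illustrative.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Fixed encoding for one position: sin on even dims, cos on odd dims."""
    enc = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc

d_model = 8  # toy width, assumed even
pe0 = sinusoidal_encoding(0, d_model)
pe1 = sinusoidal_encoding(1, d_model)

# Wavelength (in positions) of each sin/cos pair: short for pair 0, long for the last.
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]
print(wavelengths)
```

The wavelengths grow geometrically from about 6 positions up towards 2*pi*10000, which is what lets the same vector carry both fine and coarse position.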

Both share a fatal limitation for long context. A learned table has no row for position 4096 if you only trained to 2048; the model has literally never seen that input and behaves erratically. Sinusoidal encodings are defined for any position, but the model never saw the far-out positions during training, so extrapolation degrades fast. Absolute position also encodes the wrong thing: what attention usually wants is not "token 312 attending to token 7" but "a token attending to something six places back". Relative distance is the load-bearing quantity, and absolute schemes only express it indirectly.

RoPE: rotate the query and key

Rotary Position Embedding (RoPE) is the modern default, used by Llama, Mistral, Qwen, and most open models. Instead of adding a vector, it rotates the query and key vectors by an angle proportional to their position. Pairs of dimensions are treated as 2D coordinates and spun by m*theta at position m, with each pair assigned a different base frequency theta.

The elegance is what happens in the dot product. When you take the inner product of a query rotated by m*theta and a key rotated by n*theta, the absolute angles cancel and only the difference (m - n)*theta survives. So RoPE encodes absolute position at write time but the attention score depends only on relative distance, exactly the quantity that generalises. Low-frequency dimension pairs rotate slowly and carry long-range position; high-frequency pairs rotate fast and carry local order. RoPE also leaves the vector norm untouched, so it composes cleanly with the rest of the block.
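
The cancellation is easy to verify on a toy example. This sketch assumes the common base of 10000 and the convention of rotating adjacent dimension pairs (0,1), (2,3), and so on; real implementations differ in how they pair dimensions, and the function names here are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0,x1), (x2,x3), ... by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # per-pair frequency (assumed base 10000)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (m - n = 3) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))

# Rotation preserves length.
norm_before = math.sqrt(dot(q, q))
norm_after = math.sqrt(dot(rope(q, 1005), rope(q, 1005)))
print(near, far, norm_before, norm_after)
```

The two scores agree up to rounding even though the absolute positions differ by a thousand, and the norm of q is unchanged by the rotation.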

ALiBi: bias the scores by distance

ALiBi (Attention with Linear Biases) takes a blunter route and skips positional embeddings entirely. It adds a penalty straight onto the pre-softmax attention scores, linear in the distance between query and key and scaled by a fixed per-head slope. For a query at position i and a key at position j <= i:

score(i, j) = q_i . k_j  -  slope_h * (i - j)

Tokens further back are penalised more, so each head attends to a soft, recency-biased window whose width is set by its slope. Because the bias is just a function of distance with no trained parameters tied to a maximum length, an ALiBi model trained on 1024 tokens keeps producing sensible perplexity at 2048 and beyond. The paper's title, "Train Short, Test Long", is the whole pitch. The cost is expressiveness: that monotonic recency penalty makes it harder to attend sharply to a single distant token, which is exactly what retrieval-style long-context tasks demand.
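
A small sketch of the bias matrix, under two assumptions worth flagging: the slopes follow the geometric sequence described in the ALiBi paper for head counts that are powers of two (for 8 heads: 1/2, 1/4, ..., 1/256), and the causal mask is folded into the same matrix as -inf purely for convenience. The function names are illustrative.

```python
import math

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to pre-softmax scores; -inf marks masked future positions."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # steepest head: narrowest recency window
for row in bias:
    print(row)
```

Nothing in either function depends on a maximum length, so the same code yields a valid bias for any seq_len; heads with small slopes keep a wide window, heads with large slopes a narrow one.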

The NoPE surprise

A 2023 result complicated the tidy story. On length generalisation for small decoder-only models, no positional encoding at all (NoPE) matched or beat ALiBi and RoPE. A causal mask alone breaks permutation symmetry: token i can see j only if j <= i, so the model can recover position by counting how many tokens it is allowed to attend to, and it learns attention patterns resembling relative schemes without being told to. The lesson is not "delete your positional encoding". Frontier models still ship RoPE because it gives sharper control at scale. The lesson is that position is partly emergent from causality, and explicit encodings are a strong prior, not a hard requirement.
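
The counting argument can be made concrete with a hand-built illustration. It assumes a head whose scores are all equal, so causal attention reduces to averaging over the visible prefix, and a hypothetical value channel in which only the first token carries a 1. This is one possible mechanism, not a claim about what trained models actually learn.

```python
def causal_uniform_attention(values):
    """Causal attention with all scores equal: position i averages values[0..i]."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        out.append(running / (i + 1))
    return out

# Hypothetical setup: only the first token carries a 1 in this value channel.
values = [1.0, 0.0, 0.0, 0.0, 0.0]
out = causal_uniform_attention(values)

# The output at position i is 1 / (i + 1), so the position can be read back off it.
recovered = [round(1 / o) - 1 for o in out]
print(out)
print(recovered)
```

The mask alone turns a position-blind average into a signal that shrinks with depth into the sequence, which later layers are free to use.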

Stretching the window after training

RoPE's relative property is the reason context-extension tricks work. Push a RoPE model past its training length and the low-frequency pairs, which never completed a full rotation during training, move into angles they never saw, and quality collapses. Position interpolation rescales positions so 8192 tokens map back into the 0-2048 range the model trained on, trading a little local resolution for a much longer window. YaRN refines this by interpolating the slow, long-range frequencies while leaving fast local ones nearly untouched, and its authors report recovering long-context quality with roughly an order of magnitude fewer fine-tuning tokens than earlier interpolation methods. None of this is free, but it is the standard path from a short-context base model to a long-context release.
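
Position interpolation itself is a one-line change, sketched below with the 2048 to 8192 example from the text. The base of 10000, the toy width of 8, and the function names are assumptions for illustration; the sketch covers only uniform interpolation, not YaRN's per-band treatment. It also lists which pairs complete a full rotation within the training length, which is one way to see why frequency bands behave differently.

```python
import math

def rope_angles(pos, d, base=10000.0):
    """Rotation angle of each dimension pair at a given (possibly fractional) position."""
    return [pos * base ** (-2 * i / d) for i in range(d // 2)]

train_len, target_len, d = 2048, 8192, 8  # toy width d, assumed for illustration
scale = train_len / target_len            # 0.25

def interpolated_angles(pos, d):
    """Position interpolation: shrink the position before computing angles."""
    return rope_angles(pos * scale, d)

# The last position of the extended window lands inside the trained range.
last = interpolated_angles(target_len - 1, d)
trained_bound = rope_angles(train_len, d)

# Cost: the fastest pair's step between neighbouring tokens shrinks by the same factor.
local_step = interpolated_angles(1, d)[0]

# Which pairs complete at least one full rotation within the training length?
wavelengths = [2 * math.pi * 10000.0 ** (2 * i / d) for i in range(d // 2)]
full_rotation = [w <= train_len for w in wavelengths]
print(local_step, full_rotation)
```

With these toy numbers the three fastest pairs wrap around many times during training, while the slowest pair has a wavelength of roughly 6283 positions and never finishes a single turn in 2048 tokens.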

When it falls down

Extrapolation is not the same as use. A model that stays numerically stable at 32k tokens may still ignore the middle of that window. Positional encoding fixes representability, not retrieval (see context-windows-long-context).
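RoPE frequency bands. Past the training length, the slowest-rotating pairs reach angles they never saw, while the fastest pairs have already cycled through every angle many times. Uniform interpolation fixes the first problem but compresses the fast pairs, shrinking the angular gap between neighbouring tokens; that is why naive context extension can blur local order, and why YaRN treats frequency bands differently.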
ALiBi and sharp recall. ALiBi's recency bias suits language modelling but underperforms when a task needs to attend precisely to one far-back token.
Tokeniser interaction. Position counts tokens, not words. A verbose tokenisation of non-English text (see tokenisation-bpe) consumes positional budget faster, so effective context in those languages is shorter than the advertised token count suggests.