Context Windows and Long-Context Models

The advertised context length is a marketing number and an engineering ceiling; the usable context is something smaller and harder to measure. A model can accept a million tokens, keep its perplexity stable across all of them, and still fail to answer a question whose evidence sat in the middle of the input. Two failure surfaces hide behind one number. The first is whether the model can represent and process a sequence that long at all. The second is whether it actually attends to the right part of it. Confusing the two is the most common mistake teams make when they buy "long context" and wonder why retrieval still beats it.

What sets the ceiling

Three pressures bound how long a window you can train and serve.

Attention compute is O(n^2). Every token attends to every prior token, so doubling the sequence quadruples the attention cost (see attention-mechanism). At 100k+ tokens this dominates the forward pass, which is why long context and inference-optimisation work are joined at the hip.
KV-cache memory is O(n). During generation the model caches a key and a value vector per layer (and per attention head) for every token. That cache grows linearly and, at long context, can exceed the model weights in memory; it is a main reason a 1M-token request is expensive to serve (see kv-cache).
Positional representability. The encoding has to stay sane at positions the model never trained on, which is the entire job of RoPE interpolation and YaRN (see positional-encodings).
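To put rough numbers on the first two pressures, here is a minimal back-of-envelope sketch. The function names and the model configuration are assumptions chosen for illustration, not the figures of any particular model, and the FLOP count covers only the two attention matmuls (scores and weighted sum), ignoring the savings from the causal mask.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache one key and one value vector per layer, per KV head, per token."""
    # Leading factor 2 = one key plus one value; bytes_per_value=2 assumes fp16/bf16.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


def attention_matmul_flops(n_tokens, n_layers, n_heads, head_dim):
    """Approximate FLOPs for QK^T and the attention-weighted sum over V."""
    # Each matmul costs roughly 2 * n^2 * head_dim FLOPs per head; there are two of them.
    return 2 * 2 * n_tokens**2 * head_dim * n_heads * n_layers


# Hypothetical configuration, for illustration only.
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        gb = kv_cache_bytes(n, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 1e9
        flops = attention_matmul_flops(n, N_LAYERS, N_HEADS, HEAD_DIM)
        print(f"{n:>9} tokens: KV cache {gb:8.2f} GB, attention {flops:.2e} FLOPs")
```

Under these assumed settings a single 1M-token sequence needs about 131 GB of cache at two bytes per value, and doubling the sequence length doubles the cache while quadrupling the attention FLOPs.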

Clear all three and you have a window that is long on paper. None of them guarantees the model uses it well.

Lost in the middle

The sharpest evidence for the gap is "Lost in the Middle" (Liu et al., 2023). Across multi-document QA and key-value retrieval, models recall information best when it sits at the start or end of the input and measurably worse when the same fact sits in the middle, tracing a U-shaped accuracy curve. The effect persists in models explicitly built and advertised for long context. So a document that buries the relevant clause under 80k tokens of preamble is a setup for failure, not because the model cannot hold 80k tokens, but because it systematically under-weights the centre. This result is a large part of why retrieval-augmented generation (see retrieval-augmented-generation) keeps winning: pulling the right 2k tokens to the top of the prompt largely sidesteps the positional bias.
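One way to check for this effect on your own model is to hold the question and the evidence fixed and vary only where the evidence sits. The sketch below is a minimal harness under stated assumptions: `place_needle` and `accuracy_by_depth` are hypothetical helper names, `ask` stands in for whatever function sends a prompt to your model and returns its text, and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable, Dict, List, Sequence


def place_needle(filler: Sequence[str], needle: str, depth: float) -> List[str]:
    """Insert `needle` among the filler documents at fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = int(depth * len(filler))
    return list(filler[:idx]) + [needle] + list(filler[idx:])


def accuracy_by_depth(
    ask: Callable[[str], str],  # hypothetical: sends a prompt to the model, returns its answer
    filler: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float],
) -> Dict[float, bool]:
    """For each depth, build a prompt with the needle at that depth and record whether
    the expected answer appears in the model's reply."""
    results = {}
    for depth in depths:
        docs = place_needle(filler, needle, depth)
        prompt = "\n\n".join(docs) + "\n\nQuestion: " + question
        results[depth] = expected in ask(prompt)
    return results
```

Run over many question and needle pairs at depths such as 0.0, 0.25, 0.5, 0.75 and 1.0, a dip in the averaged results around the middle depths would be the U-shape described above.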

Attention sinks

Part of why long context is fragile shows up in a strange empirical fact: transformers dump a large, near-constant share of attention onto the first few tokens, regardless of whether those tokens carry any meaning. StreamingLLM named these attention sinks and explained the mechanism. Softmax forces the attention weights to sum to one, so when a head has nothing it genuinely wants to look at, it has to put that mass somewhere; the initial tokens, always visible under the causal mask, become the default dump. The practical consequence is concrete: naive sliding-window attention that evicts those first tokens to save memory causes quality to fall off a cliff, because you threw away the sink the model relies on. Keep a handful of initial tokens plus a recent window and a model trained on finite context streams over millions of tokens without fine-tuning.
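The eviction rule in that last sentence can be sketched in a few lines. This is an illustration only, not the StreamingLLM implementation: `sink_window_keep` is a hypothetical helper, the default sizes are assumptions, and the sketch covers just which cache entries survive, not how positions are assigned inside the trimmed cache.

```python
from typing import List


def sink_window_keep(n_cached: int, n_sink: int = 4, window: int = 1024) -> List[int]:
    """Return the cache positions to keep: the first `n_sink` tokens plus the last `window` tokens."""
    if n_cached <= n_sink + window:
        # Nothing to evict yet.
        return list(range(n_cached))
    sinks = list(range(n_sink))
    recent = list(range(n_cached - window, n_cached))
    return sinks + recent


def naive_window_keep(n_cached: int, window: int = 1024) -> List[int]:
    """Plain sliding window for comparison: keeps only the last `window` tokens, evicting the sinks."""
    return list(range(max(0, n_cached - window), n_cached))
```

With 10 cached tokens, 2 sink tokens and a window of 3, `sink_window_keep(10, 2, 3)` keeps positions 0, 1, 7, 8 and 9, whereas the naive window keeps only 7, 8 and 9 and so discards exactly the tokens the heads use as a dump.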
