From 4K to a Million Tokens: How RoPE Scaling, YaRN, and Ring Attention Stretch the Context Window

In early 2023, the open-weight Llama models shipped with a 2,048-token context window, and the first Llama 2 release doubled that to 4,096. Two years later it is unremarkable for a production model to advertise 128K, 200K, or a million tokens of context. The models did not get fundamentally larger between those two points, and almost none of that extension came from pretraining on longer documents. It came from a small stack of tricks that rescale how a model perceives position, plus a way to spread one very long sequence across many GPUs so the attention matrix never has to live on a single device.

The story is worth telling carefully because the marketing number on the box ("1M context") and the length at which a model still reasons reliably are often a factor of ten apart. Understanding the gap requires understanding the mechanism.

Why this matters: When a vendor doubles the advertised context window overnight without retraining, they almost certainly changed the position encoding, not the model's capacity. Knowing which technique they used tells you where the quality will quietly fall apart, and at what length.

TL;DR

Transformers have no built-in sense of order; position is injected separately. Rotary Position Embedding (RoPE) does it by rotating query and key vectors by an angle proportional to position, which is why position can be rescaled after training.
Naive extrapolation past the trained length fails catastrophically: the low-frequency dimensions, which never complete a full turn during training, hit rotation angles the model has never seen and produce attention scores far outside the trained range.
Position Interpolation (PI) fixes this by squeezing positions back into the trained range; it works but blurs fine-grained local distance and needs fine-tuning.
NTK-aware scaling and YaRN spread the interpolation pressure unevenly across dimensions, preserving high-frequency detail and extending further with far less fine-tuning. YaRN reported 10x fewer tokens and 2.5x fewer training steps than PI.
LongRoPE searches for per-dimension scaling factors and reaches 2,048K tokens; it ships inside Microsoft's Phi-3.
Ring Attention is orthogonal to all of the above: it does not change positions at all. It shards the sequence across devices and passes key-value blocks around a ring, so memory per device shrinks by a factor equal to the device count.
The advertised window is an upper bound on tokens the model will accept, not on tokens it will use well. The RULER benchmark's initial evaluation found that of ten models claiming 32K or more, only four held up at 32K.

At a Glance

The two problems and the two families of solutions, on one page: position encoding decides whether the model understands far-apart tokens, and attention computation decides whether the hardware can hold them.

flowchart LR
  A["Short-context model<br/>trained at 4K"] --> B{"What blocks<br/>longer context?"}
  B -->|"Position unseen<br/>past 4K"| C["Position rescaling"]
  B -->|"Attention matrix<br/>too big for one GPU"| D["Distributed attention"]
  C --> C1["PI / NTK / YaRN<br/>/ LongRoPE"]
  D --> D1["Ring Attention"]
  C1 --> E["Usable long context"]
  D1 --> E
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class A blue
  class B slate
  class C,D purple
  class C1,D1 purple
  class E teal

Keep these two axes separate. Rescaling positions does nothing for memory; a 1M-token attention matrix is still enormous even if every position is encoded perfectly. Conversely, Ring Attention can hold a million tokens while the model still has no idea how to interpret a position it never saw in training. Real systems use both.

[IMAGE: Two-column schematic contrasting "position problem" (a number line where 4K is the edge of trained territory and 100K is off the map) against "memory problem" (an n-by-n attention grid overflowing a single GPU), with arrows pointing to the two solution families]

Before Long Context

The original Transformer used fixed sinusoidal position encodings added to the token embeddings (Vaswani et al., 2017, Attention Is All You Need, arXiv:1706.03762). Those are deterministic functions of position, defined in principle for any index, but models trained on short sequences never learned to use the high ones, so extrapolation was poor. Learned absolute position embeddings, used by BERT and GPT-2, were worse: an embedding for index 5,000 simply does not exist if you only trained 1,024 of them.

The shift that made modern long-context work possible was Rotary Position Embedding (Su et al., 2021, RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv:2104.09864). RoPE encodes position not by adding a vector but by rotating the query and key vectors, in two-dimensional slices, by an angle proportional to the token's position. Because attention depends on the dot product of a query and a key, and because rotating both by angles proportional to their positions makes the dot product depend only on their relative offset, RoPE bakes relative position directly into the attention score. This property, and the fact that the rotation angle is a continuous function of position, is exactly what later methods exploit: you can change the function that maps position to angle without touching a single weight.

timeline
  title Evolution of Long-Context Methods
  2017 : Sinusoidal encodings (Transformer)
  2021 : RoPE rotates Q and K (RoFormer)
  2023 Jun : Position Interpolation extends Llama to 32K
  2023 Aug : NTK-aware scaling and YaRN
  2023 Oct : Ring Attention shards across devices
  2024 Feb : LongRoPE reaches 2,048K tokens
  2024 Apr : RULER exposes the real-vs-claimed gap

The progression has a clear logic. Once RoPE made position a tunable function, the field needed less than a year, from Position Interpolation in mid-2023 to LongRoPE in early 2024, to find progressively better functions, and in parallel solved the orthogonal problem of fitting the sequence on the hardware.

How RoPE Scaling Actually Works

The frequencies are the whole story

RoPE splits each query and key vector of dimension $d$ into $d/2$ pairs. Pair $i$ is rotated by an angle $m \theta_i$, where $m$ is the token position and $\theta_i$ is a fixed per-pair frequency:

\[\theta_i = b^{-2i/d}, \quad i = 0, 1, \dots, \tfrac{d}{2}-1\]

The base $b$ is conventionally 10,000. Low-index pairs ($i$ near 0) have $\theta_i$ near 1: they rotate quickly as position increases, completing a full turn every few tokens. These are the high-frequency dimensions, and they encode fine-grained local position. High-index pairs have tiny $\theta_i$: they rotate slowly, turning perhaps once over thousands of tokens. These low-frequency dimensions encode coarse, long-range position.
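
To make the rotation concrete, here is a minimal pure-Python sketch of the scheme described above. The name `rope_rotate` and the toy vectors are illustrative, and pairing adjacent elements (0,1), (2,3), and so on is an assumption: real implementations differ in how they lay out the pairs. The check at the end compares two query-key pairs that share an offset of 3 but sit at very different absolute positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of `vec` by the angle position * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)          # per-pair frequency
        angle = position * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional query and key (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Same relative offset (3), very different absolute positions.
score_near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_far = dot(rope_rotate(q, 1010), rope_rotate(k, 1007))
print(score_near, score_far)
```

The two printed scores agree to floating-point precision, because only the offset between the two positions survives in the dot product.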

The wavelength of pair $i$, the number of tokens it takes to complete one full rotation, is:

\[\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \, b^{\,2i/d}\]

For the slowest dimensions, $\lambda_i$ can exceed the entire training context. Those dimensions never complete even one rotation during training, so the model only ever sees a narrow arc of their possible angles.
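
A quick way to see which pairs fall into that category is to compute the wavelengths directly. The sketch below assumes a head dimension of 128, base 10,000, and a 4,096-token training length; the helper name `rope_wavelengths` is illustrative.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength in tokens for each of the head_dim / 2 RoPE pairs."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

train_len = 4096
wavelengths = rope_wavelengths(128)

# Pairs whose wavelength exceeds the training length never finish one rotation.
never_complete = [i for i, lam in enumerate(wavelengths) if lam > train_len]

print(f"fastest pair: {wavelengths[0]:.1f} tokens per rotation")
print(f"slowest pair: {wavelengths[-1]:.0f} tokens per rotation")
print(f"{len(never_complete)} of {len(wavelengths)} pairs never complete a rotation")
```

Under these assumptions, pairs 46 through 63 (18 of the 64) have wavelengths longer than the training window.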

[IMAGE: Plot of RoPE wavelength versus dimension index for d=128, log y-axis, with a horizontal line at the 4,096-token training length showing how the slowest 18 of the 64 pairs never complete one rotation]

The frequencies are not an implementation detail. They are the reason a model trained on 4K tokens behaves so differently at 8K depending on which extension trick you use, because each trick redistributes the strain across these frequencies differently.

Why naive extrapolation explodes

Suppose the model trained on positions $0$ to $L-1$ and you feed it position $L + 5000$. The fast-rotating high-frequency dimensions are not the problem: a dimension with a wavelength of 8 tokens wrapped around hundreds of times during training, so every angle it can produce is already familiar. The slow, low-frequency dimensions are different. Their wavelengths exceed $L$, so training covered only a narrow arc of their circle, and positions past $L$ push them onto angles the model never encountered. The attention dot products that depend on these dimensions land far outside the distribution the network learned to handle, and the result is the "catastrophically high attention scores" that Position Interpolation's authors describe, where the softmax collapses and the model produces garbage (Chen et al., 2023, Extending Context Window of Large Language Models via Position Interpolation, arXiv:2306.15595).

[IMAGE: Side-by-side attention-score distributions, one within the trained range (bounded, well-behaved) and one extrapolated past it (heavy tail with extreme outliers), illustrating softmax collapse]

Position Interpolation: squeeze, do not stretch

The first robust fix was almost embarrassingly simple. Instead of letting position $m$ run past the trained range, scale it down so it fits:

\[m' = m \cdot \frac{L}{L'}\]

where $L$ is the original trained length and $L'$ is the target length. To reach 8x context, every position index is multiplied by $1/8$. Position 30,000 is presented to the rotation machinery as if it were position 3,750, comfortably inside trained territory. Chen and colleagues showed that the upper bound on the attention score under interpolation is at least roughly 600x smaller than the bound under extrapolation, which is the formal version of "interpolating is safe, extrapolating is not." With about 1,000 fine-tuning steps, they extended Llama models ranging from 7B to 65B parameters to context windows of up to 32,768 tokens.

The cost is resolution. Squeezing positions by 8x means two tokens that were 8 apart now look 1 apart to the high-frequency dimensions, so the model's ability to distinguish nearby positions degrades. That blurring is why PI needs fine-tuning to recover, and why the next generation of methods refused to scale every dimension by the same factor.
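
The remapping itself is a one-liner. The sketch below uses an illustrative helper, `interpolate_position`, with the 4,096-to-32,768 extension discussed above, and also prints the spacing between adjacent tokens after squeezing, which is where the lost resolution shows up.

```python
def interpolate_position(position, trained_len, target_len):
    """Position Interpolation: m' = m * L / L'."""
    return position * trained_len / target_len

L_TRAINED, L_TARGET = 4096, 32768

squeezed = interpolate_position(30000, L_TRAINED, L_TARGET)
last = interpolate_position(L_TARGET - 1, L_TRAINED, L_TARGET)
neighbor_gap = interpolate_position(1, L_TRAINED, L_TARGET)

print(squeezed)      # 3750.0
print(last)          # 4095.875, still inside the trained range
print(neighbor_gap)  # 0.125, adjacent tokens are now an eighth of a position apart
```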

NTK-aware scaling: protect the high frequencies

The insight behind NTK-aware scaling, which emerged from the open-source community in mid-2023 before being formalized in YaRN, is that uniform interpolation punishes the high-frequency dimensions the most while the low-frequency dimensions had room to spare. Instead of scaling positions, NTK-aware scaling changes the base $b$:

\[b' = b \cdot s^{\,d/(d-2)}\]

where $s = L'/L$ is the extension factor. Because the frequencies depend on the base raised to a dimension-dependent power, raising the base barely touches the high-frequency pairs (which were fine) and stretches the low-frequency pairs (which had slack). The interpolation pressure is moved off the dimensions that carry local detail and onto the ones that carry coarse position. This is why NTK-aware scaling can extend context modestly even with no fine-tuning at all.
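
The effect of the base change on each pair can be checked numerically. This is a small sketch assuming $d = 128$, $b = 10{,}000$, and $s = 8$; the helper names are illustrative. It compares every pair's frequency before and after the base change.

```python
def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base: b' = b * s ** (d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

def frequencies(base, head_dim):
    """theta_i = base ** (-2i / d) for each of the d / 2 pairs."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

base, scale, d = 10000.0, 8.0, 128
new_base = ntk_scaled_base(base, scale, d)

old = frequencies(base, d)
new = frequencies(new_base, d)
ratios = [n / o for n, o in zip(new, old)]

print(ratios[0])             # 1.0, fastest pair untouched
print(round(ratios[-1], 6))  # 0.125, slowest pair slowed by the full factor s
```

The exponent $d/(d-2)$ is chosen so that the slowest pair ends up scaled by exactly $1/s$, the same as under Position Interpolation, while the fastest pair is left alone and the pairs in between are scaled by progressively larger amounts. For these inputs the new base lands near 83,000.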

YaRN: ramp between interpolation and extrapolation

YaRN ("Yet another RoPE extensioN method") generalizes the NTK idea into a per-dimension decision (Peng et al., 2023, YaRN: Efficient Context Window Extension of Large Language Models, arXiv:2309.00071). It classifies each frequency by how many full rotations its wavelength completes within the original context window. Dimensions whose wavelength is much shorter than the context (many rotations) are left to extrapolate untouched, because the model has seen their full range. Dimensions whose wavelength exceeds the context (less than one rotation) are interpolated like PI, because the model has only seen a fraction of their range. Between those extremes, YaRN applies a smooth ramp. It adds one more trick: a temperature factor on the attention logits, scaling them by a constant that depends on the extension factor, which the authors found recovers perplexity lost in the interpolation. YaRN reported reaching the same quality as PI with roughly 10x fewer tokens and 2.5x fewer training steps, and released Llama 2 checkpoints at 64K and 128K context.
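
A rough sketch of the per-dimension ramp follows. The thresholds `alpha = 1` and `beta = 32` and the `0.1 * ln(s) + 1` form are the values the YaRN paper suggests for Llama-family models; they are not stated in this article, so treat them, along with the function names, as assumptions rather than a reference implementation.

```python
import math

def yarn_frequencies(head_dim, trained_len, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per pair, by rotation count."""
    freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)
        # Full rotations this pair completes inside the original context window.
        rotations = trained_len * theta / (2 * math.pi)
        # gamma = 1 keeps theta as-is; gamma = 0 divides it by scale, as PI would.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs

def yarn_embedding_scale(scale):
    """Factor applied to both queries and keys; the logits grow by its square."""
    return 0.1 * math.log(scale) + 1.0

freqs = yarn_frequencies(head_dim=128, trained_len=4096, scale=8.0)
print(freqs[0])                             # 1.0, fastest pair untouched
print(freqs[-1] * 8 / 10000 ** (-126/128))  # close to 1.0, slowest pair fully interpolated
print(yarn_embedding_scale(8.0))
```

With these thresholds, pairs that complete more than 32 rotations in the original window are left alone, pairs that complete fewer than one are interpolated in full, and everything between is blended linearly.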

LongRoPE: search instead of derive

Where YaRN derives its per-dimension treatment from a clean rule, LongRoPE searches for it (Ding et al., 2024, LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, arXiv:2402.13753). It uses an evolutionary search to find non-uniform rescaling factors per RoPE dimension and per token position, exploiting two kinds of non-uniformity that hand-derived methods leave on the table. With that better initialization it can manage an 8x extension with no fine-tuning, and with a progressive strategy (fine-tune to 256K, then interpolate again) it reaches 2,048K tokens. LongRoPE is integrated into Microsoft's Phi-3 models, which is a useful reminder that these are production techniques, not paper curiosities.

Seeing It in Motion

The four position-encoding methods are best understood as different answers to one question: which dimensions do you protect, and which do you sacrifice?

flowchart TD
  Start["Target: extend 4K to 32K (8x)"] --> Q{"How to map<br/>old to new positions?"}
  Q -->|"Scale all positions<br/>by 1/8"| PI["Position Interpolation<br/>uniform, blurs local detail"]
  Q -->|"Raise the base,<br/>spare high freq"| NTK["NTK-aware<br/>some zero-shot extension"]
  Q -->|"Ramp per dimension<br/>by wavelength"| YARN["YaRN<br/>+ logit temperature"]
  Q -->|"Search per-dim<br/>scale factors"| LR["LongRoPE<br/>reaches 2,048K"]
  PI --> FT["Needs fine-tuning"]
  NTK --> OK["Often no fine-tuning"]
  YARN --> EFF["10x fewer tokens vs PI"]
  LR --> PROG["Progressive: 256K then 2M"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  class Start blue
  class Q purple
  class PI,NTK,YARN,LR purple
  class FT amber
  class OK,EFF,PROG emerald

The memory problem has a completely different shape. Ring Attention does not care which position-encoding method you use; it solves the fact that a full attention matrix over $n$ tokens needs $O(n^2)$ space, which overflows a single accelerator long before a million tokens.

graph TD
  subgraph Ring["Ring of 4 devices"]
    G0["GPU 0<br/>seq block 0<br/>+ KV block"]
    G1["GPU 1<br/>seq block 1<br/>+ KV block"]
    G2["GPU 2<br/>seq block 2<br/>+ KV block"]
    G3["GPU 3<br/>seq block 3<br/>+ KV block"]
  end
  G0 -->|"pass KV"| G1
  G1 -->|"pass KV"| G2
  G2 -->|"pass KV"| G3
  G3 -->|"pass KV"| G0
  Note["Each GPU holds 1/N of the sequence;<br/>KV blocks rotate around the ring"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class G0,G1,G2,G3 blue
  class Note slate

Each device holds one block of the query sequence and, at any moment, one block of keys and values. While a device computes blockwise attention between its local queries and its current KV block, it simultaneously sends that block to the next device in the ring and receives the next one. After $N$ steps every query block has attended to every key block, and no device ever materialized the full matrix. The authors sized the blocks so that KV communication fully overlaps computation, which is what makes it free in wall-clock terms (Liu et al., 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, arXiv:2310.01889). The achievable sequence length scales with the number of devices, because each new device adds its memory to the pool rather than duplicating the whole sequence.
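
The rotation can be simulated on a single machine to confirm that it reproduces ordinary attention. The sketch below is a sequential, pure-Python stand-in under simplifying assumptions: no causal mask, no $1/\sqrt{d}$ scaling, no real communication, and a sequence length that divides evenly across devices. The function names are illustrative, and the running-softmax bookkeeping is the standard blockwise trick rather than the authors' code.

```python
import math
import random

def full_attention(queries, keys, values):
    """Reference: ordinary softmax attention with every score row materialized."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out

def ring_attention(queries, keys, values, num_devices):
    """Each simulated device keeps its query block and folds in one KV block per step."""
    block = len(queries) // num_devices      # assumes an even split
    dim_v = len(values[0])
    q_blocks = [queries[d * block:(d + 1) * block] for d in range(num_devices)]
    kv_blocks = [(keys[d * block:(d + 1) * block], values[d * block:(d + 1) * block])
                 for d in range(num_devices)]
    # Per-query running state: max score, softmax denominator, weighted value sum.
    state = [[(-math.inf, 0.0, [0.0] * dim_v) for _ in range(block)]
             for _ in range(num_devices)]
    for _ in range(num_devices):
        for d in range(num_devices):
            k_blk, v_blk = kv_blocks[d]
            for qi, q in enumerate(q_blocks[d]):
                run_max, denom, acc = state[d][qi]
                scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
                new_max = max(run_max, max(scores))
                rescale = math.exp(run_max - new_max)   # 0.0 on the first step
                denom *= rescale
                acc = [a * rescale for a in acc]
                for s, v in zip(scores, v_blk):
                    w = math.exp(s - new_max)
                    denom += w
                    acc = [a + w * x for a, x in zip(acc, v)]
                state[d][qi] = (new_max, denom, acc)
        # Every device hands its current KV block to the next device in the ring.
        kv_blocks = [kv_blocks[(d - 1) % num_devices] for d in range(num_devices)]
    return [[a / denom for a in acc] for dev in state for (_, denom, acc) in dev]

rng = random.Random(0)
n, dim = 8, 4
queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

reference = full_attention(queries, keys, values)
ringed = ring_attention(queries, keys, values, num_devices=4)
max_err = max(abs(a - b) for r, s in zip(reference, ringed) for a, b in zip(r, s))
print(max_err)
```

The largest difference between the two outputs is at floating-point noise level, even though no simulated device ever holds more than one block of scores at a time.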

[IMAGE: Animated-style sequence of a 4-GPU ring over 4 timesteps, showing the KV block "baton" rotating one hop per step while each GPU's local query block stays fixed, with the accumulating attention output highlighted]

By the Numbers

The headline figures for the position-encoding methods come from their respective papers. Treat the "max context" column as the length the authors demonstrated, not a quality guarantee.

Method	Year	Max context shown	Fine-tuning	Key mechanism
Position Interpolation	2023	32,768 (Llama 7B–65B)	~1,000 steps	Uniform position downscaling
NTK-aware scaling	2023	~8K–16K zero-shot	Often none	Base change, spares high freq
YaRN	2023	128K (Llama 2)	10x fewer tokens vs PI	Per-dimension ramp + logit temp
LongRoPE	2024	2,048K	Progressive, ≤1K steps at 256K	Searched non-uniform scaling

Sources: (Chen et al., 2023, arXiv:2306.15595); (Peng et al., 2023, arXiv:2309.00071); (Ding et al., 2024, arXiv:2402.13753).

The memory and compute scaling tells the other half of the story. The quadratic term is what Ring Attention attacks, and the per-device numbers are what make a million-token context physically possible.

Quantity	Scaling	What it means at length $n$
Attention compute	$O(n^2)$	Doubling context quadruples attention FLOPs
Attention memory (naive)	$O(n^2)$	The matrix that does not fit on one GPU
Attention memory (FlashAttention)	$O(n)$	Never materializes the full matrix in high-bandwidth memory
KV cache memory	$O(n)$ per layer	The dominant inference cost at long context
Ring Attention memory per device	$O(n/N)$	Drops with device count $N$
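
To put rough numbers on the quadratic row, the sketch below sizes a single head's full score matrix at 2 bytes per score and shows how an even split across devices shrinks the per-device share of the sequence. Both helper names and the 8-device example are illustrative assumptions.

```python
def attention_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes for one head's full n-by-n score matrix if it were stored whole."""
    return n_tokens * n_tokens * bytes_per_score

def ring_tokens_per_device(n_tokens, num_devices):
    """Sequence positions each device holds when the sequence is split evenly."""
    return n_tokens // num_devices

for n in (4096, 32768, 1_000_000):
    print(f"{n:>9} tokens: {attention_matrix_bytes(n) / 2**30:,.2f} GiB per head")

print(ring_tokens_per_device(1_000_000, 8))  # 125000
```

Going from 4,096 to 32,768 tokens multiplies the matrix by 64, from 32 MiB to 2 GiB per head, which is why the naive form stops being an option long before a million tokens.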

[IMAGE: Stacked bar of inference memory at 4K, 32K, and 128K context, breaking out model weights versus KV cache, showing the KV cache overtaking weights as context grows]

The KV cache deserves emphasis because it, not the attention matrix, is usually the binding constraint at inference time. Every token's keys and values must be stored for every layer for the entire generation, and that storage grows linearly with context. StreamingLLM attacks exactly this by keeping only a small window of recent tokens plus a few initial "attention sink" tokens, which let Llama-2 and others run on streams of up to four million tokens with up to a 22.2x speedup over sliding-window recomputation, though at the cost of forgetting the middle of the stream (Xiao et al., 2023, Efficient Streaming Language Models with Attention Sinks, arXiv:2309.17453).
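
A back-of-the-envelope calculation shows how quickly the cache grows. The model shape below (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is a hypothetical 7B-class configuration, not a figure from any of the cited papers, and `streaming_kept_positions` is only a toy illustration of the sink-plus-window retention policy with made-up defaults.

```python
def kv_cache_bytes(n_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every layer, at 2 bytes per value (fp16 or bf16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * n_tokens

def streaming_kept_positions(n_seen, num_sinks=4, window=8):
    """Positions a sink-plus-window cache retains: the first few and the most recent."""
    if n_seen <= num_sinks + window:
        return list(range(n_seen))
    return list(range(num_sinks)) + list(range(n_seen - window, n_seen))

# Hypothetical 7B-class shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
for n in (4096, 32768, 131072):
    print(f"{n:>7} tokens: {kv_cache_bytes(n, **shape) / 2**30:.0f} GiB")

print(streaming_kept_positions(20, num_sinks=4, window=3))  # [0, 1, 2, 3, 17, 18, 19]
```

For this shape the cache costs half a mebibyte per token, so it grows from 2 GiB at 4K to 64 GiB at 128K. The streaming policy caps it at a constant size, and the gap in the printed position list is the forgotten middle.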

A Concrete Example

Walk one extension end to end. Take a model with head dimension $d = 128$, trained at $L = 4{,}096$, RoPE base $b = 10{,}000$. The goal is 32,768 tokens, an extension factor $s = 8$.

First, look at two representative dimensions before extension. The fastest pair, $i = 0$, has $\theta_0 = 10000^{0} = 1$, so wavelength $\lambda_0 = 2\pi \approx 6.3$ tokens: it completes about 650 full rotations inside the 4,096-token window, so the model has seen its entire range many times over. The slowest pair, $i = 63$, has $\theta_{63} = 10000^{-126/128} \approx 1.15 \times 10^{-4}$, so $\lambda_{63} \approx 54{,}400$ tokens: it does not complete even one rotation inside 4,096, so the model has only ever seen about 7.5% of its circle.

Now apply each method to position $m = 30{,}000$:

Method	What happens to position 30,000	Effect on fast dim $(i=0)$	Effect on slow dim $(i=63)$
Naive extrapolation	Fed as 30,000 directly	Fine, every angle already seen many times	Angle far outside trained arc, scores leave the trained range
Position Interpolation	Rescaled to $30000/8 = 3{,}750$	Local distances compressed 8x, detail blurred	Now within trained arc
NTK-aware ($b' \approx 10000 \cdot 8^{128/126}$)	Base raised to ~83,000	Almost unchanged, detail preserved	Stretched to cover the new range
YaRN	Fast dim extrapolated, slow dim interpolated, ramp between	Untouched	Interpolated, plus logit temperature applied

The fast dimension carries "is this token 3 or 4 positions back," and you can see why YaRN wins: it leaves that fast dimension exactly untouched and brings the slow dimension fully into range, a split that NTK-aware scaling only approximates with a single base change. PI sacrifices the fast dimension; naive extrapolation leaves the slow dimension stranded outside its trained arc. After the remapping, the model still has to compute attention over 32,768 tokens. If that exceeds one GPU's memory, the sequence is split across, say, four devices with Ring Attention, each holding 8,192 query positions and rotating KV blocks until every query has seen every key. The position trick made the tokens interpretable; the ring made them fit.
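
The same walk-through can be reproduced numerically. This sketch uses the example's parameters and tracks two quantities: the slow pair's rotation angle at position 30,000 compared with the largest angle that pair reached during training, and the fast pair's angle step between adjacent tokens. The variable names are illustrative, and the YaRN entries assume the ramp puts the fastest pair fully on the "leave alone" side and the slowest fully on the "interpolate" side.

```python
D, L_TRAIN, BASE, SCALE, POS = 128, 4096, 10000.0, 8.0, 30000

def theta(i, base=BASE):
    """RoPE frequency of pair i for the given base."""
    return base ** (-2 * i / D)

slow = D // 2 - 1
ntk_base = BASE * SCALE ** (D / (D - 2))

# Largest angle (radians) the slowest pair ever reached during training.
trained_arc = (L_TRAIN - 1) * theta(slow)

slow_angle = {
    "naive": POS * theta(slow),
    "pi": (POS / SCALE) * theta(slow),
    "ntk": POS * theta(slow, ntk_base),
    "yarn": POS * theta(slow) / SCALE,   # slowest pair fully interpolated
}

# Angle between adjacent tokens in the fastest pair: the local resolution.
fast_step = {
    "naive": theta(0),
    "pi": theta(0) / SCALE,
    "ntk": theta(0, ntk_base),
    "yarn": theta(0),                    # fastest pair left untouched
}

print(f"trained arc of slow pair: {trained_arc:.3f} rad")
for method in slow_angle:
    inside = slow_angle[method] <= trained_arc
    print(f"{method:>5}: slow angle {slow_angle[method]:.3f} rad "
          f"(inside trained arc: {inside}), fast step {fast_step[method]:.3f} rad")
```

Only the naive row leaves the slow pair outside the arc it saw in training, and only the PI row shrinks the fast pair's step from 1 radian to 0.125.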

[IMAGE: Unit-circle diagram for one fast and one slow RoPE dimension, showing position 30,000's angle landing inside the trained arc under YaRN but outside it under naive extrapolation]

Where It Breaks

The most important failure mode is not a crash but a quiet degradation, and it predates all the scaling tricks. Even genuine long-context models attend unevenly across their window: performance is highest when the relevant information sits at the very start or the very end of the context and sags in the middle, producing a U-shaped accuracy curve (Liu et al., 2023, Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172). A model with a 128K window that places your key fact at token 60,000 may simply fail to use it, and no position-encoding trick fixes this, because the problem is in how attention was trained, not in how position is encoded.

[IMAGE: U-shaped accuracy curve from the Lost-in-the-Middle setup, retrieval accuracy on the y-axis against the position of the relevant document on the x-axis, with the trough labeled "lost in the middle"]

The second trap is benchmarking by the easiest possible test. The popular "needle in a haystack" retrieval test, finding one planted sentence in a long document, is nearly saturated and says little about real comprehension. When NVIDIA's RULER benchmark added multi-hop tracing and aggregation tasks across controlled lengths, it found that of the ten models in its initial evaluation, all claiming 32K context or more, only four (GPT-4, Command-R, Yi-34B, and Mixtral, as of that evaluation) sustained satisfactory performance at 32K (Hsieh et al., 2024, RULER: What's the Real Context Size of Your Long-Context Language Models?, arXiv:2404.06654). The advertised number is a ceiling on input length, not a floor on capability.

Method-specific failures matter too. Position Interpolation without enough fine-tuning leaves the model fluent but imprecise about local order. NTK-aware scaling at large extension factors eventually pushes the slow dimensions past where even the base change can help, and quality falls. YaRN is more robust but still assumes the original model had usable signal in every frequency band, which is not always true. StreamingLLM's attention-sink trick keeps the model running forever but genuinely discards the middle of the stream, so it is a fluency solution, not a memory solution. Ring Attention adds no quality loss but introduces a real engineering burden: its throughput depends on KV-block communication overlapping computation, so a slow interconnect turns the elegant overlap into a bottleneck.

Alternative Designs

Position rescaling is the dominant approach for RoPE-based models, but it is not the only way to reach long context, and the alternatives make different bets.

Approach	Strengths	Weaknesses	Best when
RoPE scaling (PI/YaRN/LongRoPE)	Cheap, no architecture change, large extension factors	Quality degrades unevenly; needs validation per length	Extending an existing RoPE model
Ring / sequence-parallel attention	Exact attention, near-linear memory per device, no quality loss	Needs many devices and fast interconnect	Training or serving very long exact context
Sparse / sliding-window attention	Sub-quadratic compute, simple	Cannot attend everywhere; loses long-range links	Streaming or locality-dominated tasks
Attention sinks (StreamingLLM)	Infinite streams, large speedup, no fine-tuning	Discards the middle; not true recall	Endless chat or log streams
Retrieval (RAG)	Unbounded corpus, cheap per query	Retrieval errors cascade; no global reasoning	Knowledge that exceeds any window
Linear-time architectures (SSMs, Mamba)	Linear scaling by construction	Different model, weaker exact recall so far	Building a long-context model from scratch

The honest framing is that these are complementary, not competing. A production system might extend a base model with YaRN, serve it with Ring Attention across a node of GPUs, and still wrap it in retrieval because no window is large enough for an entire codebase. Each layer addresses a different bottleneck.

How It Is Used in Practice

The clearest production signal is that these methods ship inside named models. LongRoPE is integrated into Microsoft's Phi-3 family, giving small models long context without a from-scratch long-context pretraining run. YaRN's released Llama 2 checkpoints at 64K and 128K were widely used as drop-in bases, and YaRN-style configurations now appear across open models, where a single rope_scaling block tells the serving stack which extension to apply. Hugging Face Transformers and vLLM both implement linear (PI), NTK, and YaRN scaling as selectable options, which is why extending a model's context is usually a configuration change, not a code change.
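
As an illustration of what such a block can look like, the sketch below builds a YaRN-style fragment for an 8x extension of a 4,096-token model. The key names follow the form commonly seen in Hugging Face-style `config.json` files, but the exact keys vary between library versions (older configs use `type` where newer ones use `rope_type`), and the values are hypothetical, so check the documentation for the version you run.

```python
import json

# Hypothetical values for an 8x extension of a model trained at 4,096 tokens.
rope_scaling = {
    "rope_type": "yarn",                       # some versions expect "type" instead
    "factor": 8.0,                             # extension factor s = L' / L
    "original_max_position_embeddings": 4096,  # trained length L
}

config_fragment = {
    "max_position_embeddings": 32768,          # target length L'
    "rope_scaling": rope_scaling,
}

print(json.dumps(config_fragment, indent=2))
```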

On the memory side, sequence parallelism in the spirit of Ring Attention is how frontier labs train on long documents at all; a million-token activation set does not fit on one accelerator, so the sequence is sharded by construction. FlashAttention (Dao et al., 2022, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, arXiv:2205.14135) is the substrate underneath all of this: by never writing the full attention matrix to high-bandwidth memory it makes per-device memory linear in sequence length, and Ring Attention is essentially FlashAttention's blockwise computation extended across devices instead of across on-chip tiles.

The operational lesson teams learn the hard way is to benchmark at the length they will actually deploy, with a task that resembles their workload, rather than trusting the spec sheet.
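
A starting point for that kind of check is a depth sweep: plant a fact at different positions in a prompt of the length you plan to deploy and record whether the model recovers it. The sketch below is deliberately minimal. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply, and `edge_only_stub` is a toy stand-in that reads only the start and end of its input, so the example runs without a real model. Single-fact retrieval is the easy case, so a real evaluation should swap in tasks shaped like the actual workload.

```python
def build_haystack(filler_sentences, needle, depth):
    """Place `needle` at a fractional depth: 0.0 is the start, 1.0 the end."""
    cut = round(depth * len(filler_sentences))
    return " ".join(filler_sentences[:cut] + [needle] + filler_sentences[cut:])

def recall_by_depth(ask_model, filler_sentences, needle, question, answer, depths):
    """Return {depth: True/False} for whether the reply contains the answer."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler_sentences, needle, depth) + "\n\n" + question
        results[depth] = answer in ask_model(prompt)
    return results

def edge_only_stub(prompt, edge=400):
    """Toy stand-in for a model: it only 'reads' the first and last `edge` characters."""
    visible = prompt[:edge] + prompt[-edge:]
    return "7481" if "7481" in visible else "I don't know"

filler = ["The sky was grey that day."] * 50
results = recall_by_depth(
    edge_only_stub,
    filler,
    needle="The magic number is 7481.",
    question="What is the magic number?",
    answer="7481",
    depths=(0.0, 0.5, 1.0),
)
print(results)  # {0.0: True, 0.5: False, 1.0: True}
```

The stub fails only at the middle depth, which is the pattern to look for when the same sweep is pointed at a real model and a real context length.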

Insights Worth Remembering

The deepest idea here is that RoPE turned position into a continuous, tunable function, and everything downstream is a different choice of that function. Once you see position encoding as a dial rather than a fixed property, "extending context" stops being mysterious.

Uniform scaling is the naive choice and the wrong one, because the high-frequency dimensions that encode local order have no slack while the low-frequency dimensions have plenty. Every method better than PI is, in essence, a smarter way to distribute the same interpolation budget unevenly across frequencies.

Position and memory are independent problems that get conflated under the single phrase "long context." A model can encode a million positions perfectly and still not fit them in memory, or hold them in memory and not understand them. Strong systems solve both, with different tools.

The advertised context window is a marketing ceiling, not a capability floor. The useful length is whatever your own task-shaped evaluation says it is, and it is frequently a fraction of the number on the box.

Streaming and recall are different goals. Attention sinks give infinite fluent generation by forgetting the middle; if your application needs the middle, that is the wrong tool no matter the token count.

Finally, retrieval and long context are not rivals. The largest practical window is still finite, so the question is rarely "context or retrieval" but "how much of each."

Open Questions

Several things are genuinely unsettled. It is established that current models degrade in the middle of long contexts; whether that is a fixable training problem or a deeper limitation of softmax attention's capacity to allocate probability across very long sequences remains open. Position-aware fine-tuning can flatten the U-shape, but no method has eliminated it.

The relationship between extension factor and quality is empirical, not theoretical. We can measure that YaRN holds up to 128K and LongRoPE demonstrates 2,048K, but there is no principled account of where a given model's frequencies stop carrying usable signal, so each extension is validated by benchmark rather than predicted.

Whether linear-time architectures will displace the RoPE-plus-Ring-Attention stack is unresolved. State-space models and their hybrids scale linearly by construction and sidestep position encoding entirely, but as of early-to-mid 2025 they trail attention on tasks requiring precise long-range recall. If that gap closes, the scaling-tricks edifice becomes transitional; if not, exact attention with clever position encoding remains the workhorse. The evidence is not yet decisive.