LoRA for Long-Context Adaptation

A 7B model trained on 4 096-token sequences cannot simply be asked to process 32 000 tokens: the model never saw its rotary position embeddings (RoPE) at angles beyond the training range, and the attention matrices grow quadratically with sequence length. Fine-tuning the full model to fix this costs tens of thousands of GPU-hours. LongLoRA (Chen et al., 2023) demonstrated that a carefully structured LoRA approach can extend Llama-2 7B from 4 096 to 100 000 tokens on a single 8×A100 node, roughly a 25× increase in context length, but only after identifying two non-obvious failure modes that vanilla LoRA misses entirely.

Why vanilla LoRA under-delivers on long context

Standard LoRA freezes all pretrained weights and learns two small matrices, \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times d}\), whose product is added to each target projection:

\[W' = W_0 + \Delta W = W_0 + BA\]

The rank \(r \ll d\) keeps parameter count low. For task adaptation this works well because the base model already "knows" the task distribution; LoRA nudges the residual.
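The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the helper names (`matmul`, `lora_effective_weight`, `lora_param_counts`) are hypothetical, the toy size d = 6 is chosen only for readability, and the `scale` argument stands in for the scaling factor that LoRA implementations usually apply to the product. B is shaped d×r and A is shaped r×d so that BA matches the d×d shape of the frozen weight; B starts at zero, as in the original LoRA recipe, so the adapted weight equals the pretrained one before any training.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W0, B, A, scale=1.0):
    """Return W0 + scale * (B @ A) without modifying the frozen W0."""
    delta = matmul(B, A)  # (d x r) @ (r x d) -> d x d
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]

def lora_param_counts(d, r):
    """Trainable parameters: a full d x d update vs. the two LoRA factors."""
    return d * d, 2 * d * r

d, r = 6, 2  # toy sizes, assumed for illustration only
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # r x d, random init
B = [[0.0] * r for _ in range(d)]                                # d x r, zero init

W_eff = lora_effective_weight(W0, B, A)   # equals W0 while B is all zeros
full, lora = lora_param_counts(4096, 8)   # hidden size 4096, rank 8
print(f"full update: {full} params, LoRA factors: {lora} params")
```

For a 4 096-wide projection at rank 8, the two factors hold 65 536 trainable values against 16 777 216 for a full update, which is the saving that makes LoRA attractive in the first place.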

Long-context adaptation is different. The model has never encountered position indices beyond the training cutoff. RoPE encodes relative distance by rotating query and key vectors by angles \(\theta_i \cdot m\), where \(m\) is the position and \(\theta_i\) decreases geometrically with the dimension-pair index \(i\) within each head. At positions far outside the training range, those cosine/sine values enter regions the model was never optimised over. Low-rank updates to attention projections cannot compensate for this because the error is in the positional geometry, not the weight subspace.
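A short calculation makes the out-of-range problem concrete. The sketch below assumes the common RoPE parameterisation \(\theta_i = 10000^{-2i/d_\text{head}}\) with a head dimension of 128 (the Llama-2 7B value); the function names are hypothetical. It looks only at the lowest-frequency pair, because the fast-rotating pairs wrap around many times within the training window while the slow ones do not.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def max_angle(theta, max_position):
    """Largest rotation angle (radians) a pair reaches up to max_position."""
    return theta * max_position

thetas = rope_frequencies(head_dim=128)  # assumed head size
slowest = thetas[-1]                     # lowest-frequency pair

trained = max_angle(slowest, 4096 - 1)     # last position seen in training
extended = max_angle(slowest, 32_000 - 1)  # last position at the target length
print(f"slowest pair: {trained:.2f} rad in training, {extended:.2f} rad at 32k")
print(f"full turn is {2 * math.pi:.2f} rad")
```

Under these assumptions the slowest pair covers well under one radian during training, so every angle it produces beyond position 4 095 is one the model has never been optimised on.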

Two independent but complementary fixes are therefore needed:

1. Position rescaling: bring out-of-range position indices back into a familiar angular neighbourhood.
2. Efficient attention during training: make the quadratic attention cost tractable so the model actually sees long sequences.
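To give a sense of scale for the second point, here is a rough back-of-the-envelope sketch of the memory needed to hold one layer's full attention score matrices for a single sequence. The figures are assumptions for illustration: 32 heads (as in Llama-2 7B), 2 bytes per value (fp16/bf16), and scores that are fully materialised. Fused attention kernels avoid storing this matrix, but the compute still scales quadratically.

```python
def attention_score_bytes(seq_len, num_heads=32, bytes_per_value=2):
    """Bytes for one layer's full attention score matrices (one sequence)."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (4096, 32_000, 100_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>7} tokens: {gib:8.1f} GiB per layer")
```

Doubling the sequence length quadruples this figure, which is why long sequences need a cheaper attention pattern during training.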

Position interpolation: the prerequisite

Before applying any LoRA, the position indices must be remapped. The simplest approach, proposed by Chen et al. (2023, "Extending Context Window of Large Language Models via Positional Interpolation"), is linear interpolation: divide every position index by the ratio \(s = L_\text{new} / L_\text{orig}\), so the maximum index seen during fine-tuning maps to \(L_\text{orig}\) rather than \(L_\text{new}\).

\[m' = m / s\]

This compresses previously unseen positions back into the angle range the model trained on. The cost is a small blurring of nearby-token distinctions - positions that were once distinct now share a tighter angular neighbourhood. A brief LoRA fine-tuning (roughly 1 000 steps) corrects for this blurring.
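The remapping \(m' = m / s\) is a one-liner. The sketch below uses a hypothetical helper, `interpolate_positions`, and assumes a target length of 32 768 so that the scale factor is exactly 8; it shows both effects described above: the largest remapped index stays inside the original window, and adjacent tokens end up 1/s apart instead of 1.

```python
def interpolate_positions(seq_len, orig_len, new_len):
    """Linearly rescale positions 0..seq_len-1 by s = new_len / orig_len."""
    s = new_len / orig_len
    return [m / s for m in range(seq_len)]

orig_len, new_len = 4096, 32768  # assumed lengths, giving s = 8
scaled = interpolate_positions(new_len, orig_len, new_len)

gap = scaled[1] - scaled[0]  # spacing between neighbouring tokens
print(f"largest remapped position: {scaled[-1]}")
print(f"spacing between neighbours: {gap}")
```

The fractional positions are then fed to RoPE in place of the integer indices. The shrunken spacing is the "blurring" that the short fine-tuning run has to correct.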

Without the interpolation step, LoRA fine-tuning on long sequences shows near-random perplexity on the extended portion of the context even after thousands of gradient steps - the model cannot learn to interpret angles it has never seen via low-rank residuals alone.
