
LoRA for Long-Context Adaptation

Extending a model's context window via LoRA requires coordinating low-rank weight updates with position-encoding rescaling, and ignoring either side reliably degrades performance on long sequences.


A 7B model trained on 4 096-token sequences cannot simply be asked to process 32 000 tokens: the model never saw rotary position embedding (RoPE) angles beyond its training range, and the attention matrices grow quadratically with sequence length. Fine-tuning the full model to fix this costs tens of thousands of GPU-hours. LongLoRA (Chen et al., 2023) demonstrated that a carefully structured LoRA approach can extend Llama-2 7B from 4 096 to 100 000 tokens on a single 8×A100 node, a roughly 25× longer context, but only after identifying two non-obvious failure modes that vanilla LoRA does not address.
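To make the out-of-range problem concrete, here is a minimal NumPy sketch that compares RoPE angles at 32 000 positions with and without rescaling. It assumes linear position interpolation as the rescaling scheme, which is one common choice and not necessarily the one this article has in mind. The helper `rope_angles` and the values `head_dim=128` and `base=10000.0` are illustrative assumptions.

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=10000.0):
    """Return RoPE rotation angles, shape (len(positions), head_dim // 2).

    head_dim and base are assumed values chosen for illustration.
    """
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

train_len, target_len = 4096, 32000  # lengths from the example above
positions = np.arange(target_len, dtype=np.float64)

# Unscaled: positions past train_len produce angles never seen in training.
raw = rope_angles(positions)

# Linear position interpolation (assumed scheme): squeeze positions into [0, train_len).
interp = rope_angles(positions * train_len / target_len)

# Upper bound on the angle each frequency could reach during training.
seen_limit = rope_angles(np.array([train_len]))[0]

print("unscaled exceeds trained range:", bool((raw.max(axis=0) > seen_limit).all()))
print("rescaled stays within range:", bool((interp.max(axis=0) < seen_limit).all()))
print("attention score entries grow by a factor of", (target_len / train_len) ** 2)
```

Rescaling keeps every angle inside the trained range, but it also packs neighbouring positions closer together than the model saw in training, which is one reason some fine-tuning is still needed afterwards.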

Why vanilla LoRA under-delivers on long context

Standard LoRA freezes all pretrained weights and learns two small matrices, \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times d}\) with rank \(r \ll d\), whose product \(AB\) is added to each target projection.
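A minimal NumPy sketch of that low-rank update follows, using the shapes given above. The sizes `d` and `r`, the `scale` factor, the row-vector convention, and zero-initializing `B` are assumptions drawn from common LoRA practice, and the function name `lora_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # hidden size and LoRA rank (illustrative values)

W = rng.standard_normal((d, d)).astype(np.float32)           # frozen pretrained projection
A = (rng.standard_normal((d, r)) * 0.01).astype(np.float32)  # trainable, shape (d, r)
B = np.zeros((r, d), dtype=np.float32)                       # trainable, shape (r, d); zero init is an assumption

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ (W + scale * A @ B) without forming the d x d update.

    x has shape (batch, d); the row-vector convention is an assumption.
    """
    return x @ W + scale * ((x @ A) @ B)

x = rng.standard_normal((4, d)).astype(np.float32)
y = lora_forward(x, W, A, B)

full_params = W.size           # d * d weights in the frozen projection
lora_params = A.size + B.size  # 2 * d * r trainable weights
print("output shape:", y.shape)
print("trainable fraction:", lora_params / full_params)
```

Because `B` starts at zero in this sketch, the adapted projection initially matches the frozen one exactly, and only the `2 * d * r` adapter weights would receive gradients during training.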
