LoRA for Long-Context Adaptation
Extending a model's context window via LoRA requires coordinating low-rank weight updates with position-encoding rescaling, and ignoring either side reliably degrades performance on long sequences.
A 7B model trained on 4 096-token sequences cannot simply be asked to process 32 000 tokens: the model never saw rotary position embedding (RoPE) rotation angles beyond the training range, and the attention score matrices grow quadratically with sequence length. Fine-tuning the full model to fix this costs tens of thousands of GPU-hours. LongLoRA (Chen et al., 2023) demonstrated that a carefully structured LoRA approach can extend Llama-2 7B from 4 096 to 100 000 tokens on a single 8×A100 node, a context roughly 25× longer, but only after identifying two non-obvious failure modes that vanilla LoRA misses entirely.
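The two pressures in the paragraph above can be put into rough numbers. The sketch below is illustrative only: it assumes Llama-2-style RoPE settings (head dimension 128, base 10 000), fp16 attention scores, and uses linear position interpolation as one common form of position rescaling, which is not necessarily the scheme this article goes on to use. The function names are hypothetical.

```python
def rope_max_angle(seq_len, pair_index, head_dim=128, base=10000.0, position_scale=1.0):
    """Largest rotation angle (radians) RoPE applies to one dimension pair.

    head_dim and base are assumed Llama-2-style defaults; position_scale < 1
    models linear position interpolation (positions are compressed).
    """
    inv_freq = base ** (-2.0 * pair_index / head_dim)
    return (seq_len - 1) * position_scale * inv_freq


def attention_scores_gib(seq_len, bytes_per_value=2):
    """Memory of a single head's seq_len x seq_len score matrix, in GiB (fp16 assumed)."""
    return seq_len * seq_len * bytes_per_value / 2**30


train_len, target_len = 4096, 32000

# Fastest-rotating pair (index 0): the angle equals the position index itself.
seen = rope_max_angle(train_len, 0)
unseen = rope_max_angle(target_len, 0)
rescaled = rope_max_angle(target_len, 0, position_scale=train_len / target_len)
print(f"max angle seen in training:    {seen:.1f} rad")
print(f"max angle at target length:    {unseen:.1f} rad")
print(f"after linear interpolation:    {rescaled:.1f} rad")

# Quadratic growth of the score matrix, per head and per layer.
ratio = attention_scores_gib(target_len) / attention_scores_gib(train_len)
print(f"score matrix grows by a factor of {ratio:.1f}")
```

Compressing positions by `train_len / target_len` keeps every angle inside the range covered during pretraining, at the cost of packing neighbouring tokens closer together in angle space, which is why some fine-tuning is still needed. The score matrix, meanwhile, grows about 61× for a 7.8× longer sequence.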
Why vanilla LoRA under-delivers on long context
Standard LoRA freezes all pretrained weights and learns two small matrices, \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times d}\) with rank \(r \ll d\), whose product is added to each target projection:
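A minimal NumPy sketch of that update, assuming the shapes above mean the product \(AB\) (a \(d \times d\) matrix of rank at most \(r\)) is added to a frozen square weight \(W\) acting on row vectors. The toy sizes, the zero initialization of \(B\), and the `scale` argument are illustrative assumptions rather than details from the text; the original LoRA paper writes the same update as \(BA\) with the factor shapes swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # toy sizes chosen for illustration; real models use d in the thousands

W = rng.standard_normal((d, d))         # frozen pretrained projection (never updated)
A = rng.standard_normal((d, r)) * 0.01  # trainable factor, d x r
B = np.zeros((r, d))                    # trainable factor, r x d; zero init so training starts from W


def lora_forward(x, W, A, B, scale=1.0):
    """Row-vector convention: y = x W + scale * (x A) B.

    Computing (x A) B avoids ever materializing the d x d product A B.
    """
    return x @ W + scale * (x @ A) @ B


x = rng.standard_normal((3, d))         # batch of 3 token vectors
y = lora_forward(x, W, A, B)

# With B = 0 the adapted layer reproduces the frozen layer exactly.
print(np.allclose(y, x @ W))

# Trainable parameters vs. the full matrix.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because only `A` and `B` receive gradients, the trainable parameter count per projection is \(2dr\) instead of \(d^2\).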