LoRA vs Full Fine-Tuning

Fine-tuning GPT-3 (175B parameters) in full precision means holding the weights, gradients, and Adam optimiser state for every single weight: at float32 that is about 16 bytes per parameter, or roughly 2.8 TB of GPU memory before activations. That number alone explains why most teams never attempted it. LoRA (Low-Rank Adaptation) changed the calculus by reducing trainable parameters by up to 10,000x, but the technique is not free of trade-offs. Understanding exactly what each approach does, and where each breaks down, is one of the most useful mental models you can build for applied fine-tuning work.
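The 2.8 TB figure can be reproduced with back-of-envelope arithmetic. The sketch below assumes an Adam-style optimiser (two float32 moments per weight) on top of float32 weights and gradients, and ignores activations; the function name and the byte breakdown are illustrative assumptions, not taken from any particular framework.

```python
def full_finetune_bytes(n_params, bytes_per_value=4, optimizer_moments=2):
    """Rough training-state memory: weights + gradients + optimiser moments.

    Assumes every tensor is stored at the same precision and ignores
    activations, buffers, and framework overhead.
    """
    copies = 1 + 1 + optimizer_moments  # weights, gradients, Adam first/second moments
    return n_params * bytes_per_value * copies


gpt3_bytes = full_finetune_bytes(175_000_000_000)
print(f"{gpt3_bytes / 1e12:.1f} TB")
```

Changing `bytes_per_value` or `optimizer_moments` shows how sensitive the total is to precision and optimiser choice.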

What full fine-tuning actually does

In full fine-tuning you start from a pretrained checkpoint W and update every weight matrix through standard gradient descent. After training, the checkpoint contains W + DW, where DW is a dense update of the same shape as W. For a single 4096x4096 attention projection in a 7B model, that is about 16.8 million floats (16,777,216). Multiply across all layers and you see why every fine-tuned checkpoint is as large as the base model and why GPU memory balloons.

The key property of full fine-tuning is that DW is unconstrained. The model can, in principle, learn any update that gradient descent finds useful, limited only by your data and optimisation settings. That flexibility is a genuine advantage for tasks that are very far from the pretraining distribution, or where the training signal is rich enough to move the weights in many directions at once.

The cost is not just memory at training time. Serving multiple fine-tuned variants means storing separate full copies of the model. A team with 20 task-specific variants of a 70B model needs 20 complete checkpoints.

The LoRA constraint: why it works

LoRA (Hu et al., 2021) freezes all original weights and, for selected weight matrices W in R^(d x k), adds a side path:

output = W * x  +  (B * A) * x * (alpha / r)

where A is in R^(r x k), B is in R^(d x r), and r << min(d, k). A is initialised from a Gaussian; B is initialised to zero so the adapter contributes nothing at the start of training. Only A and B are updated.
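The side path can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions: the matrix sizes, the 0.02 standard deviation for A, and the helper names `matvec` and `lora_forward` are all made up for the example and do not come from a specific library.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is treated as frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # project down to r dims, then back up to d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4  # toy sizes; real layers are in the thousands

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # Gaussian init, r x k
B = [[0.0] * r for _ in range(d)]                                  # zero init, d x r
x = [random.gauss(0, 1) for _ in range(k)]

# With B = 0 the adapter path is exactly zero, so the output matches the base model.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

Computing `B (A x)` rather than forming `B A` first keeps the extra work proportional to r * (d + k) per input vector.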

The ratio alpha/r scales the adapter's contribution, so in practice it behaves much like a learning-rate multiplier for the adapter. A common default is alpha = 2r, which keeps the scale at 2.0 regardless of rank. The trainable parameter count for one matrix becomes r * (d + k) instead of d * k. At r = 8 on a 4096x4096 matrix, that is 65,536 parameters vs. 16,777,216: a 256x reduction.
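The arithmetic for the 4096x4096 example can be checked directly. The helper names below are hypothetical; the count simply adds the entries of A (r x k) and B (d x r).

```python
def dense_params(d, k):
    """Entries in the full d x k weight matrix."""
    return d * k


def lora_params(d, k, r):
    """Entries in A (r x k) plus entries in B (d x r)."""
    return r * k + d * r


d = k = 4096
r = 8
alpha = 2 * r  # the common default mentioned in the text

print(lora_params(d, k, r))                        # trainable adapter entries
print(dense_params(d, k) // lora_params(d, k, r))  # reduction factor
print(alpha / r)                                   # adapter scale
```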

Why does this not destroy quality? The intrinsic dimension hypothesis (Aghajanyan et al., 2020) offers the usual explanation: pretrained language models appear to have a very low intrinsic dimension for many downstream tasks, meaning the weight updates needed for fine-tuning live in a much smaller subspace than the full parameter space. A rank-8 or rank-16 update can capture that subspace well enough for most tasks.

The practical consequence is that LoRA adapters are tiny (often 10-100 MB vs. 10-100 GB for the base model), can be swapped in and out of a frozen base at inference time, and can be merged back into the weights with a single matrix addition when you need a production deployment with no added inference latency.
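Merging an adapter amounts to adding the scaled product B A to the frozen matrix. The sketch below uses toy sizes and random stand-ins for a trained adapter; `matmul` and `merge_lora` are hypothetical helper names. It checks that the merged matrix gives the same output as the side-path form, up to floating-point rounding.

```python
import math
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A, a dense matrix with the same shape as W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]


random.seed(1)
d, k, r, alpha = 4, 3, 2, 4  # toy sizes chosen for illustration
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # stand-in for a trained B
x = [[random.gauss(0, 1)] for _ in range(k)]                    # column vector, k x 1

base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
side_path = [w[0] + (alpha / r) * u[0] for w, u in zip(base_out, adapter_out)]
merged = [row[0] for row in matmul(merge_lora(W, A, B, alpha, r), x)]

# Same result either way, so the merged model needs no extra matmul at inference.
assert all(math.isclose(s, m, rel_tol=1e-9, abs_tol=1e-12) for s, m in zip(side_path, merged))
```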

Practical comparison across the dimensions that matter

Dimension	Full fine-tuning	LoRA
Trainable parameters	All (100%)	~0.1-1%
GPU memory (7B, bfloat16)	~120 GB (AdamW)	~16 GB (r=16)
Checkpoint delta size	Full model copy	Tens of MB
Multi-task serving	Separate copies	Shared base + adapters
Tasks far from pretraining	Best	Worse at very low rank
Catastrophic forgetting	More risk at high LR	Less risk (frozen base)
Inference latency	Identical to base	Zero overhead if merged

QLoRA (Dettmers et al., 2023) pushes LoRA further by quantising the frozen base model to 4-bit NF4, reducing the footprint to roughly 5 GB for a 7B model. The adapter weights themselves remain in bfloat16. This makes it possible to fine-tune a 65B model on a single 48 GB GPU, something that was practically impossible with full fine-tuning at consumer scale.
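A rough weights-only estimate shows why 4-bit storage changes what fits on one GPU. This sketch counts only the frozen base weights; it ignores quantisation constants, adapter weights, optimiser state, and activations, all of which push the real footprint above these raw numbers. The function name is a hypothetical helper.

```python
def weight_bytes(n_params, bits_per_weight):
    """Storage for the base weights alone at a given bit width."""
    return n_params * bits_per_weight / 8


GB = 1e9
for name, n_params in (("7B", 7_000_000_000), ("65B", 65_000_000_000)):
    bf16 = weight_bytes(n_params, 16) / GB
    nf4 = weight_bytes(n_params, 4) / GB
    print(f"{name}: {bf16:.1f} GB at 16-bit, {nf4:.1f} GB at 4-bit (weights only)")
```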

Where the rank constraint actually matters

Choosing rank r is the central decision in LoRA. Too low and the adapter cannot represent the required update; too high and you give back part of LoRA's memory and storage savings without much extra quality.

Empirically:
- r = 4 to 8 covers most instruction-following or domain-adaptation tasks.
- r = 16 to 64 is appropriate for code, mathematics, or significant style change.
- r >= 128 raises adapter memory and compute noticeably for marginal quality gain in most benchmarks.
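To put the ranks above in perspective, the following sketch tabulates the trainable parameters for a single 4096x4096 matrix at each rank. It covers only per-matrix parameter counts, not optimiser or activation memory, and `lora_fraction` is a hypothetical helper.

```python
def lora_fraction(r, d=4096, k=4096):
    """Trainable LoRA parameters as a fraction of the dense d x k matrix."""
    return r * (d + k) / (d * k)


for r in (4, 8, 16, 64, 128):
    n_params = r * (4096 + 4096)
    print(f"r={r:>3}: {n_params:>9,} params, {lora_fraction(r):.2%} of the dense matrix")
```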

The alpha/r scaling interacts with your learning rate. Many practitioners hold alpha = 2r and tune only the learning rate, typically between 1e-4 and 5e-4 for 7B models. Changing alpha and the learning rate at the same time makes it hard to attribute effects.

Another practical question: which modules to target? The original paper applies LoRA only to the query and value projection matrices (Wq, Wv). Later work, including ablations in the QLoRA paper, shows that targeting all four attention projections (Wq, Wk, Wv, Wo) and the MLP blocks yields consistent improvement at modest extra cost.
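The extra cost of targeting more modules can be estimated with a short calculation. The shapes below (hidden size 4096, MLP size 11008, 32 layers) are an assumption modelled on a typical 7B decoder, and the module names are illustrative labels rather than identifiers from a specific codebase.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32  # assumed 7B-style shapes

# (d, k) for each linear layer in one transformer block; names are illustrative.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}


def lora_trainable_params(targets, r):
    """Total adapter parameters when each target module gets a rank-r adapter."""
    per_layer = sum(r * (d + k) for d, k in (MODULE_SHAPES[name] for name in targets))
    return per_layer * LAYERS


qv_only = lora_trainable_params(["q_proj", "v_proj"], r=16)
everything = lora_trainable_params(list(MODULE_SHAPES), r=16)
print(f"q,v only: {qv_only:,}")
print(f"all attention + MLP: {everything:,} ({everything / qv_only:.1f}x)")
```

Under these assumptions, even the all-modules configuration stays well under 1% of a 7B-parameter model.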

When it falls down

Tasks requiring broad weight redistribution. LoRA's subspace is fixed by the choice of which weight matrices you target and the rank. If the target task requires updates that are not low-rank relative to those matrices, LoRA will underfit. This is most visible in tasks that are structurally novel rather than semantically close to pretraining. Continual learning across many tasks can also expose the limitation: each adapter occupies a fixed subspace, and interference between tasks grows as you add more.

Domains the base model covers poorly. The frozen base weights do not adapt. If pretraining covered your domain poorly (rare languages, niche technical notation, specialised reasoning patterns), additional LoRA training is unlikely to compensate, because the representations that the frozen layers produce are the input to every adapter.

Catastrophic forgetting is still possible, just less obvious. Because the base is frozen, LoRA does not overwrite pretrained behaviour. But if you merge the adapter back into the base weights and then do further fine-tuning, the merged model is fully mutable again. Teams that iterate this way inadvertently expose themselves to the same forgetting risks as full fine-tuning.

Quantisation-induced approximation error compounds. In QLoRA, the base model is dequantised on the fly during each forward pass. The NF4 approximation introduces a small but nonzero error. For most tasks this is negligible, but for tasks requiring very precise numerical output (e.g. arithmetic, table-to-text with exact figures), the cumulative approximation can degrade output quality in ways that are hard to diagnose.
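The nature of the error can be illustrated with a toy round trip. Note that this sketch deliberately uses plain symmetric absmax quantisation to 4-bit integers, not NF4, and the weight values are made up; it only shows that quantise-then-dequantise returns values close to, but not equal to, the originals.

```python
def quantize_4bit_absmax(values):
    """Map floats to integer codes in [-7, 7] using one absmax scale (not NF4).

    Assumes at least one value is nonzero.
    """
    scale = max(abs(v) for v in values) / 7
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.31, -0.12, 0.05, 0.27, -0.44, 0.09, -0.02, 0.18]  # made-up values
codes, scale = quantize_4bit_absmax(weights)
restored = dequantize(codes, scale)
errors = [abs(w - q) for w, q in zip(weights, restored)]

print("codes:", codes)
print(f"max abs error: {max(errors):.4f} (bounded by scale / 2 = {scale / 2:.4f})")
```

Each weight is off by at most half a quantisation step, and every frozen layer in a forward pass carries an error of this kind.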

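Rank mismatch at merge time. Merging computes W + (alpha / r) * B * A, which reproduces the adapter exactly up to floating-point precision, so the merged model inherits whatever the adapter failed to learn. If you train with a rank that is too low and then merge and evaluate, the evaluation result reflects that low-rank approximation, not the ideal update. Teams sometimes see a gap between training-time adapter perplexity and post-merge evaluation scores. If the gap is large and is not explained by a dtype or quantisation difference between the training and merged models, rank is the next thing to revisit.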

The "rank-1 catastrophe" at very low rank. At r = 1, the adapter learns a single outer product: one direction in output space scaled by one direction in input space. Some tasks exhibit near-random performance at r = 1 even when r = 4 works well. This is not always obvious from validation loss alone; prefer a task-specific evaluation metric.
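The geometry of a rank-1 adapter is easy to see in a toy example. In the sketch below, `b` and `a` are made-up vectors standing in for the single column of B and the single row of A; whatever the input, the adapter's contribution is a scalar multiple of `b`.

```python
import math


def outer(b, a):
    """Rank-1 matrix b a^T as a list of rows."""
    return [[bi * aj for aj in a] for bi in b]


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


b = [1.0, -2.0, 0.5]        # the only output direction the adapter can write to
a = [0.25, 0.0, -1.0, 3.0]  # the only input direction the adapter can read from
delta = outer(b, a)         # 3 x 4 update with rank 1

for x in ([1.0, 2.0, 3.0, 4.0], [-2.0, 0.5, 1.0, 0.0]):
    coeff = sum(ai * xi for ai, xi in zip(a, x))  # scalar: how much of `a` is in x
    out = matvec(delta, x)
    # The update is always coeff * b, so different inputs only change its length and sign.
    assert all(math.isclose(o, coeff * bi, abs_tol=1e-12) for o, bi in zip(out, b))
```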
