LoRA: Low-Rank Adaptation

Fine-tuning GPT-3 175B with Adam stores a full set of gradient and optimiser states: roughly 1.2 TB of GPU memory just for the parameter updates. For most practitioners, that number ends the conversation before it starts. LoRA (Low-Rank Adaptation) is the paper that changed the calculus: the same GPT-3 can be adapted with fewer than 4.7 million trainable parameters, a reduction of roughly 10,000x, with no additional cost at inference time.

The core idea: weight updates live in a low-rank subspace

LoRA's motivation comes from a hypothesis that Aghajanyan et al. (2021) formalised: fine-tuning a large pretrained model is, in practice, a low-dimensional optimisation. The model's representation space is so well-structured after pretraining that the task-specific update rarely needs the full rank of the weight matrix to be useful.

Given a pretrained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), LoRA constrains the update \(\Delta W\) to a product of two much smaller matrices:

\[\Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)\]

During training, \(W_0\) is frozen. Only \(A\) and \(B\) receive gradients. The modified forward pass for a linear layer is:

h = W_0 x + (B A) x * (alpha / r)

where alpha is a scaling hyperparameter (typically set equal to r, or kept fixed at a value like 16 while r is swept). The scaling term alpha/r keeps the effective learning rate of the adapter stable across different rank choices.

\(A\) is initialised with a random Gaussian; \(B\) is initialised to zero. This ensures \(\Delta W = 0\) at the start of training, so the model behaves identically to the base model on step zero.

Which matrices to adapt, and what rank to use

Hu et al. applied LoRA to the query and value projection matrices of every attention layer in the Transformer. In practice, adapting all four attention projections (Q, K, V, O) tends to give slightly better results at the same total parameter budget.

Rank `r`	Trainable params (GPT-3, Q+V only)	Typical use case
1	~0.3 M	Extreme compression; style transfer
4	~1.2 M	Often sufficient for domain adaptation
8	~2.4 M	Common default; good quality/size trade-off
64	~19 M	Approaches full fine-tuning quality on harder tasks

The "right" rank is task-dependent. For narrow tasks with limited data, r = 4 or r = 8 routinely matches full fine-tuning. For instruction tuning across a broad range of tasks, higher ranks or adapting more matrix types becomes necessary.

Merging weights at inference time

One of LoRA's most useful properties is that the adapter can be folded back into the base model before deployment:

\[W_{\text{merged}} = W_0 + BA \cdot \frac{\alpha}{r}\]

After this one-time merge, the model is architecturally identical to the original: no extra matrix multiplications, no extra memory, no branching logic in the forward pass. Inference latency and throughput are unchanged. When you need to swap tasks (e.g., serving multiple LoRA checkpoints on a single GPU), you can unmerge and re-merge a different adapter in milliseconds, or keep adapters separate and add them dynamically at forward-pass time using libraries such as HuggingFace PEFT.

Extensions worth knowing

QLoRA (Dettmers et al., 2023) stacks quantisation on top of LoRA. The base model is stored in 4-bit NormalFloat (NF4), while the LoRA adapters remain in full precision (bfloat16). Gradients flow through a frozen, dequantised base model and update only the adapters. The result: a 65B-parameter model fits on a single 48 GB GPU with negligible quality loss. QLoRA introduced double quantisation (quantising the quantisation constants themselves) and paged optimisers to handle memory spikes.

Rank-Stabilised LoRA (rsLoRA) changes the scaling from alpha/r to alpha/sqrt(r). This prevents the adapter's effective magnitude from shrinking as rank increases, which makes high-rank LoRA training more stable and removes a hyperparameter sensitivity.

LoRA+ and AdaLoRA take different angles on allocation: LoRA+ sets separate learning rates for \(A\) and \(B\) (finding that \(B\) benefits from a higher rate), while AdaLoRA distributes the rank budget across layers adaptively using singular value decomposition, pruning low-importance singular values during training.

A minimal PEFT setup with HuggingFace looks like this:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 2,359,296 || all params: 6,740,549,632 || trainable%: 0.035

The lora_alpha/r ratio controls the effective step size of the adapter update; a ratio of 2 (e.g., alpha=16, r=8) is a common starting point.

When it falls down

Rank underestimation on hard tasks. For tasks that require learning genuinely new knowledge rather than steering existing representations (e.g., a model that has never seen a specialised scientific vocabulary), small ranks lose meaningful signal. Full fine-tuning or higher-rank adapters may be necessary.

Layer coverage matters. Applying LoRA only to attention projections and ignoring MLP layers leaves a significant fraction of each layer's representational capacity unaddressed. Papers on instruction tuning have shown gains from adding LoRA to MLP feed-forward weights, especially for reasoning-heavy tasks.

Scaling alpha carelessly. Changing r without adjusting alpha changes the effective learning rate of the adapter, which can cause training instability or very slow convergence. rsLoRA mitigates this, but vanilla LoRA configurations copied from a tutorial without re-tuning alpha often underperform.

Catastrophic forgetting is not eliminated. LoRA reduces forgetting relative to full fine-tuning, but it does not remove it. If the fine-tuning distribution is very narrow, the adapter can overfit while the base model's general capabilities degrade slightly when the weights are merged.

Adapter composition is non-trivial. Merging multiple LoRA adapters trained on different tasks (via weighted averaging) often works, but there is no guarantee: adapters trained with different base models, different ranks, or on conflicting distributions can produce interference. Methods like TIES-Merging and DARE address this but add complexity.

Not free for all architectures. LoRA is clean to apply to linear projection layers. Applying it to convolutional layers, embedding tables, or normalisation layers is possible but less standard and requires care about which dimensions to decompose.

The core idea: weight updates live in a low-rank subspace

Which matrices to adapt, and what rank to use

Merging weights at inference time

Extensions worth knowing

When it falls down

Further reading