Applied LLMs

The Memory Maths of Fine-Tuning

Fine-tuning a 7B-parameter model with full gradients and Adam state consumes roughly 112 GB of GPU memory; parameter-efficient methods cut that by an order of magnitude by training only a tiny fraction of weights.

intermediate · 8 min read

A 7B-parameter model stored in float32 occupies about 28 GB just for the weights. Full fine-tuning with Adam adds gradients (another 28 GB) plus first- and second-moment buffers (56 GB more), totalling roughly 112 GB before you allocate a single activation. That needs at least two 80 GB A100s (or three of the 40 GB variant) for one model that runs comfortably on a single card for inference. Parameter-efficient fine-tuning (PEFT) exists to close that gap.

Where the memory actually goes

Training memory has four distinct residents:

| Component | Size (float32) | Notes |
| --- | --- | --- |
| Model weights | 4 bytes × P | Frozen in PEFT; stored in lower precision where possible |
| Gradients | 4 bytes × P_trainable | Only over trainable parameters |
| Optimiser state | 8 bytes × P_trainable | Adam: first and second moments |
| Activations | Varies with batch size and sequence length | Checkpoint to trade compute for memory |

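With full fine-tuning P_trainable = P. With LoRA on a 7B model, P_trainable is typically around 0.1 % of P (a few million parameters), so the gradient and optimiser rows shrink by a factor of roughly 1,000. That is the numerical core of the PEFT argument. The weights row does not shrink at all, which is why the frozen base becomes the dominant cost.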
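The table's arithmetic can be checked with a few lines of Python. This is a rough sketch, not a profiler: the function name `training_memory_gb` is made up for illustration, the 0.1 % trainable fraction is an assumed figure, and activations are left out because they depend on batch size and sequence length.

```python
def training_memory_gb(n_params, trainable_fraction=1.0,
                       weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Estimate weight, gradient and Adam-state memory in GB (10^9 bytes).

    Activations are deliberately excluded from this estimate.
    """
    trainable = n_params * trainable_fraction
    parts = {
        "weights": n_params * weight_bytes,     # every parameter is stored
        "gradients": trainable * grad_bytes,    # only trainable parameters
        "optimiser": trainable * optim_bytes,   # Adam: two moments per parameter
    }
    parts["total"] = sum(parts.values())
    return {name: size / 1e9 for name, size in parts.items()}


full = training_memory_gb(7e9)                            # full fine-tuning
lora = training_memory_gb(7e9, trainable_fraction=0.001)  # assumed 0.1 % trainable

for name in ("weights", "gradients", "optimiser", "total"):
    print(f"{name:>10}: full {full[name]:8.3f} GB   lora {lora[name]:8.3f} GB")
```

Under these assumptions the gradient and optimiser terms all but vanish, while the 28 GB of float32 weights remains untouched.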

LoRA: low-rank decomposition of weight updates

The insight behind LoRA (Hu et al., 2021) is that the weight update matrix dW accumulated during fine-tuning has low intrinsic rank. Rather than learning dW directly, LoRA parameterises it as a product of two thin matrices:

W' = W + dW = W + B @ A

where W is (d_out, d_in), A is (r, d_in), and B is (d_out, r), with rank r much smaller than d_in or d_out. Typical values: r = 4, 8, or 16 against d_in ~ 4096. W stays frozen. Only A and B are trained.

Parameter count for a single weight matrix:

Full update:  d_out × d_in   = 4096 × 4096 = 16,777,216
LoRA (r=8):   r × (d_in + d_out) = 8 × 8192 =     65,536
Compression:  ~256×
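To see how the per-matrix count scales to a whole model, here is a small sketch. The layer count (32) and the choice to adapt two square 4096 × 4096 matrices per layer (q_proj and v_proj) are assumptions typical of a 7B decoder, not figures taken from a specific checkpoint.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


d = 4096
full_update = d * d
per_matrix = lora_params(d, d, r=8)

# Assumed model shape: 32 layers, LoRA on 2 square matrices per layer.
n_layers, matrices_per_layer = 32, 2
total_lora = per_matrix * n_layers * matrices_per_layer
fraction = total_lora / 7e9

print(f"per matrix : {per_matrix:,} vs {full_update:,} ({full_update // per_matrix}x smaller)")
print(f"whole model: {total_lora:,} trainable ({fraction:.4%} of 7B)")
```

With these assumed shapes the adapters come to roughly four million parameters. Adapting more matrices per layer, or raising r, scales the total linearly.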

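A and B are initialised so that B @ A = 0 at the start of training (A is random Gaussian; B is zero), meaning the pretrained model behaviour is preserved at step 0. A scaling factor alpha/r is applied to B @ A in every forward pass, during training as well as at inference. Because alpha only rescales the update, tuning it behaves much like tuning the adapter's learning rate, and dividing by r reduces the need to retune when the rank changes.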

At inference, B @ A can be merged back into W, leaving zero overhead compared to the base model. This is a key advantage over adapter-based approaches, which insert serial bottleneck layers that add latency on every forward pass.
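The initialisation and the merge step can both be demonstrated on a toy layer with NumPy. This is a minimal sketch with made-up sizes and an assumed alpha of 16; the "trained" B is just random noise standing in for a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 8, 16       # toy sizes; alpha is an assumed value

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, random Gaussian init
B = np.zeros((d_out, r))                    # trainable, zero init so B @ A = 0
scale = alpha / r


def lora_forward(x, W, A, B, scale):
    """y = x W^T + scale * (x A^T) B^T, never forming the full d_out x d_in update."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
base_out = x @ W.T
y_step0 = lora_forward(x, W, A, B, scale)   # identical to the base model at step 0

# Stand-in for training: pretend the optimiser has moved B away from zero.
B = rng.normal(scale=0.01, size=(d_out, r))
y_adapter = lora_forward(x, W, A, B, scale)

# Merge for inference: fold the scaled update into a single dense matrix.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_step0, base_out), np.allclose(y_adapter, y_merged))
```

Keeping A and B separate during training is what saves memory. Merging afterwards is what removes the extra matrix multiplications at serving time.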

QLoRA: quantising the frozen base

LoRA still needs the frozen base weights in memory. For a 70B model in float16, that is 140 GB, which is two 80 GB A100s before any trainable parameters. QLoRA (Dettmers et al., 2023) addresses this by storing the base weights in 4-bit NormalFloat (NF4), a data type designed to minimise quantisation error for normally distributed weights. Key mechanics:

  • NF4 quantisation: each weight is mapped to one of 16 levels derived from the quantiles of a standard normal distribution, so normally distributed weights use the levels roughly equally. The base stays in 4-bit on disk and in memory.
  • Double quantisation: the quantisation constants themselves are quantised, saving roughly 0.37 bits per parameter (the constants' overhead falls from about 0.5 to about 0.127 bits per parameter).
  • Paged optimisers: GPU optimiser states are paged to CPU RAM when GPU memory is under pressure, using NVIDIA's unified memory mechanism.
  • Dequantisation to bfloat16 happens on-the-fly in CUDA kernels during the forward pass.
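The block-wise quantise/dequantise cycle behind the first and last bullets can be sketched in NumPy. Note the simplifications: the code book below is built from evenly spaced normal quantiles and is not the exact NF4 table from the QLoRA paper, the block size of 64 is an assumption, and real implementations pack two 4-bit codes per byte inside CUDA kernels.

```python
import numpy as np
from statistics import NormalDist


def normal_codebook(n_levels=16):
    """Levels at evenly spaced standard-normal quantiles, rescaled to [-1, 1].

    Simplification: illustrative only, not the exact NF4 values.
    """
    probs = (np.arange(n_levels) + 0.5) / n_levels
    levels = np.array([NormalDist().inv_cdf(p) for p in probs])
    return levels / np.abs(levels).max()


def quantise(weights, codebook, block_size=64):
    """Return 4-bit codes (stored one per uint8 here) and one absmax per block."""
    blocks = weights.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)  # assumes no all-zero block
    scaled = blocks / absmax                             # now within [-1, 1]
    codes = np.abs(scaled[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return codes, absmax


def dequantise(codes, absmax, codebook):
    """Look up each code and rescale by its block constant."""
    return codebook[codes] * absmax


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096 * 64)   # stand-in for one weight tensor
cb = normal_codebook()
codes, absmax = quantise(w, cb)
w_hat = dequantise(codes, absmax, cb).reshape(w.shape)
rel_error = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_error:.3f}")
```

The per-block `absmax` values are the "quantisation constants" that double quantisation compresses further.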

The result: a 65B model fine-tunable on a single 48 GB GPU, with the paper reporting task performance on par with full 16-bit fine-tuning; the resulting Guanaco models were evaluated mainly on the Vicuna chatbot benchmark. Memory for the base drops by roughly 4x compared to float16 (140 GB to ~35 GB for 70B).
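The memory figures, including the cost of the quantisation constants, follow from simple bit counting. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block); treat those as assumptions if your library uses different defaults.

```python
def base_memory_gb(n_params, bits_per_param):
    """Memory for the frozen base in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


block_size = 64            # weights sharing one quantisation constant (assumed)
second_level_block = 256   # constants sharing one second-level constant (assumed)

# Overhead in bits per parameter contributed by the constants.
overhead_single = 32 / block_size                                         # fp32 constant per block
overhead_double = 8 / block_size + 32 / (block_size * second_level_block) # 8-bit constants + fp32 on top
saving = overhead_single - overhead_double

fp16 = base_memory_gb(70e9, 16)
nf4 = base_memory_gb(70e9, 4)
nf4_with_constants = base_memory_gb(70e9, 4 + overhead_double)

print(f"constants: {overhead_single:.3f} -> {overhead_double:.3f} bits/param (saves {saving:.3f})")
print(f"70B base : fp16 {fp16:.0f} GB, 4-bit {nf4:.0f} GB, 4-bit + constants {nf4_with_constants:.1f} GB")
```

The constants add about a gigabyte on top of the 35 GB headline figure for a 70B model, which is why they are usually ignored in back-of-envelope estimates.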

Adapters and prompt tuning: alternative budgets

Adapter modules (Houlsby et al., 2019) insert small feed-forward bottleneck layers inside each transformer block, trained while the rest is frozen. They achieve similar parameter counts to LoRA but add serial computation at inference, making them slower unless the adapter layers can be batched efficiently.
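A bottleneck adapter is small enough to sketch directly. This is a simplified illustration rather than the exact Houlsby et al. module: biases and layer normalisation are omitted, the bottleneck width of 64 is an assumed value, and the up-projection is zero-initialised so the adapter starts as the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 4096, 64            # bottleneck width is an assumed value

W_down = rng.normal(scale=0.01, size=(d_model, bottleneck))  # trainable
W_up = np.zeros((bottleneck, d_model))                        # trainable, zero init


def adapter(h):
    """Residual bottleneck: h + up(relu(down(h)))."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up


h = rng.normal(size=(2, d_model))         # hidden states from a frozen sub-layer
out = adapter(h)
adapter_params = W_down.size + W_up.size

print(np.allclose(out, h), f"{adapter_params:,} parameters per adapter")
```

The ReLU between the two projections is what prevents the adapter from being folded into a neighbouring weight matrix, so the extra computation stays in the serving path.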

Prompt tuning (Lester et al., 2021) takes a different angle: instead of modifying weights at all, it prepends a small set of learnable "soft prompt" tokens to the input sequence and trains only those embeddings. Trainable parameters drop to perhaps 0.001 % of the model. The catch is that the method performs poorly at small scale: on models well below roughly 10B parameters, full fine-tuning beats it significantly. In Lester et al.'s T5 experiments the gap closes only at around 10B parameters.
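Mechanically, prompt tuning is just a concatenation in embedding space. The sketch below uses a toy vocabulary and an assumed prompt length of 20 tokens; `embed_with_prompt` is a hypothetical helper name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_prompt = 1000, 4096, 20   # toy vocabulary; 20 soft tokens assumed

embedding = rng.normal(size=(vocab, d_model))                    # frozen lookup table
soft_prompt = rng.normal(scale=0.02, size=(n_prompt, d_model))   # the only trainable tensor


def embed_with_prompt(token_ids):
    """Prepend the learned soft-prompt vectors to the embedded input tokens."""
    tokens = embedding[token_ids]                        # (seq_len, d_model)
    return np.concatenate([soft_prompt, tokens], axis=0)


inputs = embed_with_prompt(np.array([5, 42, 7]))
trainable = soft_prompt.size
fraction_of_7b = trainable / 7e9

print(inputs.shape, f"{trainable:,} trainable values ({fraction_of_7b:.5%} of 7B)")
```

Every request pays for the extra `n_prompt` positions in attention, which is the "+token overhead" cost of the method.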

A rough comparison across methods for a 7B base model:

| Method | Trainable params | Inference overhead | Works at 7B? |
| --- | --- | --- | --- |
| Full fine-tuning | 100 % | None | Yes (memory intensive) |
| LoRA (r=8) | ~0.1 % | None (merge) | Yes |
| QLoRA (4-bit + r=8) | ~0.1 % | Minor (dequant) | Yes (low VRAM) |
| Adapters | ~0.5-3 % | +latency per layer | Yes |
| Prompt tuning | ~0.001 % | +token overhead | Marginal |

DoRA: decomposing magnitude and direction

A structural observation motivates DoRA (Liu et al., 2024): full fine-tuning tends to update both the magnitude and direction of weight vectors, but LoRA implicitly couples them through the B @ A product. DoRA decomposes each column of the pretrained weight into a scalar magnitude and a direction, then applies LoRA only to the directional component and renormalises:

W'[:, j] = m_j * (W[:, j] + (B @ A)[:, j]) / ||W[:, j] + (B @ A)[:, j]||

The magnitude m_j is a learnable scalar per column, initialised to ||W[:, j]|| so that W' = W at step 0; B and A are the same LoRA matrices. This gives the optimiser separate control over how much a weight changes (magnitude) versus in which direction, reportedly improving convergence and downstream task quality. Because the result can be merged into a single weight matrix, it adds no inference overhead beyond standard LoRA.
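A toy NumPy version makes the decoupling visible. This sketch assumes the direction starts from the pretrained column itself and that the magnitude is initialised to that column's norm, which is how the DoRA paper describes it. The alpha/r scaling is omitted for brevity, and `dora_weight` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4                     # toy sizes

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))     # LoRA matrices, as before
B = np.zeros((d_out, r))
m = np.linalg.norm(W, axis=0, keepdims=True)   # (1, d_in): one magnitude per column


def dora_weight(W, A, B, m):
    """Apply the low-rank update to the direction, renormalise, rescale by m."""
    V = W + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)


W_init = dora_weight(W, A, B, m)               # reproduces W exactly at step 0

# Stand-in for training: move B but leave m untouched.
B_trained = rng.normal(scale=0.1, size=(d_out, r))
W_new = dora_weight(W, A, B_trained, m)
col_norms = np.linalg.norm(W_new, axis=0)

print(np.allclose(W_init, W), np.allclose(col_norms, m.ravel()))
```

However far B @ A moves the direction, each column's norm stays pinned to its entry in `m`; changing the norm requires updating `m` itself.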

When it falls down

Rank underfit. If the target task is genuinely far from the pretraining distribution (e.g., fine-tuning an English LLM for a low-resource language with novel morphology), a small rank may simply not have enough capacity to represent the required update. Increasing r helps but erodes the memory savings.
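The capacity limit can be illustrated with a singular value decomposition, since the best possible rank-r approximation of any matrix is its truncated SVD (the Eckart-Young theorem). The random Gaussian matrix below is a deliberately pessimistic stand-in for a full-rank update; real fine-tuning updates are usually far more structured.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
dW = rng.normal(size=(d, d))     # stand-in for a full-rank "ideal" update

U, S, Vt = np.linalg.svd(dW, full_matrices=False)


def best_rank_r_error(r):
    """Relative Frobenius error of the best rank-r approximation of dW."""
    approx = (U[:, :r] * S[:r]) @ Vt[:r]
    return np.linalg.norm(dW - approx) / np.linalg.norm(dW)


errors = {r: best_rank_r_error(r) for r in (4, 8, 16, 64, 128)}
for r, err in errors.items():
    print(f"r = {r:3d}: relative error {err:.3f}")
```

No choice of A and B can do better than these numbers for a given r. When the required update has a flat singular spectrum, a small rank leaves most of it unrepresented.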

Quantisation degradation in QLoRA. The 4-bit base is not lossless. On tasks requiring precise numerical reasoning or generation of structured data (code, JSON), QLoRA sometimes underperforms float16 fine-tuning by a measurable margin even when aggregate benchmarks look equivalent.

Layer selection matters. LoRA is commonly applied to the attention weight matrices (q_proj, v_proj). Omitting the MLP layers or the embedding tables can leave significant adaptation capacity on the table for some tasks. There is no universal recipe; tuning which layers to adapt is itself a hyperparameter search.

Catastrophic forgetting is not eliminated. PEFT methods reduce but do not eliminate the risk of forgetting pretraining knowledge. With a large rank or a very large learning rate, the learnable matrices can still degrade performance on held-out tasks.

Multi-adapter serving complexity. A major practical appeal of LoRA is the ability to serve many fine-tuned variants from a single base model by swapping or blending adapter weights. This works cleanly only if all adapters share the same base model version and quantisation scheme. Version drift across adapter checkpoints is a real operational headache.

Further reading

  • Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models" - https://arxiv.org/abs/2106.09685
  • Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs" - https://arxiv.org/abs/2305.14314
  • Houlsby et al. (2019), "Parameter-Efficient Transfer Learning for NLP" - https://arxiv.org/abs/1902.00751
  • Lester et al. (2021), "The Power of Scale for Parameter-Efficient Prompt Tuning" - https://arxiv.org/abs/2104.08691