Why Parameter-Efficient Fine-Tuning
Parameter-efficient fine-tuning methods adapt large pretrained models to new tasks by training only a small fraction of parameters, making customisation practical without the compute and storage costs of full fine-tuning.
GPT-3 has 175 billion parameters. Full fine-tuning produces a complete copy of all those weights per task, requires gradient memory proportional to the entire network, and leaves you with a checkpoint of about 350 GB at 16-bit precision. Run that for ten downstream tasks and you need 3.5 TB of model storage before you have written a single line of application code. This is the pressure that motivates parameter-efficient fine-tuning (PEFT).
PEFT is not a single algorithm. It is a design goal: achieve per-task adaptation quality close to full fine-tuning while touching as few parameters as possible. The methods that pursue this goal fall into three structural families: bottleneck adapter modules, additive soft-prompt methods, and low-rank weight perturbation (LoRA and its variants). Each family makes a different bet about where in the model the adaptation signal actually lives.
The Memory Arithmetic That Makes Full Fine-Tuning Unaffordable
Training a model requires holding four things in GPU memory simultaneously: the forward-pass activations, the weights themselves, the gradients, and the optimiser states. With Adam, the optimiser keeps two values per parameter (first and second moment estimates), so in fp32 you pay roughly eight bytes per parameter for the optimiser alone. In the common bf16 weights + fp32 Adam regime, the optimiser also holds an fp32 master copy of the weights, which brings the optimiser-side cost to roughly 12 bytes per parameter.
For a 7B-parameter model that is roughly 84 GB of optimiser-side state, before counting gradients and activations. Consumer hardware caps out far below that. PEFT attacks this by making the trainable-parameter count tiny: the optimiser only tracks trainable parameters, so if you train only 0.1% of them, the optimiser states shrink by a factor of 1000.
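The arithmetic above is easy to check. The following is a rough sketch, assuming the 12 bytes-per-parameter mixed-precision figure from the text and decimal gigabytes (1 GB = 1e9 bytes); the function name is illustrative and not taken from any library.

```python
# Assumption from the text: roughly 12 bytes of optimiser-side state per
# trainable parameter in the bf16 weights + fp32 Adam regime.
BYTES_PER_TRAINABLE_PARAM = 12

def optimiser_state_gb(n_trainable_params, bytes_per_param=BYTES_PER_TRAINABLE_PARAM):
    """Approximate optimiser-side memory in decimal GB."""
    return n_trainable_params * bytes_per_param / 1e9

n_params = 7_000_000_000                    # 7B-parameter model
full_gb = optimiser_state_gb(n_params)      # every parameter is trainable
peft_gb = optimiser_state_gb(n_params // 1000)  # only 0.1% is trainable

print(f"full fine-tuning: {full_gb:.0f} GB")
print(f"PEFT at 0.1%:     {peft_gb * 1000:.0f} MB")
```

Gradients and activations are deliberately left out here, so real training footprints will be larger in both cases.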
The other cost is task storage. If you deploy 20 fine-tuned variants of the same base model, you need to store 20 full copies. PEFT collapses this: store one frozen base, then store 20 small adapter files whose combined size may be smaller than a single full checkpoint.
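To make the storage comparison concrete, here is a small sketch under stated assumptions: a hypothetical 7B-parameter base stored at two bytes per parameter, and adapters that cover 0.1% of the parameter count. None of these numbers come from a specific deployment.

```python
def full_finetune_storage_gb(n_params, n_tasks, bytes_per_param=2):
    """One complete checkpoint per task."""
    return n_tasks * n_params * bytes_per_param / 1e9

def peft_storage_gb(n_params, n_tasks, trainable_fraction, bytes_per_param=2):
    """One frozen base plus one small adapter file per task."""
    base = n_params * bytes_per_param / 1e9
    adapters = n_tasks * n_params * trainable_fraction * bytes_per_param / 1e9
    return base + adapters

n_params = 7_000_000_000   # hypothetical 7B base model
n_tasks = 20

full_gb = full_finetune_storage_gb(n_params, n_tasks)
peft_gb = peft_storage_gb(n_params, n_tasks, trainable_fraction=0.001)

print(f"20 full checkpoints: {full_gb:.2f} GB")
print(f"base + 20 adapters:  {peft_gb:.2f} GB")
```

Under these assumptions the 20 adapters together add well under one gigabyte on top of the single base checkpoint.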
Adapter Modules: Insert, Freeze, Train
Houlsby et al. (2019) introduced the adapter pattern into transformers. The idea is mechanical: insert a small bottleneck sub-network after each attention and feed-forward block, freeze every original parameter, and train only the adapters. A typical adapter projects the hidden dimension d down to r << d, applies a non-linearity, then projects back up, with a residual connection around the whole module. With a bottleneck dimension r = 64 inside a model with d = 768, each adapter adds roughly 2 * d * r parameters (about 98K, ignoring biases), which is a small fraction of the block's original parameter count.
x --> [frozen attention block]
--> [adapter: Linear(d→r), GeLU, Linear(r→d), residual]
--> [frozen FFN block]
--> [adapter: Linear(d→r), GeLU, Linear(r→d), residual]
--> next layer
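The adapter box in the diagram can be sketched in plain Python. This is a minimal illustration, not the reference implementation: biases are omitted, and the up-projection is initialised to exactly zero (an assumption made here so the adapter starts as an identity function). The class and helper names are hypothetical.

```python
import math
import random

def gelu(x):
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class Adapter:
    """Bottleneck adapter: down-project, GeLU, up-project, residual add."""

    def __init__(self, d, r, seed=0):
        rng = random.Random(seed)
        # Down-projection (r x d): small random values.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]
        # Up-projection (d x r): zeros, so the adapter is initially a no-op.
        self.up = [[0.0] * r for _ in range(d)]

    def __call__(self, h):
        z = [gelu(a) for a in matvec(self.down, h)]
        delta = matvec(self.up, z)
        return [hi + di for hi, di in zip(h, delta)]  # residual connection

    def num_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

print(Adapter(d=768, r=64).num_params())   # 2 * 768 * 64 = 98304

tiny = Adapter(d=4, r=2)
hidden = [1.0, 2.0, 3.0, 4.0]
print(tiny(hidden))                         # unchanged before any training
```

Because the adapter sits on the forward path, its two matrix multiplications run on every call; that is the latency cost discussed next.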
On the GLUE benchmark, adapters trained on BERT reached within 0.4% of full fine-tuning performance while adding only about 3.6% extra parameters per task. The trade-off is inference latency: the adapter layers are sequential, so every forward pass incurs extra matrix multiplications that cannot be folded away.
LoRA: Low-Rank Weight Perturbation Without Inference Overhead
Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021), reframes the problem. Instead of inserting new layers, LoRA hypothesises that the weight update matrix itself has low intrinsic rank. For a frozen weight matrix W_0 of shape (d_out, d_in), it parameterises the update as:
W = W_0 + BA
where B is (d_out, r) and A is (r, d_in), with r << min(d_out, d_in). Only A and B are trained. W_0 is frozen.
The payoff is that at inference time you can merge the trained update back into the base weights: W_0 <- W_0 + BA. The resulting model is identical in structure to the original, so there is zero added latency. This is the property adapters cannot match.
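The merge property can be checked numerically. The sketch below uses tiny illustrative shapes and random values standing in for a trained B and A; it omits the scaling factor that the LoRA paper applies to BA. All helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, r = 6, 5, 2               # illustrative sizes only

W0 = random_matrix(d_out, d_in, rng)   # frozen base weight
B = random_matrix(d_out, r, rng)       # stand-in for a trained B
A = random_matrix(r, d_in, rng)        # stand-in for a trained A
x = random_matrix(d_in, 1, rng)        # one input as a column vector

# Training-time form: base path plus a separate low-rank path.
y_unmerged = matadd(matmul(W0, x), matmul(B, matmul(A, x)))

# Inference-time form: fold the update into a single dense matrix.
W_merged = matadd(W0, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(y_unmerged, y_merged))
trainable = r * (d_out + d_in)         # parameters in B and A
frozen = d_out * d_in                  # parameters in W0

print(f"max difference: {max_diff:.2e}")
print(f"trainable {trainable} vs frozen {frozen}")
```

The two outputs agree up to floating-point rounding, and the merged matrix has the same shape as W_0, which is why no extra work remains at inference time.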
Applied to GPT-3 175B, LoRA reduces the number of trainable parameters by roughly 10,000 times compared to full fine-tuning, while matching or exceeding full fine-tuning quality on several benchmarks. GPU memory required during training drops by roughly 3 times, and training throughput improves significantly.
Rank r is the main hyperparameter. Common practice: r = 4 or r = 8 for light domain adaptation, r = 16 to r = 64 for larger behavioural shifts. LoRA is typically applied to the query and value projection matrices in each attention layer, though applying it to all linear layers often helps on difficult tasks.
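A quick count shows how rank and target matrices drive the trainable-parameter budget. The sketch assumes square d_model x d_model projections and uses the GPT-3 175B shape reported in the GPT-3 paper (96 layers, d_model = 12288), which is not stated in this article; r = 4 on the query and value projections is one illustrative setting.

```python
def lora_trainable_params(d_model, n_layers, r, matrices_per_layer=2):
    """Each adapted d_model x d_model matrix gets B (d_model x r) and A (r x d_model)."""
    per_matrix = r * (d_model + d_model)
    return n_layers * matrices_per_layer * per_matrix

total_params = 175_000_000_000
lora_params = lora_trainable_params(d_model=12288, n_layers=96, r=4)  # 18,874,368
reduction = total_params / lora_params

print(f"{lora_params:,} trainable parameters")
print(f"about {reduction:,.0f}x fewer than full fine-tuning")
```

Doubling r or doubling the number of adapted matrices doubles the count, so the budget scales linearly in both choices.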
Prompt Tuning: Optimise the Input, Not the Weights
Prompt tuning (Lester et al., 2021) takes a different philosophical stance. Rather than touching the weights at all, it prepends a small set of learnable continuous vectors (soft prompt tokens) to the input sequence. Only these vectors are trained; the entire language model remains frozen.
input: [s1][s2][s3]...[sN]
with soft prompt: [p1][p2]...[pk][s1][s2]...[sN]
At large scales (beyond roughly 10B parameters), prompt tuning matches full fine-tuning quality. Below that scale the gap widens considerably. The parameter count is tiny: k * d_model floats, where k is the prompt length (typically 20-100 tokens). This makes prompt tuning especially attractive when the base model must stay frozen and shared across tasks. Training still backpropagates through the frozen model to reach the prompt vectors, so it needs gradient access and embedding-level inputs; a text-only API is not enough.
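The mechanics are simple enough to sketch without a framework. In the snippet below, embeddings are plain lists of floats, the token embeddings are a zero-filled stand-in for a frozen embedding lookup, and the uniform initialisation range is an assumption chosen for illustration. The function names are hypothetical.

```python
import random

def init_soft_prompt(k, d_model, seed=0):
    """k trainable vectors of width d_model (the only trainable state)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(k)]

def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Build [p1..pk][s1..sN]; the frozen model then runs on this sequence."""
    return soft_prompt + token_embeddings

k, d_model, n_tokens = 20, 8, 5        # tiny illustrative sizes

soft_prompt = init_soft_prompt(k, d_model)
token_embeddings = [[0.0] * d_model for _ in range(n_tokens)]  # stand-in values

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable = k * d_model

print(len(sequence), "positions,", trainable, "trainable floats")
```

Note that the prompt consumes k positions of the model's context window on every request, which is a cost the parameter count alone does not show.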
P-Tuning v2 (Liu et al., 2021) extended soft prompts to every transformer layer rather than only the input, which restored competitive performance on smaller models and on structured-prediction tasks, requiring only 0.1% to 3% of trainable parameters compared to full fine-tuning.
Choosing a Method: A Practical Heuristic
| Scenario | Recommended approach | Why |
|---|---|---|
| Single-GPU fine-tuning, inference speed matters | LoRA (merge at inference) | No latency penalty after merging |
| Many tasks, shared inference server | Adapters or LoRA (keep separate) | Small per-task delta, one base model |
| Frozen shared model, embedding-level input and gradients available | Prompt tuning | Base weights stay untouched; only the prompt vectors are trained |
| Very large model, extreme memory constraint | LoRA with quantised base (QLoRA) | 4-bit base + bf16 adapters fits consumer GPU |
| Task requires fine-grained token-level labels | P-Tuning v2 (multi-layer prompts) | Input-only soft prompts underperform on NER/span tasks |
The Hugging Face PEFT library provides implementations of LoRA, prompt tuning, prefix tuning, P-tuning, and IA3 under a unified API, making method-switching low-cost in practice.
When It Falls Down
Rank mismatch with task complexity. LoRA with a small rank (r = 4) can fail to capture the weight change needed for a large domain shift. If your downstream task differs significantly from the pretraining distribution (e.g., adapting a general LLM to radiology reports with domain-specific syntax), a very small rank budget may be insufficient, and the performance gap to full fine-tuning becomes material.
Prompt tuning brittleness at small scale. Below roughly 1B parameters, soft prompts are unreliable. The model does not have enough capacity to reinterpret frozen representations through learned prefixes, so performance is inconsistent and sensitive to initialisation. Do not use vanilla prompt tuning on sub-billion-parameter models without rigorous ablations.
Adapter latency compounds across layers. In serving environments with strict SLA requirements, adapter modules in every layer add measurable wall-clock latency, especially at small batch sizes where memory bandwidth matters more than compute throughput. The LoRA merge trick eliminates this, but only when you are not dynamically switching adapters per request.
Interference in multi-adapter serving. Serving multiple LoRA adapters from one base requires either weight merging (one adapter at a time, not dynamic) or a batched LoRA computation that is not as well optimised as a plain dense forward pass. Systems like S-LoRA address this but introduce their own complexity.
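The unmerged serving path can be sketched as follows: every request shares the frozen W_0, and a per-request adapter id selects which low-rank pair to add. This is a toy illustration with hand-picked integer matrices, not how S-LoRA or any production system is implemented; the adapter names and the function are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, adapters, x, adapter_id=None):
    """Compute W0 x, plus B(A x) for the adapter selected by this request."""
    base = matvec(W0, x)
    if adapter_id is None:
        return base                       # plain base-model request
    B, A = adapters[adapter_id]
    delta = matvec(B, matvec(A, x))       # extra low-rank work per request
    return [b + d for b, d in zip(base, delta)]

W0 = [[1, 0],
      [0, 1]]                             # shared frozen weight (identity here)

adapters = {
    "task_a": ([[1], [0]], [[1, 1]]),     # (B, A) with rank 1
    "task_b": ([[0], [2]], [[1, 0]]),
}

batch = [([1, 2], "task_a"), ([1, 2], "task_b"), ([1, 2], None)]
outputs = [lora_forward(W0, adapters, x, adapter_id) for x, adapter_id in batch]

print(outputs)   # [[4, 2], [1, 4], [1, 2]]
```

The same input produces three different outputs in one batch, but each adapted request pays for two extra small matrix products, and a real server has to schedule those efficiently across many adapters.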
Catastrophic forgetting is not fully avoided. PEFT does not eliminate forgetting on previously learned tasks; it simply reduces the cost of training a separate checkpoint per task. If you need a single model that handles all tasks without task identifiers, PEFT alone is not the answer.
No guarantee of data efficiency. PEFT reduces parameter count but does not reduce the amount of fine-tuning data needed for strong performance. On very small datasets (fewer than a few hundred examples), even PEFT methods can overfit, and few-shot prompting may outperform any fine-tuning approach.
Further Reading
- Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
- Houlsby, N. et al. (2019). Parameter-Efficient Transfer Learning for NLP. https://arxiv.org/abs/1902.00751
- Lester, B. et al. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. https://arxiv.org/abs/2104.08691
- Lialin, V. et al. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. https://arxiv.org/abs/2303.15647