Applied LLMs
Prefix Tuning
Prefix tuning freezes all pretrained model weights and instead optimises a small set of continuous, task-specific vectors prepended to every layer's key-value cache, achieving within a few points of full fine-tuning while training roughly 0.1% of the original parameters.
Fine-tuning GPT-2 on a table-to-text task updates every one of its 117 million parameters. Store three task-specific models and you have tripled your storage bill. Scale to GPT-3 class models (175B parameters) and per-task fine-tuning becomes economically unreasonable for most teams. Prefix tuning (Li and Liang, 2021) cuts the per-task parameter count by roughly 1000x: it trains only a small set of "virtual token" vectors that every attention layer can attend to, leaving every weight in the base model untouched.
What a prefix actually is
In a standard transformer, each attention layer computes queries, keys, and values from the actual token sequence. Prefix tuning inserts a learned sequence of continuous vectors - the prefix - directly into the key-value (KV) tensors of every layer, not just the input embedding layer.
Concretely, if the model has L layers and the prefix length is m, you add m virtual key vectors and m virtual value vectors at each layer. Every real token can attend to these virtual tokens when computing its attention output, but the virtual tokens attend to nothing (they are not in the query stream). The model never "sees" the prefix as discrete text; it is just extra context in every attention head.
Layer l attention (with prefix):
K_l = [P_k^l ; W_k X] # prefix keys concatenated with real keys
V_l = [P_v^l ; W_v X] # prefix values concatenated with real values
Q_l = W_q X # queries from real tokens only
out = softmax(Q_l K_l^T / sqrt(d)) V_l
Only P_k^l and P_v^l are trained; all weight matrices (W_k, W_v, W_q, and the FFN weights) are frozen.
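The equations above can be sketched for a single attention head in NumPy. This is a minimal illustration, not the authors' implementation: the function name `prefix_attention`, the toy sizes, and the random weights are assumptions, the causal mask is omitted, and the code uses the row-vector convention (`X @ W`) rather than the `W X` form written above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(X, W_q, W_k, W_v, P_k, P_v):
    """Single-head attention for one layer with a learned prefix.

    X: (n, d_model) real-token hidden states
    W_q, W_k, W_v: (d_model, d) frozen projections
    P_k, P_v: (m, d) trainable prefix keys and values
    """
    Q = X @ W_q                                   # queries from real tokens only
    K = np.concatenate([P_k, X @ W_k], axis=0)    # (m + n, d)
    V = np.concatenate([P_v, X @ W_v], axis=0)    # (m + n, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (n, m + n)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, m, d_model, d = 5, 3, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
P_k = rng.normal(size=(m, d))
P_v = rng.normal(size=(m, d))

out, weights = prefix_attention(X, W_q, W_k, W_v, P_k, P_v)
```

The output keeps one row per real token, while each attention row spreads its weight over m + n positions: the prefix changes what real tokens attend to without adding any query rows.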
Why optimise at every layer, not just the embedding
Prompt tuning (Lester et al., 2021) trains soft tokens only at the input embedding layer. Prefix tuning trains them at every transformer layer. The empirical gap is meaningful: shallow soft prompts can only influence the model by propagating through all subsequent layers passively, whereas deep prefixes inject task-specific context directly at each layer's attention step. This matters especially for shorter sequences and tasks where later layers carry most of the semantic work.
P-Tuning v2 (Liu et al., 2021) later extended this same principle to BERT-style encoder models and sequence labelling tasks, suggesting that deep prompt insertion, rather than any particular model architecture or task type, is the key ingredient.
The reparameterisation trick
Directly optimising the prefix parameters is unstable in practice. Li and Liang report that updating the raw P_k^l and P_v^l tensors led to unstable optimisation and a slight drop in performance. Their fix: reparameterise through a small feed-forward network (FFN) during training.
P_l = FFN_theta(E) # E is a smaller embedding matrix
E and FFN_theta are the actual trainable objects. After training, you materialise P_l = FFN_theta(E) for every layer, discard the FFN, and store only the resulting prefix matrices. At inference the FFN is gone; only the prefix KV tensors remain. This is purely a training-time stabiliser, not a permanent architectural addition.
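A forward-only sketch of the materialisation step may help. Everything specific here is an assumption for illustration: the two-layer tanh network, the sizes `d_small` and `d_hidden`, the random initialisation, and the helper name `materialise_prefix`. The training loop is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, m, d_model = 12, 10, 768   # GPT-2-small-like shape (assumed)
d_small, d_hidden = 64, 128            # hypothetical reparameterisation sizes

# Trainable objects during training: E and the FFN weights.
E = rng.normal(size=(m, d_small))
W1 = rng.normal(size=(d_small, d_hidden)) * 0.02
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, 2 * num_layers * d_model)) * 0.02
b2 = np.zeros(2 * num_layers * d_model)

def materialise_prefix(E, W1, b1, W2, b2, num_layers, d_model):
    """Run E through the FFN once; return an array of shape (L, 2, m, d_model)."""
    m = E.shape[0]
    h = np.tanh(E @ W1 + b1)                       # (m, d_hidden)
    flat = h @ W2 + b2                             # (m, 2 * L * d_model)
    per_pos = flat.reshape(m, num_layers, 2, d_model)
    return per_pos.transpose(1, 2, 0, 3)           # (L, 2, m, d_model)

prefix = materialise_prefix(E, W1, b1, W2, b2, num_layers, d_model)
P_k, P_v = prefix[:, 0], prefix[:, 1]              # each (L, m, d_model)

train_params = sum(a.size for a in (E, W1, b1, W2, b2))
stored_params = prefix.size                        # what survives after training
```

In this sketch the training-time parameter set is larger than the stored prefix, which is why discarding the FFN after materialisation matters for the per-task storage cost.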
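For GPT-2 small (12 layers, hidden width 768, i.e. 12 heads of dimension 64), a prefix of length 10 stores 10 key vectors and 10 value vectors per layer, each spanning the full hidden width: 2 × 10 × 12 × 768 = 184,320 parameters. Against the model's 117M parameters, that is roughly 0.16%, in line with the ~0.1% figure quoted at the top. Counting only a single 64-dimensional head would give about 15,000 parameters, which understates the stored prefix by a factor of 12. The reparameterisation network adds trainable parameters during training only and is not stored.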
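The arithmetic is easy to check. The sketch below assumes GPT-2 small's shape (12 layers, 12 heads of dimension 64, hidden width 768) and uses a hypothetical helper, `prefix_param_count`. It computes the count for a single 64-dimensional head slice and for the full hidden width; the second is what has to be stored, since every head needs its own prefix keys and values.

```python
def prefix_param_count(num_layers: int, prefix_len: int, width: int) -> int:
    """Keys plus values: two vectors of `width` per prefix position per layer."""
    return 2 * num_layers * prefix_len * width

num_layers, prefix_len = 12, 10
head_dim, num_heads = 64, 12
base_params = 117_000_000            # GPT-2 figure used in the text

per_head = prefix_param_count(num_layers, prefix_len, head_dim)
full_width = prefix_param_count(num_layers, prefix_len, head_dim * num_heads)
ratio_pct = 100 * full_width / base_params

print(per_head)             # 15360
print(full_width)           # 184320
print(round(ratio_pct, 2))  # 0.16
```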
How it compares to adjacent methods
| Method | Where parameters live | Layers touched | Typical param count |
|---|---|---|---|
| Full fine-tuning | All weights | All | 100% |
| Adapter layers | New bottleneck modules between sublayers | All | 0.5-8% |
| Prompt tuning | Input embeddings only | 1 (input) | <0.1% |
| Prefix tuning | KV tensors at every layer | All | ~0.1% |
| LoRA | Low-rank delta matrices on attention weights | Selected | 0.1-1% |
Prefix tuning sits between prompt tuning (cheapest, weakest) and adapters (more expensive, more expressive). LoRA is now generally preferred over prefix tuning: its low-rank updates can be merged into the base weights after training, so it adds no inference-time cost and does not consume context length, and it is widely supported in tooling. Prefix tuning remains instructive because it shows that perturbing attention context alone, without touching any weight matrix, is enough to adapt a frozen model.
When it falls down
Short contexts and tight length budgets. The prefix consumes sequence length. If you allocate 100 prefix tokens and your model's context window is 512, you have already spent 20% of your budget before the actual input arrives. For tasks with long documents this is a real constraint.
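The budget calculation in that example can be written out as a small helper. `remaining_context` is a hypothetical name, and the sketch assumes, as the paragraph does, that prefix positions count against the same window as real tokens.

```python
def remaining_context(context_window: int, prefix_len: int) -> tuple[int, float]:
    """Return (tokens left for real input, share of the window used by the prefix)."""
    if not 0 <= prefix_len < context_window:
        raise ValueError("prefix_len must be non-negative and smaller than the window")
    return context_window - prefix_len, prefix_len / context_window

remaining, prefix_share = remaining_context(context_window=512, prefix_len=100)
print(remaining)                # 412
print(f"{prefix_share:.1%}")    # 19.5%
```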
Token-level tasks. Sequence labelling (named entity recognition, part-of-speech tagging) requires per-token predictions conditioned on precise local context. Prefix tuning was originally validated on generation tasks (table-to-text, summarisation). P-Tuning v2 addressed this gap, but plain prefix tuning as described in the original paper under-performs fine-tuning on token-level NLU, particularly at smaller model scales (under 1B parameters).
Weaker results at smaller scales. Li and Liang ran their table-to-text experiments on GPT-2 medium (345M) and GPT-2 large (774M), so their results say little about behaviour above 1B parameters. Lester et al. studied scale directly and found that prompt tuning becomes competitive with full fine-tuning only as model size increases. Expect soft-prompt methods in general to leave a gap on smaller models, and measure it on your own task before committing.
Restricted expressiveness is the price of avoiding catastrophic forgetting. Because the weights are frozen, the model cannot unlearn inductive biases that hurt the target task. If the pretrained model has a strong prior that conflicts with the target task distribution, the prefix has limited capacity to override it.
Per-task prefix state at inference. The base weights are shared across tasks, but each task needs its own prefix KV tensors loaded into the KV cache. This is far cheaper than maintaining separate full models, but serving N tasks simultaneously requires N sets of prefix tensors, each spanning all layers. At very large N and large model depth, memory pressure re-enters.
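The memory cost of keeping many prefixes resident can be estimated with a back-of-the-envelope helper. `prefix_cache_bytes` is a hypothetical function, and the task count, prefix length, and fp16 storage are assumptions. The two shapes are GPT-2 small (12 layers, width 768) and a GPT-3-like model (96 layers, width 12288).

```python
def prefix_cache_bytes(num_tasks: int, num_layers: int, prefix_len: int,
                       d_model: int, bytes_per_value: int = 2) -> int:
    """Bytes to hold prefix keys and values for every task (fp16 by default)."""
    values_per_task = 2 * num_layers * prefix_len * d_model   # keys + values
    return num_tasks * values_per_task * bytes_per_value

MIB = 1024 ** 2
GIB = 1024 ** 3

small = prefix_cache_bytes(num_tasks=1000, num_layers=12, prefix_len=10, d_model=768)
large = prefix_cache_bytes(num_tasks=1000, num_layers=96, prefix_len=10, d_model=12288)

print(f"{small / MIB:.1f} MiB")   # 351.6 MiB
print(f"{large / GIB:.1f} GiB")   # 43.9 GiB
```

Under these assumptions a thousand prefixes are negligible on a small model but reach tens of gibibytes on a GPT-3-scale one, which is where loading prefixes on demand starts to look attractive.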
Further reading
- Li, X. L. and Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." arXiv:2101.00190. https://arxiv.org/abs/2101.00190
- Lester, B., Al-Rfou, R., and Constant, N. (2021). "The Power of Scale for Parameter-Efficient Prompt Tuning." arXiv:2104.08691. https://arxiv.org/abs/2104.08691
- Liu, X. et al. (2021). "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks." arXiv:2110.07602. https://arxiv.org/abs/2110.07602
- HuggingFace PEFT documentation - Soft prompts conceptual guide. https://huggingface.co/docs/peft/conceptual_guides/prompting