Prefix Tuning

Fine-tuning GPT-2 on a table-to-text task changes every one of its 117 million parameters. Store three task-specific models and you have tripled your parameter budget. Scale to GPT-3 class models (175B parameters) and per-task fine-tuning becomes economically unreasonable for most teams. Prefix tuning (Li and Liang, 2021) cuts the per-task parameter count by roughly 1000x: it trains only a small set of continuous "virtual token" vectors that the model attends to alongside the input, leaving every weight in the base model untouched.

What a prefix actually is

In a standard transformer, each attention layer computes queries, keys, and values from the actual token sequence. Prefix tuning inserts a learned sequence of continuous vectors - the prefix - directly into the key-value (KV) tensors of every layer, not just the input embedding layer.

Concretely, if the model has L layers and the prefix length is m, you add m virtual key vectors and m virtual value vectors at each layer. Every real token can attend to these virtual tokens when computing its attention output, but the virtual tokens attend to nothing (they are not in the query stream). The model never "sees" the prefix as discrete text; it is just extra context in every attention head.

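K^l = [P_k^l ; X^l W_k],   V^l = [P_v^l ; X^l W_v],   Q^l = X^l W_q
Attn^l = softmax(Q^l (K^l)^T / sqrt(d)) V^l

Here X^l is the layer-l hidden state of the real tokens, [ ; ] stacks rows along the sequence axis, d is the head dimension, and P_k^l and P_v^l are the m virtual key and value vectors for layer l.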

Only P_k^l and P_v^l are trained; all weight matrices (W_k, W_v, W_q, and the FFN weights) are frozen.
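A minimal NumPy sketch of how the prefix enters one attention head may make this concrete. It assumes a single head and no causal mask (both simplifications), and the function name prefix_attention, the variable X, and the toy sizes are illustrative rather than taken from Li and Liang's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(X, W_q, W_k, W_v, P_k, P_v):
    """Single-head attention with a learned prefix prepended to K and V.

    X: (n, d) real-token hidden states; P_k, P_v: (m, d_head) prefix tensors.
    """
    Q = X @ W_q                                  # queries come from real tokens only
    K = np.concatenate([P_k, X @ W_k], axis=0)   # (m + n, d_head)
    V = np.concatenate([P_v, X @ W_v], axis=0)   # (m + n, d_head)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n, m + n)
    return softmax(scores) @ V                   # (n, d_head)

rng = np.random.default_rng(0)
n, m, d, d_head = 5, 3, 16, 8                    # toy sizes (assumed)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))   # frozen
P_k = rng.normal(size=(m, d_head))               # trainable
P_v = rng.normal(size=(m, d_head))               # trainable

out = prefix_attention(X, W_q, W_k, W_v, P_k, P_v)
print(out.shape)
```

Each row of the attention weights has m + n entries, so every real token can place weight on the prefix, while the prefix never appears as a query row.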

Why optimise at every layer, not just the embedding

Prompt tuning (Lester et al., 2021) trains soft tokens only at the input embedding layer. Prefix tuning trains them at every transformer layer. The empirical gap is meaningful: shallow soft prompts can only influence the model by propagating through all subsequent layers passively, whereas deep prefixes inject task-specific context directly at each layer's attention step. This matters especially for shorter sequences and tasks where later layers carry most of the semantic work.

P-Tuning v2 (Liu et al., 2021) later extended this same principle to BERT-style encoder models and sequence labelling tasks, demonstrating that deep prompt insertion is the key ingredient - not model architecture or task type.

The reparameterisation trick

Directly optimising the prefix parameters is numerically unstable in practice. Li and Liang found that gradient updates to raw P_k^l and P_v^l tensors led to erratic loss curves. Their fix: reparameterise through a small feed-forward network (FFN) during training.

for each training step:
    for l in 1..L:
        P_l = FFN_theta(E)             # layer l's slice of the FFN output
        P_k^l, P_v^l = split(P_l)      # key half and value half
    loss = task loss of the frozen model run with these prefixes
    update E and theta by gradient descent on loss

E and FFN_theta are the actual trainable objects. After training, you materialise P_l = FFN_theta(E) for every layer, discard the FFN, and store only the resulting prefix matrices. At inference the FFN is gone; only the prefix KV tensors remain. This is purely a training-time stabiliser, not a permanent architectural addition.
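The materialise-then-discard step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-layer tanh network, the sizes d_embed and d_hidden, and the function names ffn and materialise are all hypothetical, and the training loop itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, m, d_model = 12, 10, 768   # GPT-2-small-like sizes (assumed)
d_embed, d_hidden = 64, 128          # hypothetical sizes for E and the FFN

# Trainable objects during training: E and the FFN parameters theta.
E = rng.normal(scale=0.02, size=(m, d_embed))
theta = {
    "W1": rng.normal(scale=0.02, size=(d_embed, d_hidden)),
    "b1": np.zeros(d_hidden),
    "W2": rng.normal(scale=0.02, size=(d_hidden, n_layers * 2 * d_model)),
    "b2": np.zeros(n_layers * 2 * d_model),
}

def ffn(E, theta):
    # Two-layer MLP with tanh; maps each prefix position to all layers' K and V.
    h = np.tanh(E @ theta["W1"] + theta["b1"])
    return h @ theta["W2"] + theta["b2"]

def materialise(E, theta):
    # (m, L*2*d) -> (L, 2, m, d): index 0 holds P_k^l, index 1 holds P_v^l.
    flat = ffn(E, theta)
    return flat.reshape(m, n_layers, 2, d_model).transpose(1, 2, 0, 3)

prefix = materialise(E, theta)       # the only tensor kept for inference
del theta, E                         # the FFN and E are discarded after training
P_k_0, P_v_0 = prefix[0, 0], prefix[0, 1]   # layer 0 keys and values
print(prefix.shape, P_k_0.shape)
```

The stored tensor has the same size whatever FFN was used during training, which is why the reparameterisation adds nothing to the inference-time footprint.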

A GPT-2 prefix of length 10 stores 10 key vectors and 10 value vectors per layer, across 12 layers. For a single 64-dimensional head that is 2 x 10 x 64 x 12 = 15,360 values; across all 12 heads (hidden width 768) it is 184,320. GPT-2 has 117M parameters, so the stored prefix is roughly 0.16% of the model.
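The arithmetic is easy to check. The sketch below computes the count both for one 64-dimensional head and for the full hidden width; the configuration values (12 layers, 12 heads, 64 dimensions per head) are assumed for GPT-2 small, and prefix_param_count is a hypothetical helper name.

```python
def prefix_param_count(n_layers, prefix_len, width):
    # One key vector and one value vector of the given width
    # per prefix position, per layer.
    return 2 * prefix_len * width * n_layers

n_layers, n_heads, d_head = 12, 12, 64   # assumed GPT-2 small configuration
base_params = 117_000_000                # figure quoted in the text

per_head = prefix_param_count(n_layers, 10, d_head)
all_heads = prefix_param_count(n_layers, 10, n_heads * d_head)

print(per_head, all_heads, f"{100 * all_heads / base_params:.2f}%")
# prints: 15360 184320 0.16%
```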

How it compares to adjacent methods

Method	Where parameters live	Layers touched	Trainable parameters (% of base model)
Full fine-tuning	All weights	All	100%
Adapter layers	New bottleneck modules between sublayers	All	0.5-8%
Prompt tuning	Input embeddings only	1 (input)	<0.1%
Prefix tuning	KV tensors at every layer	All	~0.1%
LoRA	Low-rank delta matrices on attention weights	Selected	0.1-1%

Prefix tuning sits between prompt tuning (cheapest, weakest) and adapters (more expensive, more expressive). LoRA is now generally preferred over prefix tuning for its better gradient flow and wider hardware support, but prefix tuning remains instructive because it exposes the minimal sufficient mechanism: you only need to perturb attention context, not weight matrices, to adapt a frozen model.
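To see roughly where the table's ranges come from, here is a back-of-the-envelope comparison on a GPT-2-small-sized model. Every hyperparameter is an illustrative assumption (prefix and prompt length 10, LoRA rank 8 on W_q and W_v, two adapters per layer with bottleneck 64), and the formulas ignore small extras such as layer norms.

```python
d, n_layers, base = 768, 12, 117_000_000   # assumed GPT-2-small-like sizes

def prompt_tuning(m):
    # Soft tokens at the input embedding layer only.
    return m * d

def prefix_tuning(m):
    # Key and value vectors at every layer.
    return 2 * m * d * n_layers

def lora(r, n_matrices=2):
    # A rank-r pair (d x r and r x d) for each adapted weight matrix.
    return n_matrices * 2 * d * r * n_layers

def adapters(b, per_layer=2):
    # Down and up projections with biases, per adapter module.
    return per_layer * (2 * d * b + b + d) * n_layers

counts = {
    "prompt tuning (m=10)": prompt_tuning(10),
    "prefix tuning (m=10)": prefix_tuning(10),
    "LoRA (r=8, W_q and W_v)": lora(8),
    "adapters (bottleneck 64)": adapters(64),
}
for name, n in counts.items():
    print(f"{name:26s} {n:>9,d}  {100 * n / base:.3f}%")
```

Under these assumptions the ordering matches the table: prompt tuning is smallest, prefix tuning and LoRA land within a factor of two of each other, and adapters are roughly an order of magnitude larger.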

When it falls down

Short contexts and tight length budgets. The prefix consumes sequence length. If you allocate 100 prefix tokens and your model's context window is 512, you have already spent about 20% of your budget (100/512) before the actual input arrives. For tasks with long documents this is a real constraint.
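The budget can be checked in a couple of lines; remaining_context is a hypothetical helper, and the numbers are the ones from the example above.

```python
def remaining_context(window, prefix_len):
    # Tokens left for real input and output once the prefix is reserved.
    if prefix_len >= window:
        raise ValueError("prefix does not fit in the context window")
    return window - prefix_len

window, prefix_len = 512, 100
left = remaining_context(window, prefix_len)
print(left, f"{100 * prefix_len / window:.1f}% spent on the prefix")
# prints: 412 19.5% spent on the prefix
```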

Token-level tasks. Sequence labelling (named entity recognition, part-of-speech tagging) requires per-token predictions conditioned on precise local context. Prefix tuning was originally validated on generation tasks (table-to-text, summarisation). P-Tuning v2 addressed this gap, but plain prefix tuning as described in the original paper under-performs fine-tuning on token-level NLU, particularly at smaller model scales (under 1B parameters).

Underperformance below 1B parameters. Li and Liang's results on GPT-2-medium (345M) show a clear gap versus full fine-tuning on low-data regimes for some datasets. The method's competitiveness largely kicks in at GPT-2-XL (1.5B) and beyond. Prompt tuning (Lester et al.) makes the same observation: soft-prompt methods become competitive with full fine-tuning only as scale increases.

Catastrophic forgetting is traded for restricted expressiveness. The flip side of frozen weights is that the model cannot revise inductive biases that hurt the target task. If the pretrained model has a strong prior that conflicts with the target task distribution, the prefix has limited capacity to override it.

No prefix sharing across tasks at inference. The base weights are shared, but each task needs its own prefix KV tensors loaded into the KV cache. This is cheaper than maintaining separate full models, but serving N tasks simultaneously requires N sets of prefix tensors cached across all layers. At very large N and large model depth, memory pressure re-enters.
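A rough estimate shows how this adds up. The deployment below is hypothetical: 1,000 tasks, prefix length 100, fp16 storage, and a GPT-3-sized model with 96 layers and hidden width 12288. The helper name prefix_bytes is likewise an assumption.

```python
def prefix_bytes(n_tasks, n_layers, prefix_len, d_model, bytes_per_value=2):
    # Keys and values (factor 2) for every prefix position, layer, and task.
    return n_tasks * n_layers * 2 * prefix_len * d_model * bytes_per_value

per_task = prefix_bytes(n_tasks=1, n_layers=96, prefix_len=100, d_model=12288)
total = prefix_bytes(n_tasks=1000, n_layers=96, prefix_len=100, d_model=12288)

print(f"{per_task / 2**20:.0f} MiB per task, {total / 2**30:.1f} GiB for 1000 tasks")
# prints: 450 MiB per task, 439.5 GiB for 1000 tasks
```

A single prefix is tiny next to the base model, but keeping a thousand of them resident is not.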

Further reading