Prompt Tuning with Soft Prompts

Feeding a frozen 11-billion-parameter T5 model a handful of learned, token-sized vectors yields task performance that matches full fine-tuning of every weight in that model. That is the core empirical result of Lester et al. (2021), and it inverts the usual assumption that adaptation requires changing the model's internal weights.

The practical consequence is stark. A single frozen model can serve hundreds of tasks concurrently, with each task represented by its own tiny "soft prompt" of roughly 100 vectors. Storage and serving costs collapse; the model weights never fork.

What a soft prompt actually is

A standard text prompt is a sequence of discrete tokens drawn from a vocabulary. Each token maps to a fixed row of the model's embedding table. You cannot run gradient descent through a discrete choice of token, so manual prompts are optimised by trial and error.

A soft prompt replaces the prefix of the token sequence with a matrix of free-floating vectors, one per "virtual token", each the same dimension as the model's token embedding space. Call the soft prompt P with shape (k, d), where k is the number of virtual tokens and d is the embedding dimension. The actual input tokens are still embedded normally; the soft-prompt vectors are simply prepended in embedding space before the first transformer layer.

During training, only P is updated. The rest of the model, billions of parameters, is completely frozen. The gradient signal flows back through the frozen transformer blocks (whose weights do not change) and into P. The frozen layers still need a full forward and backward pass, so a training step is not free, but the number of parameters being optimised is only k × d: for example, 100 × 4,096 = 409,600 values for a 100-token prompt in a model with a 4,096-dimensional embedding space, versus billions for full fine-tuning.

Input tokens:  [t1, t2, t3, ..., tn]
Embeddings:    [e1, e2, e3, ..., en]

With soft prompt (k=5 virtual tokens):
               [p1, p2, p3, p4, p5, e1, e2, ..., en]
                ↑________________↑
                Only p1..p5 are trainable
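
The same prepending step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sizes, the random embedding table, and the helper name embed_with_soft_prompt are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 1000, 64   # hypothetical toy sizes, not taken from any real model
k = 5                      # number of virtual tokens

# Frozen embedding table: one row per vocabulary token.
embedding_table = rng.normal(size=(vocab_size, d))

# Trainable soft prompt P with shape (k, d).
P = rng.normal(scale=0.5, size=(k, d))

def embed_with_soft_prompt(token_ids, embedding_table, P):
    """Look up the token embeddings and prepend the soft-prompt rows."""
    token_embeddings = embedding_table[token_ids]          # (n, d)
    return np.concatenate([P, token_embeddings], axis=0)   # (k + n, d)

token_ids = np.array([12, 7, 256, 3])   # hypothetical token ids
inputs = embed_with_soft_prompt(token_ids, embedding_table, P)

print(inputs.shape)   # (9, 64)
print(P.size)         # 320 trainable values, i.e. k * d
```

The array returned here is what the first transformer layer would receive; nothing about the token lookup itself changes.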

The downstream label prediction head is the same frozen LM head; no new parameters are introduced there either.
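
To make "only P is updated" concrete, here is a toy training loop. It assumes a stand-in frozen "model" (mean-pooling followed by a fixed linear layer) instead of a real transformer, and every name and size is hypothetical. The point is only the bookkeeping: the gradient passes through frozen weights and lands in P.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, d, num_classes = 4, 6, 8, 3             # hypothetical toy sizes

W_frozen = rng.normal(size=(d, num_classes))  # stand-in for the frozen model
E = rng.normal(size=(n, d))                   # frozen embeddings of the input tokens
P = np.zeros((k, d))                          # the only trainable parameters (zero init for simplicity)
target = 2                                    # hypothetical gold class

def forward(P):
    x = np.concatenate([P, E], axis=0)        # prepend the soft prompt
    logits = x.mean(axis=0) @ W_frozen        # frozen "model": mean-pool, then linear
    z = logits - logits.max()                 # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

W_before = W_frozen.copy()
loss_start = -np.log(forward(P)[target])

for _ in range(200):
    probs = forward(P)
    d_logits = probs.copy()
    d_logits[target] -= 1.0                   # gradient of cross-entropy w.r.t. the logits
    d_pooled = W_frozen @ d_logits            # backprop through the frozen linear layer
    grad_P = np.tile(d_pooled / (k + n), (k, 1))  # backprop through the mean-pool into P
    P -= 1.0 * grad_P                         # update P only; W_frozen and E are untouched

loss_end = -np.log(forward(P)[target])
print(loss_end < loss_start)                  # True
print(np.array_equal(W_frozen, W_before))     # True: the frozen weights never changed
```

In a real setup an autodiff framework computes grad_P; the hand-written gradient is used here only to keep the sketch dependency-free.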

Initialisation matters (more than you might expect)

Random initialisation works, but converges slowly and to worse minima. Two better strategies:

Initialisation	How	Why it helps
Vocabulary sampling	Sample k token embeddings at random from the existing embedding matrix	The soft prompt starts in the distribution the model already understands
Class label embeddings	Initialise each virtual token with the embedding of the output class label (e.g. "positive", "negative")	Injects task semantics from the start; especially good for classification

Lester et al. found that class-label initialisation consistently outperforms random initialisation, and vocabulary-sampled initialisation sits in between. The gap shrinks as model scale grows, but at smaller scales (below ~1B parameters) the initialisation choice is the difference between useful and useless.
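
The three strategies can be sketched as follows. This is an illustration under assumptions: the embedding table is random, the label token ids are made up, sampling is uniform over the whole toy vocabulary, and filling leftover slots by vocabulary sampling is one reasonable choice when k exceeds the number of class labels.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, k = 1000, 64, 8   # hypothetical toy sizes
embedding_table = rng.normal(size=(vocab_size, d))

def init_random(k, d, rng, scale=0.5):
    """Uniform random initialisation in [-scale, scale]."""
    return rng.uniform(-scale, scale, size=(k, d))

def init_from_vocab(k, embedding_table, rng):
    """Copy the embeddings of k distinct, randomly chosen vocabulary tokens."""
    ids = rng.choice(embedding_table.shape[0], size=k, replace=False)
    return embedding_table[ids].copy()

def init_from_class_labels(k, label_token_ids, embedding_table, rng):
    """Start with the class-label embeddings; fill remaining slots by vocabulary sampling."""
    label_rows = embedding_table[label_token_ids[:k]]   # fancy indexing returns a copy
    n_fill = k - len(label_rows)
    if n_fill > 0:
        filler = init_from_vocab(n_fill, embedding_table, rng)
        return np.concatenate([label_rows, filler], axis=0)
    return label_rows

label_ids = [17, 42]   # hypothetical token ids for "positive" and "negative"
P = init_from_class_labels(k, label_ids, embedding_table, rng)
print(P.shape)         # (8, 64)
```

Because each function copies rows out of the table, later updates to P never write back into the frozen embeddings.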

The scale threshold: why size unlocks the method

The headline result from the paper is not that prompt tuning works; it is that prompt tuning becomes competitive with full fine-tuning only at sufficient scale. Below roughly 1 billion parameters, prompt tuning lags behind full fine-tuning by meaningful margins on SuperGLUE. At T5-XXL (11B), the gap disappears.

The intuition: a large model has already learnt extremely rich internal representations. The soft prompt does not need to teach the model new features; it needs only to activate and route the features that already exist. A small model does not have that repertoire, so no prefix can compensate.

This creates a practical constraint. If you are adapting a 7B model, prompt tuning may underperform LoRA or adapters. If you are adapting GPT-3 or T5-11B class models, it becomes genuinely competitive with full fine-tuning while being far cheaper to serve.

Relationship to prefix tuning

Prefix tuning (Li and Liang, 2021) predates soft-prompt tuning and shares the core idea: prepend learnable continuous vectors to a frozen model. The distinctions matter in practice:

Where vectors are inserted. Prefix tuning prepends trainable vectors to the keys and values of the attention computation at every transformer layer. Soft-prompt tuning inserts them only at the input embedding layer and lets the frozen layers propagate the signal. Prefix tuning has more expressive capacity but more parameters to train.
Training stability. Li and Liang found direct optimisation of the prefix vectors unstable and introduced a reparametrisation (a small MLP that generates the prefix) which is discarded after training. Lester et al. train the input-layer prompt vectors directly and report stable training without reparametrisation.
Parameter count. With k prefix tokens and L layers, prefix tuning scales as O(k × d × L); soft-prompt tuning scales as O(k × d), so it is significantly smaller.
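
A back-of-the-envelope comparison of the two parameter counts. The factor of 2 assumes one key prefix and one value prefix per layer, and the dimensions are hypothetical, not those of any particular model.

```python
def prompt_tuning_params(k: int, d: int) -> int:
    """Soft-prompt tuning: a single (k, d) matrix at the input layer."""
    return k * d

def prefix_tuning_params(k: int, d: int, num_layers: int) -> int:
    """Prefix tuning: a key prefix and a value prefix of shape (k, d) at every layer.

    Counts only the final prefixes, not the reparametrisation MLP that is
    discarded after training.
    """
    return 2 * k * d * num_layers

k, d, num_layers = 100, 1024, 24   # hypothetical model dimensions

print(prompt_tuning_params(k, d))               # 102400
print(prefix_tuning_params(k, d, num_layers))   # 4915200
```

With these assumed numbers the per-task artefact for prefix tuning is 48 times larger, which is the O(L) factor made visible.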

P-Tuning v2 (Liu et al., 2022) revisits the "deep" prefix approach from a different angle: it applies learnable prompts at every layer but for NLU rather than generation tasks, and shows that with careful tuning this can match fine-tuning universally across scales, not just at 11B+.

When it falls down

Small and mid-sized models. Below roughly 1B parameters the method reliably underperforms full fine-tuning and often underperforms LoRA. If you cannot access a very large frozen model, soft-prompt tuning is the wrong tool.

Long-tail and complex reasoning tasks. Tasks requiring multi-step reasoning, code generation, or structured output with many constraints tend to benefit from deeper weight updates that soft prompts cannot provide. The prompt can steer, but cannot install new compositional capabilities.

Very short prompts fail to generalise. With fewer than about 20 virtual tokens, performance degrades noticeably on harder tasks. The soft prompt holds only k × d trainable values, so too few tokens means too little capacity.

Silent cross-contamination in multi-tenant serving. When many users share one frozen backbone, each with a different soft prompt, results are correct only if the serving framework isolates prompt matrices per request. A bug that mixes prompt vectors across requests produces silently wrong outputs, not an error.
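
One way to keep prompts and requests aligned in a mixed-task batch is to select each prompt by task id at batch-construction time and check the result. The sketch below is an assumption-laden illustration: the task names, the prompt_bank layout, and build_batch are hypothetical, and all requests are assumed to have the same token length.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 16   # hypothetical toy sizes

# One soft prompt per task, stacked into a single (num_tasks, k, d) array.
task_names = ["sentiment", "topic", "toxicity"]   # hypothetical tasks
prompt_bank = rng.normal(size=(len(task_names), k, d))
task_index = {name: i for i, name in enumerate(task_names)}

def build_batch(requests, prompt_bank, task_index):
    """requests: list of (task_name, token_embeddings with shape (n, d))."""
    ids = np.array([task_index[task] for task, _ in requests])  # KeyError on an unknown task
    prompts = prompt_bank[ids]                        # (batch, k, d); row i belongs to request i
    tokens = np.stack([emb for _, emb in requests])   # (batch, n, d)
    return np.concatenate([prompts, tokens], axis=1)  # (batch, k + n, d)

requests = [
    ("topic", rng.normal(size=(5, d))),
    ("sentiment", rng.normal(size=(5, d))),
    ("topic", rng.normal(size=(5, d))),
]
batch = build_batch(requests, prompt_bank, task_index)

# Cheap guard against a silent mix-up: each row must start with its own task's prompt.
for row, (task, _) in zip(batch, requests):
    assert np.array_equal(row[:k], prompt_bank[task_index[task]])

print(batch.shape)   # (3, 9, 16)
```

The guard costs almost nothing next to a forward pass and turns a silent corruption into a loud failure.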

Interpretability is close to zero. The learned vectors do not, in general, coincide with any vocabulary embedding. You cannot read the soft prompt as text, meaningfully diff two versions of it, or use it to understand what the model was told. For regulated domains where model behaviour must be auditable, this matters.

Activation memory during training. Although no model weights are updated, backpropagation still traverses all frozen layers to compute the gradient with respect to P. For a very large model, this means activation memory during training is roughly the same as for full fine-tuning unless activation checkpointing is applied. The savings are in weight gradients, optimiser state, and stored parameters per task, not in activation memory.
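
A rough estimate of where the training-memory savings do come from. It assumes 32-bit values and an Adam-style optimiser that keeps two moment buffers per trainable parameter; the prompt size (k = 100, d = 4,096) is hypothetical. Activation memory and the frozen weights themselves are deliberately left out, since both setups pay for those.

```python
def trainable_state_bytes(num_trainable: int, bytes_per_value: int = 4) -> int:
    """Weights + gradients + two Adam moment buffers, for trainable parameters only."""
    return num_trainable * bytes_per_value * 4

full_ft = trainable_state_bytes(11_000_000_000)   # every weight of an 11B model is trainable
prompt = trainable_state_bytes(100 * 4096)        # hypothetical k=100, d=4096 soft prompt

print(full_ft / 1e9)   # 176.0  (GB)
print(prompt / 1e6)    # 6.5536 (MB)
```

Under these assumptions the optimiser-related state shrinks from hundreds of gigabytes to a few megabytes, while the activation cost discussed above is unchanged.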

Further reading