Speculative Decoding

Autoregressive decoding is sequential by construction: token t+1 depends on every token up to t. That makes a single decode step memory-bandwidth-bound at small batch sizes (you read all the weights and the KV cache to emit one token), so the GPU's compute units sit mostly idle. Speculative decoding turns that idle compute into a speedup: a small, cheap "draft" model guesses several tokens ahead, and the large "verifier" model checks all of them in a single forward pass. Done right, the output distribution is identical to that of the verifier running alone.

The protocol

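1. A draft model q autoregressively proposes k candidate tokens x_1, ..., x_k (typically k = 4-8).
2. The verifier model p runs a single forward pass over the prefix plus those k tokens, producing its distribution at each of the k+1 positions in parallel.
3. Going left to right, accept each proposed token x_i with probability min(1, p(x_i)/q(x_i)). At the first rejection, discard the remaining proposals and sample a corrected token from the adjusted distribution (p - q)_+, that is, max(0, p - q) renormalized to sum to 1 (Leviathan et al. 2023).
4. Append the accepted tokens plus the correction (or, if all k were accepted, one extra token sampled from the verifier's distribution at position k+1) and loop.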

Because step 3 is a modified rejection-sampling rule, every emitted token is distributed exactly as a sample from p. There is no quality loss, even though the cheap model proposed most of the tokens.
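The accept/reject rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `speculative_step` and its array layout are assumptions made for this example, and the model forward passes are replaced by precomputed probability arrays.

```python
import numpy as np

def speculative_step(p_probs, q_probs, draft_tokens, rng):
    """Verify one batch of drafted tokens (hypothetical helper).

    p_probs:      (k+1, V) verifier distributions, one per drafted position
                  plus one for the position after the last draft.
    q_probs:      (k, V) draft distributions the tokens were sampled from.
    draft_tokens: k token ids proposed by the draft model.
    Returns the tokens to append: between 1 and k+1 of them.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))
            continue
        # First rejection: resample from max(0, p - q), renormalized,
        # and drop every later proposal.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # All k proposals accepted: take a bonus token from the verifier.
    out.append(int(rng.choice(p_probs.shape[1], p=p_probs[-1])))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = np.array([0.6, 0.3, 0.1])   # toy verifier distribution
    q = np.array([0.2, 0.5, 0.3])   # toy draft distribution
    x = rng.choice(3, p=q)          # the draft proposes one token
    print(speculative_step(np.stack([p, p]), q[None, :], [x], rng))
```

Running the step many times with tokens drawn from q and counting the first emitted token is a quick way to check empirically that its frequencies match p rather than q.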

Why it is a win

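The verifier processes k+1 positions in one forward pass at roughly the same cost as one position (decode is bandwidth-bound, and you read the weights and cache once either way). If each verifier pass emits L tokens on average (accepted drafts plus the correction), the speedup over plain decoding is roughly L / (1 + k * c), where c is the cost of one draft step relative to one verifier pass. Every pass pays for one verifier call and k draft steps, whether or not the drafts are accepted. Typical numbers: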

Setup	Draft	Verifier	Accepted tokens per pass	Speedup
Leviathan 2023 (T5-XXL)	T5-small	T5-XXL	~4.5 tokens	2-3x
Chen 2023 (Chinchilla 70B)	4B Chinchilla	70B Chinchilla	~3 tokens	2.0-2.5x
Medusa-2 on Vicuna-7B	n/a (heads)	Vicuna-7B	~2.5 tokens	2.2-3.6x

The acceptance rate, not the draft model's raw speed, is usually the dominant factor. A faster draft that disagrees with the verifier more often can be a worse choice than a slower one that agrees more.
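To see how acceptance and draft cost trade off, here is a rough back-of-the-envelope model. It assumes each drafted token is accepted independently with the same probability alpha (a simplification used in Leviathan et al. 2023) and charges every verifier pass for all k draft steps. The function names and the example numbers are illustrative assumptions, not measurements.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per verifier pass, assuming each of the k
    drafted tokens is accepted independently with probability alpha.
    Counts the correction (or bonus) token, so the result is in [1, k+1]."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def speedup(alpha, k, c):
    """Estimated speedup over plain decoding.
    c is the cost of one draft step relative to one verifier pass;
    each pass costs one verifier call plus k draft steps."""
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * c)

if __name__ == "__main__":
    # Hypothetical drafts: a very cheap one that agrees less often,
    # and a 5x more expensive one that agrees more often.
    print("fast, low agreement :", round(speedup(alpha=0.60, k=5, c=0.02), 2))
    print("slow, high agreement:", round(speedup(alpha=0.85, k=5, c=0.10), 2))
```

Under these assumed numbers the slower but better-aligned draft comes out ahead, because the gain in tokens per pass outweighs the extra draft cost.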

Medusa: single-model variant

Training a separate draft model is awkward in production: you need to keep two models in sync across fine-tunes, and serving infrastructure must hold both. Medusa (Cai et al., 2024) sidesteps this by adding k extra prediction heads to the verifier itself. Head i (i = 1, ..., k) predicts the token at position t + i + 1 directly from the final hidden state at position t, while the original LM head still predicts position t + 1. The heads' top predictions are combined into a set of candidate continuations, which are then verified with tree attention in a single forward pass of the base model.

Medusa-1 trains only the heads on a frozen backbone; Medusa-2 fine-tunes the heads jointly with the backbone. The trade-off is straightforward: you get speculation without a second model, at the cost of slightly more parameters and a quality dip if the heads are under-trained.
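A simplified sketch of how the extra heads produce candidates is shown below. It assumes each head is a plain linear projection of the hidden state (the heads in the paper also include a small residual block), takes the top few tokens per head, and enumerates their Cartesian product. The name `medusa_candidates` and the `top_s` parameter are hypothetical, and the tree-attention verification pass is not shown.

```python
import itertools
import numpy as np

def medusa_candidates(hidden, lm_head, medusa_heads, top_s=2):
    """Build candidate continuations from one hidden state (illustrative).

    hidden:       (d,) final hidden state at position t.
    lm_head:      (V, d) base output projection, predicts position t+1.
    medusa_heads: list of (V, d) matrices; head i (1-based) predicts
                  position t+i+1. Plain linear heads are an assumption.
    Returns tuples of token ids, each of length len(medusa_heads) + 1.
    """
    # The base model's own next token (greedy here for simplicity).
    first = int(np.argmax(lm_head @ hidden))
    per_head = []
    for W in medusa_heads:
        logits = W @ hidden
        top = np.argsort(logits)[::-1][:top_s]  # top_s tokens, best first
        per_head.append([int(t) for t in top])
    # Every combination of per-head picks is one path in the candidate tree.
    return [(first, *combo) for combo in itertools.product(*per_head)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, V = 8, 20                      # toy sizes
    hidden = rng.normal(size=d)
    lm_head = rng.normal(size=(V, d))
    heads = [rng.normal(size=(V, d)) for _ in range(3)]
    for cand in medusa_candidates(hidden, lm_head, heads, top_s=2):
        print(cand)
```

With k heads and top_s picks per head this yields top_s^k candidates, which is why practical setups prune the tree instead of enumerating the full product.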
