Speculative Decoding: How a Small Draft Model Makes Large Language Models Think Faster

June 04, 2026 · 24 min read

When GPT-3 answers a question, it reads all 175 billion of its parameters from memory for every single token it produces. For the sentence "The square root of seven is approximately 2.646," that means a separate round trip through hundreds of gigabytes of weights for each token: at least eight, one per word, and more once the tokenizer splits the number into pieces. The token "seven" (predictable from context) costs exactly as much wall-clock time as "2.646" (which requires actual computation). This is the fundamental inefficiency that speculative decoding attacks: not all tokens are equally hard, but autoregressive decoding treats them as if they are.

Why this matters: LLM inference is memory-bandwidth-bound, not compute-bound. Modern GPUs can perform hundreds of arithmetic operations for every byte they read, yet standard decoding uses a fraction of that capacity. Speculative decoding reclaims those wasted FLOPS by letting a cheap draft model propose several tokens that the expensive target model then verifies in one parallel forward pass, producing multiple tokens per target pass while provably preserving the exact output distribution.

TL;DR

  • Standard autoregressive decoding generates one token per forward pass through the full model, leaving most GPU compute idle because inference is bottlenecked by memory bandwidth, not arithmetic throughput.
  • Speculative decoding uses a small, fast draft model to propose \(\gamma\) candidate tokens, then verifies all of them in a single forward pass of the target model using a modified rejection sampling scheme.
  • The rejection sampling algorithm guarantees the output distribution is mathematically identical to standard decoding; this is not an approximation.
  • Acceptance rates of 60-80% are typical for well-matched draft-target pairs, yielding 2-3x wall-clock speedups with no quality loss.
  • Tree-structured speculation (SpecInfer, Sequoia) extends the chain to a branching tree of candidates, letting one verification pass accept the longest matching path among multiple alternatives.
  • Self-speculative methods (Medusa, EAGLE, layer-skipping) eliminate the need for a separate draft model entirely by using the target model's own internal representations to predict future tokens.
  • Production frameworks (vLLM, TensorRT-LLM, SGLang) now ship native speculative decoding support, and Google reports using it at scale in products like AI Overviews.
  • The technique struggles at high batch sizes where the inference regime shifts toward compute-bound; current research (MagicDec, SPIRe) is closing this gap.

At a Glance

flowchart LR
    A["Input prompt"] --> B["Draft model<br/>proposes gamma tokens"]
    B --> C["Target model<br/>scores all gamma + 1<br/>positions in parallel"]
    C --> D{"Rejection<br/>sampling"}
    D -->|"Accept"| E["Append accepted<br/>tokens to sequence"]
    D -->|"Reject at position k"| F["Resample token k<br/>from adjusted distribution"]
    E --> G["Output: 1 to gamma+1<br/>tokens per step"]
    F --> G

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class A blue
    class B,C purple
    class D amber
    class E,G emerald
    class F teal

[IMAGE: Side-by-side comparison of standard autoregressive decoding (one token per step, sequential) vs. speculative decoding (draft proposes 5 tokens, target verifies in one pass, 4 accepted), with a wall-clock timeline showing the latency difference]

Before Speculative Decoding

The idea that autoregressive generation is unnecessarily sequential did not arrive overnight. The path from "one token at a time" to "verify a batch of guesses" passed through several waypoints.

timeline
    title Evolution of Parallel Token Generation
    2017 : Transformer architecture introduced, attention is all you need
    2018 : Stern et al. propose blockwise parallel decoding for greedy generation
    2019 : Non-autoregressive translation models explore parallel generation with quality tradeoffs
    2022 : Leviathan et al. introduce speculative decoding with stochastic guarantees
    2023 : Chen et al. demonstrate speculative sampling at 70B scale with Chinchilla
    2024 : Tree-based methods (SpecInfer, Sequoia) and self-draft methods (Medusa, EAGLE) mature
    2025 : Production frameworks ship native support and EAGLE-3 achieves up to 5x speedups

The earliest predecessor was blockwise parallel decoding by Stern, Shazeer, and Uszkoreit at NeurIPS 2018 (Stern et al., 2018, Blockwise Parallel Decoding for Deep Autoregressive Models, arXiv:1811.03115). Their insight was simple: train auxiliary prediction heads on top of a Transformer to guess multiple future tokens, then verify them in parallel. The catch was that verification used greedy matching (accept the longest correct prefix), which only works for greedy (argmax) decoding. Sampling-based generation, where the model draws from a probability distribution rather than always picking the top token, could not use this scheme without changing the output distribution.

Non-autoregressive translation (NAT) models attacked the problem from a different angle, generating all tokens simultaneously but accepting a quality penalty. Mask-predict models such as CMLM narrowed that penalty through iterative refinement, but a gap to autoregressive quality persisted. The field wanted parallel generation without quality loss, and that required a new theoretical foundation.

[IMAGE: Timeline visualization showing the three lineages converging: blockwise parallel decoding (greedy), non-autoregressive models (approximate), and speculative decoding (exact sampling), with the key innovation at each branch point]

How Speculative Decoding Actually Works

The core mechanism rests on two observations and one algorithm. The observations are about hardware and task difficulty; the algorithm is a modified rejection sampling scheme that turns those observations into a provably lossless speedup.

The Memory Bandwidth Bottleneck

A single decoding step through a large Transformer reads every parameter from GPU memory once. For a 70B-parameter model stored in FP16, that is approximately 140 GB of data per token. An NVIDIA A100 has roughly 2 TB/s of memory bandwidth, so reading the full model takes about 70 ms, regardless of how little arithmetic is actually performed. (140 GB does not fit in a single A100's memory, so in practice the weights are sharded across several GPUs; the single-device figure just keeps the arithmetic simple.) The A100's peak compute throughput of 312 TFLOPS means it could perform roughly 22 trillion floating-point operations in that same 70 ms, yet a single decoding step for a 70B model needs only about 0.14 trillion (roughly two per parameter), well under 1% of that capacity.

This arithmetic intensity gap is the opportunity speculative decoding exploits. Verifying \(\gamma\) draft tokens requires reading the model weights once (the same cost as generating one token) but performing roughly \(\gamma + 1\) times the compute, one unit per scored position. Since compute is the cheap resource, verification is nearly free in wall-clock time.
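The "nearly free" claim can be checked with an idealized roofline model, in which a forward pass takes as long as the slower of streaming the weights and doing the arithmetic. This is a rough sketch under stated assumptions: the hardware numbers are the round figures quoted above, the cost of 2 FLOPs per parameter per token is an approximation, and attention and KV-cache traffic are ignored.

```python
# Idealized roofline: pass time = max(weight-streaming time, arithmetic time).
# Hardware numbers are the round figures from the text, not measurements.
PARAMS = 70e9          # target model parameters
BYTES_PER_PARAM = 2    # FP16
MEM_BANDWIDTH = 2e12   # bytes per second
PEAK_FLOPS = 312e12    # FLOP per second


def pass_time_s(tokens_scored: int) -> float:
    """Time for one forward pass that scores `tokens_scored` positions."""
    memory_time = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH
    # Assumption: about 2 FLOPs (one multiply, one add) per parameter per token.
    compute_time = 2 * PARAMS * tokens_scored / PEAK_FLOPS
    return max(memory_time, compute_time)


for tokens in (1, 6, 64, 256):
    print(f"{tokens:>4} positions -> {pass_time_s(tokens) * 1e3:6.1f} ms")
```

In this model, scoring 6 positions costs the same 70 ms as scoring 1, and the pass only gets slower once enough positions are packed into it (through long drafts or large batches) that arithmetic overtakes the weight read.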

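\[\text{Speedup} \approx \frac{\mathbb{E}[\text{tokens produced per round}]}{\text{time per round relative to one standard decoding step}} = \frac{\mathbb{E}[\tau]}{1 + c}\]

where \(\tau\) is the number of tokens produced in a round (the accepted draft tokens plus the one token the target model resamples or adds) and \(c\) is the cost of the round's draft phase relative to one target forward pass. With \(\gamma\) draft passes per round, each costing a fraction \(c_q\) of a target pass, \(c = \gamma c_q\).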

The Draft-Verify Loop

The algorithm proceeds in rounds. Each round has three phases:

Phase 1: Draft. A small, fast model \(M_q\) (the draft model) autoregressively generates \(\gamma\) candidate tokens \(x_1, x_2, \ldots, x_\gamma\), sampling each from its own distribution \(q(x_t | x_{<t})\). Because \(M_q\) is small (often 50-500x fewer parameters than the target), this is fast.

Phase 2: Verify. The large target model \(M_p\) runs a single forward pass over the entire candidate sequence, computing \(p(x_t | x_{<t})\) at all \(\gamma\) draft positions, plus the position after the last draft token, simultaneously. This is where the parallelism pays off: reading the target model's weights once scores all candidates.

Phase 3: Accept or Resample. For each position \(t\) from 1 to \(\gamma\), the algorithm decides whether to accept the draft token \(x_t\):

\[\text{Accept } x_t \text{ with probability } \min\!\left(1, \frac{p(x_t | x_{<t})}{q(x_t | x_{<t})}\right)\]

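If the target model assigns higher probability to the draft token than the draft model did, acceptance is guaranteed. If the target model assigns lower probability, the token is accepted with probability equal to the ratio. If every draft token is accepted, the algorithm samples one additional token from the target's distribution at position \(\gamma + 1\). Otherwise, on the first rejection at position \(k\), it discards the draft tokens after \(k\) and resamples token \(k\) from the adjusted residual distribution: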

\[p'(x) = \text{norm}\!\left(\max\!\left(0,\; p(x | x_{<k}) - q(x | x_{<k})\right)\right)\]

This resampling step is what makes the guarantee work. The combination of probabilistic acceptance and residual resampling produces tokens from exactly the distribution \(p\), not an approximation. The proof is a direct application of rejection sampling theory (see Leviathan et al., 2022, Section 3).
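The accept-or-resample phase can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `speculative_round` and the array layout are assumptions made here, and the toy distributions are random tables rather than outputs of real models conditioned on a prefix.

```python
import numpy as np


def speculative_round(p_probs, q_probs, draft_tokens, rng):
    """One accept/resample pass over gamma draft tokens.

    p_probs: array (gamma + 1, vocab), target distribution at each position
    q_probs: array (gamma, vocab), draft distribution at each position
    draft_tokens: the gamma tokens the draft model sampled from q_probs
    Returns the tokens emitted this round (between 1 and gamma + 1 of them).
    """
    out = []
    for t, x in enumerate(draft_tokens):
        p, q = p_probs[t], q_probs[t]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))
            continue
        # First rejection: resample from norm(max(0, p - q)) and stop.
        residual = np.maximum(p - q, 0.0)
        out.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return out
    # Every draft token was accepted: sample one bonus token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(int(rng.choice(len(last), p=last)))
    return out


# Toy usage with random distributions (hypothetical values).
rng = np.random.default_rng(0)
vocab, gamma = 8, 4
p_probs = rng.dirichlet(np.ones(vocab), size=gamma + 1)
q_probs = rng.dirichlet(np.ones(vocab), size=gamma)
draft_tokens = [int(rng.choice(vocab, p=q)) for q in q_probs]
tokens = speculative_round(p_probs, q_probs, draft_tokens, rng)
print(draft_tokens, "->", tokens)
```

Each round emits at least one token, so the loop never does worse than one token per target pass, and it emits \(\gamma + 1\) tokens when the draft agrees with the target throughout.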

[IMAGE: Step-by-step walkthrough of one draft-verify cycle: draft model produces 5 tokens with their probabilities, target model scores all 5, acceptance ratios computed, tokens 1-3 accepted, token 4 rejected and resampled from residual distribution]

Why the Distribution is Preserved

The mathematical guarantee is worth understanding precisely because it is the property that distinguishes speculative decoding from all approximate speedup methods. Consider a single token position. Let \(p(x)\) be the target distribution and \(q(x)\) be the draft distribution. The draft proposes token \(x\) sampled from \(q\). The probability of accepting \(x\) and it equaling some specific value \(v\) is:

\[\Pr[\text{accept } v] = q(v) \cdot \min\!\left(1, \frac{p(v)}{q(v)}\right) = \min(q(v), p(v))\]

The probability of rejection (for any token) is \(1 - \sum_v \min(q(v), p(v))\). On rejection, the resampled token comes from \(\text{norm}(\max(0, p - q))\). The normalizing constant of that residual is \(\sum_v \max(0, p(v) - q(v)) = 1 - \sum_v \min(q(v), p(v))\), which is exactly the rejection probability, so the two cancel and the rejection branch contributes \(\max(0, p(v) - q(v))\) to the probability of emitting \(v\). Adding the acceptance contribution gives \(\min(q(v), p(v)) + \max(0, p(v) - q(v)) = p(v)\). The output token follows distribution \(p\) regardless of what \(q\) looks like; a better \(q\) merely increases the acceptance rate and thus the speedup.
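The identity is easy to confirm numerically. The sketch below uses two made-up four-token distributions (the values are arbitrary) and adds the acceptance and resampling contributions for every token.

```python
import numpy as np

p = np.array([0.50, 0.30, 0.15, 0.05])   # target distribution (toy values)
q = np.array([0.25, 0.45, 0.10, 0.20])   # draft distribution (toy values)

accept = np.minimum(p, q)                 # Pr[draft proposes v and v is accepted]
reject_prob = 1.0 - accept.sum()          # Pr[the proposal is rejected]
residual = np.maximum(p - q, 0.0)
resample = residual / residual.sum()      # norm(max(0, p - q))

# Total probability of emitting each token: accepted, or rejected then resampled.
output = accept + reject_prob * resample
print(output)
```

The printed vector matches `p` even though `q` puts most of its mass on a different token; the mismatch only shows up as a higher rejection probability.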

flowchart TD
    A["Draft model samples token x ~ q"] --> B{"Compare p(x) vs q(x)"}
    B -->|"p(x) >= q(x)"| C["Always accept"]
    B -->|"p(x) < q(x)"| D["Accept with prob p(x)/q(x)"]
    D -->|"Accepted"| E["Use draft token"]
    D -->|"Rejected"| F["Resample from norm(max(0, p - q))"]
    C --> E
    E --> G["Output follows distribution p exactly"]
    F --> G

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

    class A blue
    class B amber
    class C,D purple
    class E,F teal
    class G emerald

Choosing the Draft Model

The draft model's quality directly controls the acceptance rate \(\alpha\). If \(\alpha\) is the average per-token acceptance probability, acceptances are treated as independent, and the draft proposes \(\gamma\) tokens, the expected number of tokens produced per round (accepted draft tokens plus the one token the target model supplies) is approximately:

\[\mathbb{E}[\tau] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}\]

For \(\alpha = 0.7\) and \(\gamma = 5\), this gives roughly 2.9 tokens per round. As \(\gamma\) grows, the expected tokens per round, and with it the achievable speedup, is bounded by \(1/(1 - \alpha)\). As a rule of thumb, \(\alpha\) needs to be above roughly 0.5 for speculative decoding to pay off, though the exact break-even point depends on the relative draft cost \(c\); below that, the overhead of running the draft model and the wasted verification cycles can actually slow things down.
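The closed form can be cross-checked against the expectation it summarizes. In the sketch below, both function names are illustrative, and the model assumes every draft token is accepted independently with the same probability \(\alpha\), which real workloads only approximate.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Closed form (1 - alpha^(gamma+1)) / (1 - alpha) for tokens per round."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_tokens_by_sum(alpha: float, gamma: int) -> float:
    """Same quantity, summed over where the first rejection falls."""
    # k drafts accepted, then a rejection: k + 1 tokens, prob alpha^k * (1 - alpha)
    total = sum((k + 1) * alpha**k * (1 - alpha) for k in range(gamma))
    # all gamma drafts accepted: gamma + 1 tokens, prob alpha^gamma
    return total + (gamma + 1) * alpha**gamma


for gamma in (1, 3, 5, 10, 50):
    print(gamma, round(expected_tokens(0.7, gamma), 3))
```

Sweeping \(\gamma\) this way shows the diminishing returns directly: the values climb quickly at first and then flatten as they approach \(1/(1 - \alpha)\).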

Good draft models share the target model's tokenizer and are trained on similar data. Common pairings include T5-Small (60M) drafting for T5-XXL (11B), LLaMA-68M for LLaMA-2-70B, and distilled variants specifically trained to approximate a larger model's distribution. The draft model's parameter count is typically 50-500x smaller than the target.

[IMAGE: Heat map of acceptance rates across token positions for a draft-target pair on a code generation task, showing high acceptance on boilerplate tokens and low acceptance on semantically critical tokens like variable names and logic operators]

Seeing It in Motion

Tree-Structured Speculation

Chain speculation proposes a single sequence of \(\gamma\) tokens. If the draft model makes a wrong guess at position 2, positions 3 through \(\gamma\) are wasted even if some of them would have been correct under a different prefix. Tree-structured speculation addresses this by proposing a branching tree of candidates, where each node represents an alternative continuation.

flowchart TD
    ROOT["Prompt<br/>(verified)"] --> A["the"]
    ROOT --> B["a"]
    A --> A1["quick"]
    A --> A2["lazy"]
    A1 --> A1a["brown"]
    A1 --> A1b["red"]
    A2 --> A2a["dog"]
    B --> B1["small"]
    B --> B2["large"]
    B1 --> B1a["cat"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class ROOT blue
    class A,B purple
    class A1,A2,B1,B2 teal
    class A1a,A1b,A2a,B1a emerald

The target model scores all nodes in a single forward pass using a tree attention mask that restricts each node's attention to its ancestors. Verification then walks the tree from root to leaves, accepting the longest valid path. SpecInfer (Miao et al., 2024, SpecInfer: Accelerating LLM Serving with Tree-based Speculative Inference and Verification, ASPLOS 2024) pioneered this approach using multiple small draft models, each contributing branches to the speculation tree, achieving 1.5-2.8x speedups for distributed inference and 2.6-3.5x for offloading scenarios.
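A tree attention mask is simple to construct once the tree is flattened. The sketch below encodes the example tree from the diagram as a parent-index list; that encoding and the function name `tree_attention_mask` are choices made for this illustration. Attention to the already-verified prompt, which every node also gets, is left out.

```python
import numpy as np

# Speculation tree from the diagram, flattened in breadth-first order.
# parents[i] is the index of node i's parent; -1 means "child of the prompt".
tokens = ["the", "a", "quick", "lazy", "small", "large", "brown", "red", "dog", "cat"]
parents = [-1, -1, 0, 0, 1, 1, 2, 2, 3, 4]


def tree_attention_mask(parents):
    """mask[i, j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask


mask = tree_attention_mask(parents)
for i, row in enumerate(mask):
    visible = [tokens[j] for j in np.flatnonzero(row)]
    print(f"{tokens[i]:>6} attends to {visible}")
```

Because "brown" sees only "the" and "quick", its score is the same as if that path had been fed to the model as an ordinary sequence, which is what lets one pass evaluate every branch at once.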

Sequoia (Chen et al., 2024, Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding, arXiv:2402.12374) refined tree speculation with a dynamic programming algorithm that optimizes tree topology for specific hardware. Rather than using a fixed tree shape, Sequoia computes the optimal depth and branching factor given the target hardware's memory bandwidth and compute throughput. On an A100, Sequoia achieved up to 4.04x speedup for LLaMA-2-7B and 3.73x for LLaMA-2-13B.

Self-Speculative Methods

The requirement for a separate draft model creates operational complexity: you need to train, store, and serve an additional model that shares the target's tokenizer and approximates its distribution. Self-speculative methods eliminate this requirement by deriving the draft from the target model itself.

sequenceDiagram
    participant P as Prompt
    participant D as Draft Mechanism
    participant T as Target Model (Full)
    participant O as Output

    Note over D: Medusa: Extra MLP heads
    Note over D: EAGLE: Feature-space predictor
    Note over D: Layer-skip: Early exit at layer k

    P->>D: Input tokens
    D->>D: Generate gamma draft tokens (fast)
    D->>T: Draft tokens + original prompt
    T->>T: Single forward pass, score all positions
    T->>O: Accept/reject via modified rejection sampling
    Note over O: 2-5 tokens produced per verification step

Medusa (Cai et al., 2024, Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv:2401.10774) adds multiple lightweight MLP heads on top of the target model's last hidden layer. Each head \(k\) predicts the token at position \(t + k\) given the hidden state at position \(t\). In the base variant (Medusa-1), the original model weights are frozen and only the new heads are fine-tuned; Medusa-2 fine-tunes the heads together with the backbone. Combined with tree-based attention for verification, Medusa reports speedups from about 2.2x (Medusa-1) to 2.3-3.6x (Medusa-2). The elegance is that no separate model exists; the draft heads are a few extra parameters bolted onto the existing model.

EAGLE (Li et al., 2024, EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv:2401.15077) takes a different approach: instead of predicting tokens directly, it predicts the target model's second-to-top-layer features. The insight is that autoregression in feature space is more predictable than in token space, because the feature representation captures richer contextual information. EAGLE uses a single lightweight Transformer layer that takes the current feature and the token one step ahead as input, producing the next feature from which a token is decoded. EAGLE-2 extended this with dynamic draft trees whose topology adapts based on the draft model's confidence at each node, achieving 2.5-5x speedups. EAGLE-3 (arXiv:2503.01840) replaced feature prediction with direct token prediction and feeds the draft head features fused from several layers of the target model rather than the top layer alone, reaching up to 4.79x on LLaMA-3.3-70B.

Layer-skipping methods derive the draft by executing only a subset of the target model's Transformer layers. Draft & Verify demonstrated that skipping a selected subset of the target's intermediate layers produces tokens of sufficient quality for speculation, with no retraining or architectural changes. The target model's full depth then verifies the drafts. This approach is operationally the simplest because it requires exactly zero additional parameters.
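The mechanics of layer skipping fit in a toy example. The sketch below stands in for a Transformer with a stack of small residual layers; the layer function, sizes, and the particular skip set are all hypothetical and chosen only to show that draft and target share one set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, HIDDEN = 8, 16
# Toy stand-in for Transformer blocks: residual updates h + tanh(h @ W).
weights = [rng.normal(scale=0.1, size=(HIDDEN, HIDDEN)) for _ in range(NUM_LAYERS)]


def forward(h, skip=frozenset()):
    """Run the layer stack, bypassing any layer whose index is in `skip`."""
    for i, w in enumerate(weights):
        if i in skip:
            continue  # the residual path carries h through unchanged
        h = h + np.tanh(h @ w)
    return h


h0 = rng.normal(size=HIDDEN)
full = forward(h0)                       # target: every layer
draft = forward(h0, skip={1, 3, 5, 7})   # draft: same weights, half the layers
print("relative gap:", np.linalg.norm(full - draft) / np.linalg.norm(full))
```

The draft pass touches half the weights, so in the memory-bound regime it costs roughly half a target pass, far more than a tiny separate draft model would. Which layers can be skipped without hurting acceptance is therefore the main tuning question for this family of methods.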

[IMAGE: Architecture comparison of three self-speculative approaches: Medusa (extra MLP heads on final layer), EAGLE (feature-space prediction with lightweight Transformer), and layer-skipping (early exit after layer k), showing parameter counts and where each taps into the target model]

By the Numbers

| Method | Draft Source | Speedup Range | Acceptance Rate | Extra Parameters | Training Required |
|---|---|---|---|---|---|
| Standard speculative decoding (Leviathan et al., 2022) | Separate small model | 2-3x | 0.6-0.8 | Full draft model | None (use existing model) |
| Speculative sampling (Chen et al., 2023) | Separate small model | 2-2.5x (Chinchilla 70B) | approximately 0.7 | Full draft model | None |
| SpecInfer (Miao et al., 2024) | Multiple small models + tree | 1.5-3.5x | Varies by branch | Multiple draft models | None |
| Sequoia (Chen et al., 2024) | Small model + optimized tree | 2.3-4.0x | Hardware-dependent | Full draft model | None |
| Medusa (Cai et al., 2024) | Extra MLP heads | 2.2-3.6x | approximately 0.7 | Approximately 0.5-2% of target | Fine-tune heads only |
| EAGLE-2 (Li et al., 2024) | Feature-space predictor | 2.5-5.0x | approximately 0.8 | 1 Transformer layer | Train predictor |
| EAGLE-3 (2025) | Feature-space predictor | Up to 4.79x | approximately 0.8 | 1 Transformer layer | Train with target hidden states |
| Lookahead decoding (Fu et al., 2024) | Jacobi iteration (no draft) | 1.5-1.8x (chat), up to 4x (code) | N/A (n-gram matching) | None | None |

Complexity analysis. Standard decoding for \(n\) tokens requires \(n\) sequential forward passes through the target model, each costing \(O(P)\) memory reads where \(P\) is the parameter count. Speculative decoding with acceptance rate \(\alpha\) and speculation length \(\gamma\) reduces the expected number of target model forward passes to approximately \(n \cdot (1 - \alpha) / (1 - \alpha^{\gamma+1})\), while adding \(\gamma\) cheap draft model passes per round, each of cost \(O(P_d)\) where \(P_d \ll P\).
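Plugging numbers into this analysis makes the trade concrete. The sketch below counts forward passes and compares total parameter reads; the function name is illustrative, and the model assumes a constant acceptance rate, \(\gamma\) draft passes in every round, and cost proportional to parameters read.

```python
def decoding_cost(n_tokens, alpha, gamma, target_params, draft_params):
    """Expected pass counts and the ratio of weight reads, standard vs. speculative."""
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    rounds = n_tokens / tokens_per_round      # one target pass per round
    draft_passes = gamma * rounds             # gamma draft passes per round
    standard_reads = n_tokens * target_params
    speculative_reads = rounds * target_params + draft_passes * draft_params
    return rounds, draft_passes, standard_reads / speculative_reads


rounds, draft_passes, read_ratio = decoding_cost(
    n_tokens=1000, alpha=0.7, gamma=5, target_params=70e9, draft_params=160e6
)
print(f"target passes: {rounds:.0f} (standard decoding needs 1000)")
print(f"draft passes:  {draft_passes:.0f}")
print(f"weight-read reduction: {read_ratio:.2f}x")
```

The draft runs more often than the target, but because each draft pass reads a tiny fraction of the bytes, the total is dominated by the target passes that were avoided.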

[IMAGE: Log-scale bar chart comparing tokens generated per second for Llama-2-70B across five methods: standard decoding, basic speculative decoding (2 draft models), Medusa, EAGLE-2, and EAGLE-3, annotated with the draft model size used in each case]

A Concrete Example

Consider a code completion task where the target model is LLaMA-2-70B (140 GB in FP16) and the draft model is a distilled 160M-parameter variant (320 MB). The speculation length is \(\gamma = 5\).

Setup. The user types: def fibonacci(n): and the model needs to generate a Python implementation.

Round 1: Draft phase. The draft model quickly generates 5 tokens:

| Position | Draft token | Draft probability \(q\) |
|---|---|---|
| 1 | \n | 0.92 |
| 2 | if | 0.78 |
| 3 | n | 0.85 |
| 4 | <= | 0.61 |
| 5 | 1 | 0.72 |

This takes approximately 1.5 ms total (5 sequential passes through 320 MB of weights).

Round 1: Verify phase. The target model runs one forward pass over all 5 draft positions plus the next, computing its own probabilities:

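| Position | Draft token | Target \(p\) | Draft \(q\) | Ratio \(p/q\) | Decision |
|---|---|---|---|---|---|
| 1 | \n | 0.95 | 0.92 | 1.03 | Accept (ratio >= 1) |
| 2 | if | 0.81 | 0.78 | 1.04 | Accept (ratio >= 1) |
| 3 | n | 0.88 | 0.85 | 1.04 | Accept (ratio >= 1) |
| 4 | <= | 0.44 | 0.61 | 0.72 | Uniform draw u = 0.55 < 0.72, so accept |
| 5 | 1 | 0.75 | 0.72 | 1.04 | Accept (ratio >= 1) |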

This verification pass takes approximately 70 ms (one read of the 140 GB model).

Result. All 5 draft tokens accepted, plus the target model generates one bonus token (:) from its distribution at position 6. Total: 6 tokens in approximately 72 ms, versus 6 x 70 ms = 420 ms for standard decoding. That is a 5.8x speedup for this round.

Round 2: Draft phase. The draft continues: \n, return, n, \n, return. Target verifies, accepts the first 3, rejects \n at position 4 (target prefers else), and resamples. Output: 4 tokens in approximately 72 ms.

Aggregate. Over the full generation (approximately 40 tokens for a compact fibonacci implementation), suppose each round yields 3.5 tokens on average. Standard decoding: 40 x 70 ms = 2.8 seconds. Speculative decoding: approximately 12 rounds x 72 ms = 864 ms. Effective speedup: 3.2x.
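The arithmetic in this example can be replayed in a few lines. The sketch below takes the probabilities, the uniform draw of 0.55, and the timing figures directly from the walkthrough; the variable names are made up for the illustration.

```python
import math

STEP_MS = 70.0    # one target forward pass, as quoted in the example
ROUND_MS = 72.0   # one draft-plus-verify round, rounded as in the example

# Round 1: (draft token, target p, draft q) from the verification table.
round1 = [("\\n", 0.95, 0.92), ("if", 0.81, 0.78), ("n", 0.88, 0.85),
          ("<=", 0.44, 0.61), ("1", 0.75, 0.72)]
UNIFORM_DRAW = 0.55   # the draw the example uses at position 4

accepted = 0
for token, p, q in round1:
    accept_prob = min(1.0, p / q)
    # A ratio of 1 or more accepts regardless of the draw.
    if accept_prob < 1.0 and UNIFORM_DRAW >= accept_prob:
        break
    accepted += 1

round1_tokens = accepted + 1                      # plus the bonus token
round1_speedup = round1_tokens * STEP_MS / ROUND_MS

# Aggregate: 40 tokens at an assumed 3.5 tokens per round.
rounds = math.ceil(40 / 3.5)
standard_ms = 40 * STEP_MS
speculative_ms = rounds * ROUND_MS

print(f"round 1: {round1_tokens} tokens, {round1_speedup:.1f}x")
print(f"overall: {rounds} rounds, {standard_ms / speculative_ms:.1f}x")
```

Changing `UNIFORM_DRAW` to any value of 0.73 or higher makes position 4 reject, which cuts round 1 to four tokens and shows how sensitive a single round is to one borderline token.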

[IMAGE: Token-by-token trace of the fibonacci example showing accepted tokens in green, rejected tokens in red, resampled tokens in blue, with timing annotations for each phase]

Where It Breaks

Speculative decoding is not a universal accelerator. Several conditions degrade or eliminate its benefit.

High batch sizes erode the advantage. The speedup depends on inference being memory-bandwidth-bound. At batch sizes above approximately 8-16, the arithmetic intensity increases and the GPU's compute units become the bottleneck instead of memory reads. Empirical measurements show EAGLE-based speculative decoding dropping from 1.3x at batch size 2 to 0.7x (slower than baseline) at batch size 48 on standard benchmarks. This is the single biggest limitation for high-throughput serving, where providers want to maximize tokens per GPU-second across many concurrent requests.

Low acceptance rates make it counterproductive. When the draft and target distributions diverge significantly (different training data, different model families, or highly creative, high-temperature sampling), acceptance rates drop below 0.5 and the overhead of running the draft model plus the wasted verification cycles exceeds the savings. Tasks requiring precise factual recall or mathematical reasoning tend to have lower acceptance rates than fluent text continuation.

Draft model selection is non-trivial. The draft model must share the target's tokenizer and approximate its distribution well enough to maintain high acceptance. Training a good draft model is itself an engineering investment. Off-the-shelf small models from the same family work reasonably well, but purpose-trained drafters consistently outperform them.

KV cache overhead grows. Both the draft and target model maintain their own key-value caches, so a separate draft model adds a second cache on top of the target's. For long-context workloads, this can be a significant memory pressure, especially when the draft model's cache is proportionally large relative to available GPU memory.
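How much the second cache adds depends on the draft model's shape. The sketch below estimates per-token KV cache size in FP16: the LLaMA-2-70B dimensions (80 layers, 8 grouped-query KV heads, head size 128) are the published configuration, while the draft model's dimensions are a hypothetical shape for a roughly 160M-parameter model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys and values, every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# LLaMA-2-70B: 80 layers, 8 KV heads (grouped-query attention), head size 128.
target = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
# Hypothetical ~160M draft model: 12 layers, 12 KV heads, head size 64.
draft = kv_bytes_per_token(layers=12, kv_heads=12, head_dim=64)

context = 32_768
gib = 1024 ** 3
print(f"target cache: {target * context / gib:.2f} GiB per sequence")
print(f"draft cache:  {draft * context / gib:.2f} GiB per sequence")
print(f"draft adds {draft / target:.0%} on top of the target's cache")
```

Under these assumptions the draft adds on the order of a tenth of the target's cache per sequence. A draft without grouped-query attention, or one with many layers, shifts that ratio noticeably, which is when the pressure described above becomes significant.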

Variable-length acceptance complicates batching. In a batch of requests, different sequences accept different numbers of tokens per round, creating ragged sequences that complicate the continuous batching strategies used by modern serving systems. Padding wastes compute; re-batching adds latency.

[IMAGE: Plot of effective speedup vs. batch size for speculative decoding, showing the crossover point where speculative decoding becomes slower than standard decoding, annotated with memory-bound vs. compute-bound regimes]

Alternative Designs

| Approach | Strengths | Weaknesses | Best when |
|---|---|---|---|
| Standard speculative decoding (separate draft model) | Provably lossless, no target model changes, works with any compatible draft | Requires maintaining a separate model, moderate acceptance rates | You have a well-matched small model and serve at low-to-medium batch sizes |
| Tree speculation (SpecInfer, Sequoia) | Higher acceptance through branching, hardware-aware optimization | More complex implementation, higher memory for tree attention | Acceptance rate is moderate and you can afford the engineering complexity |
| Medusa (extra heads) | No separate model, simple architecture, moderate training cost | Requires fine-tuning the heads, fixed tree structure | You can afford a short fine-tuning run and want operational simplicity |
| EAGLE/EAGLE-2/EAGLE-3 (feature-space prediction) | Highest speedups (up to 5x), dynamic tree topology, strong acceptance rates | Requires training the predictor layer, feature-space coupling | Maximum single-request latency reduction is the priority |
| Lookahead decoding (Jacobi iteration) | Training-free, no draft model, works with any model | Lower speedups than draft-based methods (1.5-1.8x typical) | You cannot train or store any additional components |
| Layer-skipping (self-speculative) | Zero extra parameters, no training, operationally trivial | Lower acceptance than trained methods, not all architectures support clean layer skipping | You need a quick win with zero infrastructure changes |
| Quantized target as draft (same model, lower precision) | Draft is "free" from the same model, good distribution match | Quantization artifacts reduce acceptance, still reads the model twice | You already have a quantized variant deployed |

[IMAGE: Radar chart comparing the six approaches across five axes: speedup magnitude, operational complexity, memory overhead, training requirement, and batch-size scalability]

How It Is Used in Practice

Google's December 2024 retrospective confirmed that speculative decoding is deployed in multiple Google products, specifically naming AI Overviews in Google Search as a beneficiary (Leviathan et al., 2024, Looking back at speculative decoding, Google Research Blog). The technique "remains a significant part of the optimizations" for large-scale products that are continuously optimized.

The serving framework ecosystem adopted speculative decoding as a first-class feature during 2024-2025. vLLM supports both draft-model-based speculation and EAGLE-style self-speculation, with AWS contributing P-EAGLE (parallel EAGLE) for higher throughput. TensorRT-LLM from NVIDIA includes speculative decoding as a built-in optimization, and Baseten published a detailed production deployment guide documenting their experience running it at scale. SGLang integrated EAGLE-3 as its default speculation method, and Snowflake's Arctic Inference project demonstrated some of the fastest speculative decoding results in vLLM.

Amazon SageMaker AI announced EAGLE-based adaptive speculative decoding as a native feature, automatically selecting speculation parameters based on workload characteristics.

The practical deployment pattern is converging: EAGLE-3 or a close variant as the speculation method, integrated into the serving framework rather than bolted on, with dynamic tree depth that adapts to the current request's acceptance rate. Organizations that cannot train a draft model default to layer-skipping or Lookahead decoding as zero-setup alternatives.

[IMAGE: System architecture diagram of a production speculative decoding deployment: load balancer, request router, GPU cluster with draft and target model co-located on each GPU, continuous batching scheduler, KV cache manager, and monitoring dashboard showing acceptance rates per request]

Insights Worth Remembering

  1. The bottleneck in LLM inference is reading model weights from memory, not computing with them. Speculative decoding exploits this by trading cheap compute for expensive memory reads.

  2. The rejection sampling guarantee is exact, not approximate. This is what makes speculative decoding fundamentally different from quantization, pruning, or distillation, all of which trade quality for speed.

  3. Acceptance rate is the single number that determines whether speculative decoding helps or hurts. Below approximately 0.5, the overhead dominates. The expected tokens per round follows \((1 - \alpha^{\gamma+1}) / (1 - \alpha)\); small improvements in \(\alpha\) compound rapidly.

  4. Self-speculative methods (Medusa, EAGLE) are winning the deployment race over separate-draft-model approaches because they eliminate the operational burden of maintaining, versioning, and co-deploying a second model.

  5. Tree speculation is strictly more powerful than chain speculation, but the improvement is largest when the per-token acceptance rate is moderate (0.5-0.7). At very high acceptance rates, the chain rarely needs alternatives.

  6. The technique's Achilles heel is high-batch-size serving. When batch size pushes inference into the compute-bound regime, the free compute that speculation relies on disappears. This is why speculative decoding helps latency more than throughput.

  7. Google's confirmation that speculative decoding is used in AI Overviews is significant because it validates the technique at a scale of billions of queries, not just academic benchmarks.

  8. The progression from separate draft models to self-speculation to training-free methods mirrors a recurring pattern in ML systems: the initial breakthrough requires extra infrastructure, then the community eliminates that requirement.

Open Questions

Can speculative decoding scale to high-throughput batch serving? MagicDec and SPIRe represent early attempts to make speculation work at batch sizes above 32, but the fundamental tension between memory-bound and compute-bound regimes remains. Whether architectural innovations (sparse attention during verification, speculative scheduling that groups requests by expected acceptance rate) can close this gap is an active area of investigation.

What is the optimal way to train a draft model? Knowledge distillation from the target is the current best practice, but the relationship between distillation objective, acceptance rate, and end-to-end speedup is not well characterized. EAGLE-3's approach of using the target's own hidden states during training suggests that tighter coupling between draft and target training may yield further gains.

How does speculative decoding interact with other optimizations? Quantization, KV cache compression, paged attention, and continuous batching all modify the inference regime. The interactions are not always additive; for example, quantizing the target model changes its distribution, which affects acceptance rates for a draft model trained against the unquantized version. Systematic study of these interactions is sparse.

Can speculation extend beyond text? Early work has applied the paradigm to image generation (diffusion models with speculative denoising) and speech synthesis. Whether the memory-bandwidth bottleneck and the "easy tokens" observation hold equally in these domains is not yet established, though initial results are promising.

Will hardware co-design change the picture? If future accelerators provide even higher compute-to-memory-bandwidth ratios, the opportunity for speculation grows. Conversely, if memory bandwidth scales faster than compute (unlikely given current trends), the technique becomes less relevant. The trajectory of hardware design will shape how important speculative decoding remains over the next five years.

