Prefill vs Decode

A 7B-parameter model on an A100 can saturate the GPU's tensor cores during prefill, then immediately become memory-bandwidth-limited once decode begins. These are not just two stages of the same job; they are two different workloads sharing the same hardware.

The Two Phases

Every autoregressive LLM inference call passes through exactly two phases.

Prefill takes the full prompt (P tokens) and processes all of them in a single forward pass. In a decoder-only model with causal masking, every token attends to itself and to every earlier token in the prompt. The attention computation is O(P²) in time, and the KV cache it produces is O(P) in memory. Because you are doing a large matrix multiplication - (batch × heads × P × d_head) × (d_head × d_model) - you are keeping the GPU's arithmetic units busy. This is a compute-bound workload.

Decode takes the last generated token and runs one forward pass to produce the next token. At each step, only a single new row is added to the KV cache; the rest is just read back from memory. The matrix multiplications shrink to (batch × heads × 1 × d_head), which is a vector-matrix product. The GPU spends most of its cycle budget waiting for weight tensors and KV cache to arrive from HBM, not performing arithmetic. This is a memory-bandwidth-bound workload.
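To make the two phases concrete, here is a minimal NumPy sketch of a single attention head, assuming a toy model with random weights. The names (causal_attention, k_cache, v_cache, w_q, w_k, w_v) are illustrative and not taken from any serving framework. It shows prefill building the KV cache in one pass, and decode appending one row and reading the rest back.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q has shape (T_q, d); k and v have shape (T_k, d). The T_q query rows
    are assumed to be the last T_q positions of the sequence.
    """
    t_q, d = q.shape
    t_k = k.shape[0]
    scores = q @ k.T / np.sqrt(d)                    # (T_q, T_k)
    # A query at absolute position p may only see keys at positions 0..p.
    q_pos = np.arange(t_k - t_q, t_k)[:, None]
    k_pos = np.arange(t_k)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
P, d = 6, 8                                          # toy sizes (assumption)
x = rng.standard_normal((P + 1, d))                  # P prompt tokens + 1 generated
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

# Prefill: one pass over all P prompt tokens; this builds the KV cache.
k_cache = x[:P] @ w_k
v_cache = x[:P] @ w_v
prefill_out = causal_attention(x[:P] @ w_q, k_cache, v_cache)   # (P, d)

# Decode: one new token; append one row to the cache, read the rest back.
x_new = x[P:P + 1]
k_cache = np.vstack([k_cache, x_new @ w_k])
v_cache = np.vstack([v_cache, x_new @ w_v])
decode_out = causal_attention(x_new @ w_q, k_cache, v_cache)    # (1, d)

# Reference: recompute everything from scratch over all P + 1 tokens.
full_out = causal_attention(x @ w_q, x @ w_k, x @ w_v)
print(np.allclose(decode_out[0], full_out[-1]))
```

The decode step touches a (1, d) query but still reads the entire cache, which is the small-compute, large-read pattern described above.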

The roofline model makes this crisp. Arithmetic intensity (FLOPs per byte transferred) determines which bound applies:

Arithmetic intensity = FLOPs / bytes_read

Prefill:  FLOPs ∝ 2 × P × d_model²   (large matrix product)
          bytes ∝ 2 × d_model²        (FP16 weights, loaded once)
          → intensity ≈ P FLOP/byte; exceeds the roofline ridge for long prompts

Decode:   FLOPs ∝ 2 × d_model²        (one token - vector×matrix)
          bytes ∝ 2 × d_model²        (still load the full FP16 weights)
          → intensity ≈ 1 FLOP/byte; far below the ridge

On an A100 SXM4 (80 GB), peak dense FP16 tensor-core compute is ~312 TFLOP/s and peak HBM2e bandwidth is ~2 TB/s. The ridge point therefore sits at roughly 312 / 2 = 156 FLOP/byte. Single-token decode at batch size 1 achieves ~1-2 FLOP/byte, meaning you are using roughly 1% of peak compute at best while potentially drawing 60-80% of peak bandwidth. You are not "wasting" compute; you are hitting a different ceiling entirely.
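The roofline arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a profiler: it assumes FP16 weights, counts only weight traffic (no activations or KV cache), and uses the peak figures quoted above. The function names are illustrative.

```python
PEAK_TFLOPS = 312.0      # A100 FP16 tensor-core peak, TFLOP/s
BANDWIDTH_TBS = 2.0      # A100 HBM2e peak, TB/s

def attainable_tflops(intensity, peak=PEAK_TFLOPS, bandwidth=BANDWIDTH_TBS):
    """Roofline: the lower of the compute roof and bandwidth × intensity.

    intensity is in FLOP/byte; TB/s × FLOP/byte gives TFLOP/s.
    """
    return min(peak, bandwidth * intensity)

def weight_matmul_intensity(tokens, bytes_per_weight=2):
    """FLOP/byte for a (tokens × d) @ (d × d) product, weight traffic only.

    Each weight contributes 2 FLOPs (multiply + add) per token and is read
    once, so the d² terms cancel.
    """
    return 2 * tokens / bytes_per_weight

ridge = PEAK_TFLOPS / BANDWIDTH_TBS   # FLOP/byte where the two roofs meet

for name, tokens in [("decode", 1), ("prefill P=512", 512), ("prefill P=2048", 2048)]:
    ai = weight_matmul_intensity(tokens)
    bound = "compute" if ai >= ridge else "memory-bandwidth"
    print(f"{name}: {ai:.0f} FLOP/byte, "
          f"roof {attainable_tflops(ai):.0f} TFLOP/s, {bound}-bound")
```

Under these assumptions the crossover happens when the number of tokens sharing one weight load reaches the ridge value, which is why batching decode requests raises decode intensity.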

Why This Matters for Serving Systems

Mixing prefill and decode in the same batch creates interference. A long prompt arriving mid-decode causes a "prefill stall": decode steps pause while the prefill request monopolises the GPU for tens or hundreds of milliseconds, spiking inter-token latency (ITL) for every request that is already decoding.
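A rough estimate shows where "tens or hundreds of milliseconds" comes from. The sketch below assumes a 7B-parameter FP16 model on the A100 figures used earlier, approximates a forward pass as 2 FLOPs per parameter per token, and ignores attention FLOPs, KV cache traffic, and kernel overhead, so both numbers are optimistic lower bounds. The function names are hypothetical.

```python
PARAMS = 7e9             # model parameters (assumption: 7B)
PEAK_FLOPS = 312e12      # A100 FP16 tensor-core peak, FLOP/s
BANDWIDTH = 2e12         # A100 HBM2e peak, bytes/s

def decode_step_seconds(params=PARAMS, bytes_per_weight=2, bandwidth=BANDWIDTH):
    """Lower bound for one decode step: every weight is read once from HBM."""
    return params * bytes_per_weight / bandwidth

def prefill_seconds(prompt_tokens, params=PARAMS, peak_flops=PEAK_FLOPS):
    """Lower bound for prefill: about 2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / peak_flops

step = decode_step_seconds()
stall = prefill_seconds(4096)
print(f"decode step  >= {step * 1e3:.1f} ms")
print(f"4096-token prefill >= {stall * 1e3:.1f} ms "
      f"(about {stall / step:.0f} decode steps)")
```

With these assumptions a decode step costs about 7 ms and a 4096-token prefill about 184 ms, so a decoding request that would normally see a token every few milliseconds waits the equivalent of a couple of dozen steps.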

Continuous batching (as in Orca / vLLM) mitigates head-of-line blocking by scheduling at the iteration level, so requests can join and leave the batch after every token step. Per-iteration scheduling overhead remains, however, and so does the prefill stall when prompts are long.

Chunked prefill (introduced in Sarathi and developed further in Sarathi-Serve) addresses the stall by splitting a large prefill into fixed-size chunks - say 512 tokens at a time - and interleaving them with decode steps. A large prompt no longer blocks decode for hundreds of milliseconds; it yields the GPU after each chunk. The tradeoff is that chunked prefill increases time-to-first-token for the chunked request itself (more scheduling rounds), while improving inter-token latency (ITL) for already-decoding requests.
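The tradeoff can be seen in a toy scheduler. This is a simplified sketch, not Sarathi-Serve's actual algorithm: it assumes a single incoming prompt, a fixed number of already-decoding requests, and one decode token per request per iteration. Iteration and plan_prefill are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Iteration:
    prefill_tokens: int   # prompt tokens processed in this iteration
    decode_tokens: int    # one token per already-decoding request

def plan_prefill(prompt_len, num_decoding, chunk_size=None):
    """Plan the iterations needed to prefill one prompt alongside decodes.

    chunk_size=None models unchunked prefill: the whole prompt runs in a
    single iteration, and decoding requests wait for all of it.
    """
    if chunk_size is None:
        chunk_size = max(prompt_len, 1)
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk_size, remaining)
        plan.append(Iteration(prefill_tokens=take, decode_tokens=num_decoding))
        remaining -= take
    return plan

unchunked = plan_prefill(4096, num_decoding=8)
chunked = plan_prefill(4096, num_decoding=8, chunk_size=512)

for name, plan in [("unchunked", unchunked), ("chunked", chunked)]:
    worst = max(it.prefill_tokens for it in plan)
    print(f"{name}: {len(plan)} iteration(s) before the first token, "
          f"at most {worst} prefill tokens between decode steps")
```

The largest prefill_tokens value in a plan is a proxy for the longest gap a decoding request sees between tokens, and the plan length is a proxy for the extra scheduling rounds the new request pays before its first token.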
