
Applied LLMs

Prefill vs Decode

LLM inference splits into two phases that stress the hardware differently: a compute-bound prefill that processes all prompt tokens in parallel, and a memory-bandwidth-bound decode that generates tokens one at a time. Each phase hits a different bottleneck on the same GPU.

intermediate · 7 min read · Premium

A 7B-parameter model on an A100 can saturate the GPU's tensor cores while prefilling a sufficiently long prompt, then become memory-bandwidth-limited as soon as decode begins. These are not just two stages of the same job; they are two different workloads wearing the same hardware.
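A rough roofline estimate makes the contrast concrete. The sketch below is a back-of-the-envelope calculation, not a benchmark. It assumes FP16 weights (2 bytes per parameter), about 2 FLOPs per parameter per token, and that weight reads dominate memory traffic; attention, the KV cache, and activations are ignored. The hardware figures (312 TFLOP/s dense FP16 and 1.555 TB/s for the 40 GB A100) are published peak numbers, and the function names are illustrative only.

```python
# Back-of-the-envelope roofline check for one forward pass.
# Assumptions (not from the article): FP16 weights, ~2 FLOPs per parameter
# per token, memory traffic dominated by reading the weights once per pass.

A100_PEAK_FLOPS = 312e12   # dense FP16 tensor-core peak, FLOP/s
A100_MEM_BW = 1.555e12     # HBM bandwidth of the 40 GB A100, bytes/s


def arithmetic_intensity(tokens_per_pass, params=7e9, bytes_per_param=2):
    """FLOPs performed per byte of weights read in one forward pass."""
    flops = 2 * params * tokens_per_pass    # one multiply-add per weight per token
    bytes_moved = params * bytes_per_param  # weights are read once per pass
    return flops / bytes_moved


def bottleneck(tokens_per_pass, peak_flops=A100_PEAK_FLOPS, mem_bw=A100_MEM_BW):
    """Classify a pass by comparing its intensity to the GPU's ridge point."""
    ridge = peak_flops / mem_bw  # about 200 FLOP/byte for these figures
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


print(bottleneck(2048))  # prefill: 2048 prompt tokens in one pass
print(bottleneck(1))     # decode: one new token per pass, batch size 1
```

Under these assumptions the intensity is simply the number of tokens processed per pass, so a 2048-token prefill sits well above the ridge point while single-sequence decode sits far below it. Batching several sequences during decode raises `tokens_per_pass` and moves decode toward the ridge.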

The Two Phases

Every autoregressive LLM inference call passes through exactly two phases.
