Prefill vs Decode
LLM inference splits into two phases with different hardware bottlenecks on the same GPU: a compute-bound prefill that processes all prompt tokens in parallel, and a memory-bandwidth-bound decode that generates tokens one at a time.
intermediate · 7 min read
A 7B-parameter model on an A100 can saturate the GPU's tensor cores during prefill, then immediately become memory-bandwidth-limited once decode begins. These are not just two stages of the same job; they are two different workloads wearing the same hardware.
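A rough roofline-style calculation makes the contrast concrete. The sketch below is a back-of-the-envelope estimate, not a benchmark. It assumes FP16 weights, batch size 1, about 2 FLOPs per parameter per token, and weights read from memory once per forward pass, and it ignores attention and KV-cache traffic. The A100 peak figures are approximate published numbers for the 40 GB part, the 2048-token prompt is an arbitrary example, and the function names are hypothetical.

```python
# Roofline-style estimate: is a phase compute-bound or memory-bound?
# All hardware numbers are approximate assumptions, not measurements.

PARAMS = 7e9                 # model parameters
BYTES_PER_PARAM = 2          # FP16 weights (assumed)
PEAK_FLOPS = 312e12          # A100 dense FP16 tensor-core peak, FLOP/s (approx.)
PEAK_BANDWIDTH = 1.555e12    # A100 40 GB HBM bandwidth, bytes/s (approx.)


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and a single read of the
    weights per pass; attention and KV-cache traffic are ignored.
    """
    flops = 2 * PARAMS * tokens_per_pass
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved


def bottleneck(tokens_per_pass: int) -> str:
    """Compare the pass's intensity with the GPU's ridge point."""
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where both limits meet
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory bandwidth"


if __name__ == "__main__":
    phases = [("prefill (2048-token prompt)", 2048), ("decode (1 token per step)", 1)]
    for name, tokens in phases:
        intensity = arithmetic_intensity(tokens)
        print(f"{name}: {intensity:.0f} FLOP/byte -> {bottleneck(tokens)}-bound")
```

Under these assumptions the intensity equals the number of tokens processed per pass, so a long prompt sits well above the A100's ridge point of roughly 200 FLOP/byte while a single decode step sits far below it.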
The Two Phases
Every autoregressive LLM inference call passes through exactly two phases.