KV-Cache Memory and Bandwidth
The key-value cache trades GPU memory capacity for inference speed, and understanding how that trade interacts with memory bandwidth is what separates fast serving systems from slow ones.
A 70B-parameter LLaMA model in fp16 weighs about 140 GB. Run it with a batch of 32 requests at sequence length 4096 and the KV cache alone adds another 80+ GB. That is as much memory as an entire 80 GB accelerator holds, and because the cache grows linearly with both batch size and sequence length, doubling either one pushes it past the model weights before you even start worrying about activations. That is the memory wall for transformer inference, and it is not a software bug you can patch away.
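The arithmetic behind a figure like this can be sketched in a few lines. The layer count (80), head dimension (128), and the two KV-head counts below are assumptions that do not come from this article: 8 KV heads corresponds to the grouped-query layout of the published Llama 2 70B configuration, and 64 to a hypothetical full multi-head layout. The helper name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV-cache size in bytes (fp16 by default).

    The leading factor of 2 counts one key tensor and one value
    tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed architecture values (not taken from the article).
LAYERS, HEAD_DIM = 80, 128

# Grouped-query attention: 8 KV heads shared across the query heads.
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, seq_len=4096, batch_size=32)

# Full multi-head attention: one KV head per query head (64 assumed).
mha = kv_cache_bytes(LAYERS, 64, HEAD_DIM, seq_len=4096, batch_size=32)

# Model weights: 70e9 parameters at 2 bytes each.
weights = 70e9 * 2

print(f"GQA cache: {gqa / 1e9:.1f} GB")      # 42.9 GB
print(f"MHA cache: {mha / 1e9:.1f} GB")      # 343.6 GB
print(f"Weights:   {weights / 1e9:.1f} GB")  # 140.0 GB
```

Under these assumptions the estimate ranges from roughly 43 GB to roughly 344 GB, which brackets the 80+ GB quoted above. The exact number depends on the KV-head count and the cache data type, but the linear scaling in batch size and sequence length holds either way.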
What the cache stores and why it must grow
Every transformer decoder layer computes attention over the full context. For a given token at position t, the attention mechanism needs the key and value projections for every previous position 0 … t-1. Without caching, you would recompute those projections from scratch on every new token: O(t) work per step, O(t²) total. The cache trades memory for a flat O(1) projection cost per step, at the price of storing one key and one value vector per layer for every token seen so far.
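A toy single-head, single-layer decoder makes the trade concrete. This is a minimal sketch in pure Python, not how any serving framework implements it: the 2-dimensional "token" vectors, the weight matrices, and the function names are all made up for illustration. Both loops produce the same attention outputs, but they differ in how many key/value projections they compute.

```python
import math


def project(x, w):
    """Multiply row vector x (length d) by matrix w (d x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x))
            for j in range(len(w[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_without_cache(tokens, wq, wk, wv):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        # Recompute K and V for every position 0..t on every step.
        keys = [project(x, wk) for x in tokens[: t + 1]]
        values = [project(x, wv) for x in tokens[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, projections


def decode_with_cache(tokens, wq, wk, wv):
    outputs, projections = [], 0
    k_cache, v_cache = [], []  # grows by one entry per token
    for x in tokens:
        # Project only the new token; reuse everything already cached.
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        projections += 2
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, projections


# Hypothetical inputs, chosen only to exercise the code.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.1], [-0.2, 0.3]]
wv = [[1.0, 2.0], [3.0, 4.0]]

slow_out, slow_count = decode_without_cache(tokens, wq, wk, wv)
fast_out, fast_count = decode_with_cache(tokens, wq, wk, wv)
print(slow_count, fast_count)  # 20 8
```

For four tokens the uncached loop performs 20 key/value projections and the cached loop performs 8. The gap widens quadratically with sequence length, while the cache lists grow by one entry per token. Note that `attend` still reads every cached entry on each step, so caching removes the recomputation but not the memory traffic.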