KV-Cache Memory and Bandwidth

The key-value cache trades GPU memory capacity for inference speed, and understanding how that trade interacts with memory bandwidth is what separates fast serving systems from slow ones.

A 70B-parameter LLaMA model in fp16 weighs about 140 GB. Run it with a batch of 32 requests at sequence length 4096 and the KV cache adds tens to hundreds of gigabytes on top: roughly 43 GB with the 8 key-value heads of Llama 2 70B's grouped-query attention, and over 340 GB if each of the 64 attention heads kept its own keys and values. Without that sharing, the cache exceeds the model weights before you even start worrying about activations. That is the memory wall for transformer inference, and it is not a software bug you can patch away.
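The cache size is simple arithmetic, and it is very sensitive to the number of key-value heads and the cache precision, so the total depends on which configuration you assume. The sketch below is a back-of-the-envelope calculator, not a measurement. The layer count, head counts, and head dimension are assumptions taken from the published Llama 2 70B configuration (80 layers, 64 attention heads, 8 key-value heads, head dimension 128). The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token.

    The leading 2 accounts for the two cached tensors (K and V).
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed Llama 2 70B-style shape (not stated in the article itself).
N_LAYERS, HEAD_DIM = 80, 128
BATCH, SEQ_LEN = 32, 4096

weights_bytes = 70e9 * 2  # 70B parameters at 2 bytes each

# Grouped-query attention: 8 key-value heads shared by 64 query heads.
gqa_bytes = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN, BATCH)

# Hypothetical multi-head variant: every one of the 64 heads caches its own K/V.
mha_bytes = kv_cache_bytes(N_LAYERS, 64, HEAD_DIM, SEQ_LEN, BATCH)

per_token_bytes = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, 1, 1)

print(f"weights:             {weights_bytes / 1e9:7.1f} GB")
print(f"KV cache (8 heads):  {gqa_bytes / 1e9:7.1f} GB")
print(f"KV cache (64 heads): {mha_bytes / 1e9:7.1f} GB")
print(f"per token (8 heads): {per_token_bytes / 1024:7.1f} KiB")
```

Halving `bytes_per_elem` (an 8-bit cache) or the number of key-value heads halves the total, which is why both are common levers when the cache does not fit.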

What the cache stores and why it must grow

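Every transformer decoder layer computes attention over the full context. For a given token at position t, the attention mechanism needs the key and value projections for every previous position 0 … t-1, plus the token's own. Without caching, you would recompute those projections from scratch on every new token: O(t) projection work per step, O(t²) in total. The cache trades memory for a flat O(1) projection cost per step, because only the new token's key and value are computed and appended. The price is that the cache grows by one key-value pair per layer for every token processed, and attention still has to read all of it on every step.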
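A toy single-head, single-layer decoder makes the trade concrete. This is a minimal sketch in pure Python, under simplifying assumptions: the per-token inputs `xs` are fixed up front rather than produced by earlier layers, the weights are random, and the helper names (`matvec`, `attend`, `decode_without_cache`, `decode_with_cache`) are invented for illustration. It counts how many token positions get a key/value projection in each variant and checks that both produce the same attention outputs.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def decode_without_cache(xs, Wq, Wk, Wv):
    outputs, kv_projections = [], 0
    for t in range(len(xs)):
        # Re-project every position 0..t on every step: O(t) per step.
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        kv_projections += t + 1
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs, kv_projections


def decode_with_cache(xs, Wq, Wk, Wv):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []  # grows by one entry per token
    for x in xs:
        # Project only the new token, then append: O(1) per step.
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        kv_projections += 1
        # Attention still reads the whole cache on every step.
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, kv_projections


random.seed(0)
d, n = 4, 16  # toy model width and number of tokens


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out_plain, cost_plain = decode_without_cache(xs, Wq, Wk, Wv)
out_cached, cost_cached = decode_with_cache(xs, Wq, Wk, Wv)

print("positions projected without cache:", cost_plain)
print("positions projected with cache:   ", cost_cached)
```

The uncached count is n(n+1)/2 and the cached count is n. The loop inside `attend` is unchanged between the two variants, so the saving is in projection compute only, while the memory held in `k_cache` and `v_cache` grows with every token.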
