KV-Cache Memory and Bandwidth

A 70B-parameter Llama model in fp16 weighs about 140 GB. Run it with a batch of 32 requests at sequence length 4096 and the KV cache adds roughly another 40 GB, even with only 8 KV heads per layer. That figure scales linearly with both batch size and context length, so at longer contexts or larger batches the cache overtakes the model weights before you even start worrying about activations. That is the memory wall for transformer inference, and it is not a software bug you can patch away.

What the cache stores and why it must grow

Every transformer decoder layer computes attention over the full context. For a given token at position t, the attention mechanism needs the key and value projections for every previous position 0 … t-1. Without caching, you would recompute those projections from scratch on every new token - O(t) work per step, O(t²) total. The cache trades memory for a flat O(1) projection cost per step.
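To make the O(t²) versus O(t) claim concrete, here is a minimal counting sketch. It does no real attention math; it only tallies how many key/value projections are computed per layer when generating a sequence, with and without a cache. The function names are hypothetical and chosen for illustration.

```python
def projections_without_cache(num_tokens: int) -> int:
    """Count K/V projections when every step recomputes all positions so far."""
    total = 0
    for step in range(1, num_tokens + 1):
        # At this step the context holds `step` positions, all re-projected.
        total += step
    return total


def projections_with_cache(num_tokens: int) -> int:
    """Count K/V projections when each position is projected once and stored."""
    cache = []  # stands in for the stored (key, value) pairs
    total = 0
    for position in range(num_tokens):
        cache.append(position)  # project only the newest position
        total += 1
    return total


if __name__ == "__main__":
    n = 4096
    print("without cache:", projections_without_cache(n))  # n * (n + 1) / 2
    print("with cache:   ", projections_with_cache(n))     # n
```

The cached version does one projection per step, but the list it appends to never shrinks, which is the memory side of the trade.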

The size of that trade is exact and predictable. For each token of a single request, the cache holds one key and one value vector per KV head per layer:

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

For Llama-3-70B in fp16 (80 layers, 8 KV heads, 128-dim heads, 2 bytes):

2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 0.32 MB per token

At 4096 tokens that is about 1.3 GB per request. With 32 concurrent requests it comes to about 43 GB (40 GiB) on top of the model weights. This arithmetic is why serving systems spend so much effort on memory management.
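The same arithmetic can be wrapped in a small helper, sketched here so the numbers can be rechecked for other models or batch shapes. The function names and the `LLAMA3_70B` dictionary are illustrative, not part of any library; the configuration values are the ones quoted above.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(seq_len: int, batch_size: int, **config: int) -> int:
    """Total KV cache bytes for `batch_size` requests of `seq_len` tokens each."""
    return kv_bytes_per_token(**config) * seq_len * batch_size


# Configuration quoted in the text: 80 layers, 8 KV heads, 128-dim heads, fp16.
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**LLAMA3_70B)
per_request = kv_cache_bytes(seq_len=4096, batch_size=1, **LLAMA3_70B)
batch_total = kv_cache_bytes(seq_len=4096, batch_size=32, **LLAMA3_70B)

if __name__ == "__main__":
    print(f"per token:   {per_token:,} bytes")
    print(f"per request: {per_request / 1e9:.2f} GB")
    print(f"batch of 32: {batch_total / 1e9:.1f} GB ({batch_total / 2**30:.0f} GiB)")
```

Because the formula is linear in every factor, doubling the context length or the batch size doubles the total, which is the property that makes the cache hard to budget for.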

The bandwidth bottleneck during autoregressive generation

Prefill (processing the prompt) is compute-bound: you have a full matrix of queries attending to all prompt positions simultaneously, which saturates tensor cores. Generation (producing tokens one at a time) is a completely different regime.

During generation, each forward pass processes a single new token against a growing cache. The compute load is tiny - one query vector per head per layer. But before any floating-point arithmetic happens, the GPU must stream the entire KV cache for that request from HBM (high-bandwidth memory) into SRAM. That transfer is proportional to the context length and happens on every single step.

Phase	Bottleneck	Typical utilisation
Prefill	Compute (tensor cores)	40-80% MFU
Generation (short context)	Memory bandwidth	< 10% MFU
Generation (long context)	Memory bandwidth (dominant)	< 5% MFU

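An A100 80GB SXM has roughly 2 TB/s of HBM bandwidth. A 70B model in fp16 is 140 GB of weights alone, so reading them once takes 70 ms at peak bandwidth. (The weights do not fit on one such GPU; treat this as an idealised single-device estimate at 2 TB/s.) Add a 4096-token KV cache for 32 requests (about 43 GB) and each generation step requires streaming roughly 180 GB, giving a theoretical lower bound of around 90 ms per step from bandwidth alone, where one step produces one token for each of the 32 requests. In practice achieved bandwidth is well below peak, so observed latency is higher.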
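The bound above is just bytes moved divided by bandwidth. A minimal sketch of that estimate follows, assuming a single idealised device that streams all weights and the whole batch's KV cache once per step at peak bandwidth. The function name is hypothetical, and the inputs are the figures from the text.

```python
def min_step_latency_ms(weight_bytes: float, kv_cache_bytes: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every byte is read once at peak bandwidth."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3


WEIGHT_BYTES = 140e9                       # 70B parameters in fp16
KV_BYTES = 2 * 80 * 8 * 128 * 2 * 4096 * 32  # 32 requests x 4096 tokens
BANDWIDTH = 2e12                           # assumed peak: 2 TB/s

weights_only_ms = min_step_latency_ms(WEIGHT_BYTES, 0, BANDWIDTH)
step_ms = min_step_latency_ms(WEIGHT_BYTES, KV_BYTES, BANDWIDTH)
# One step yields one token per request, so batch throughput is 32 / step time.
batch_tokens_per_s = 32 / (step_ms / 1e3)

if __name__ == "__main__":
    print(f"weights only:  {weights_only_ms:.1f} ms per step")
    print(f"weights + KV:  {step_ms:.1f} ms per step")
    print(f"batch of 32:   {batch_tokens_per_s:.0f} tokens/s at best")
```

The weight traffic is shared by every request in the batch, while the KV traffic grows with each additional request and each additional token of context, so the cache's share of the step time rises as either one increases.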
