Inference Optimisation
KV Cache
Why decoder inference is quadratic without a KV cache and linear with one, and why managing that cache is now the dominant memory problem in LLM serving.
intermediate · 8 min read
Autoregressive decoders generate one token at a time. Naively, each new token requires re-running attention over every prior token, which makes generating a sequence of length n cost O(n^2) FLOPs and read O(n^2) of activations from HBM. The KV cache is the trivial-once-you-see-it fix: at each step you stash the per-layer K and V tensors for every token you have already produced, so the next step only computes K and V for the single new token and reads the rest. That collapses the per-step cost from O(n) to O(1) (in terms of attention work over the past) and the whole sequence to O(n).
The memory bill
The cache size for a single sequence is:
bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes
The leading 2 is K + V. For fp16/bf16, dtype_bytes = 2. Some realistic numbers:
| Model | layers | kv_heads | head_dim | bytes/token | 8k context | 128k context |
|---|---|---|---|---|---|---|
| Llama-3-8B | 32 | 8 | 128 | 128 KiB | 1.0 GiB | 16 GiB |
| Llama-3-70B | 80 | 8 | 128 | 320 KiB | 2.5 GiB | 40 GiB |
| Llama-3-405B | 126 | 8 | 128 | 504 KiB | 3.9 GiB | 63 GiB |
| Mistral-7B (MHA) | 32 | 32 | 128 | 512 KiB | 4.0 GiB | 64 GiB |
Grouped-query attention (GQA) - where n_kv_heads is much smaller than n_query_heads - exists almost entirely to shrink this table. Multi-Query Attention (MQA, n_kv_heads = 1) is the extreme. DeepSeek's Multi-Head Latent Attention (MLA) goes further by storing a low-rank latent that K and V are projected from.
Why this dominates serving cost
On modern GPUs, attention during decode is memory-bandwidth-bound, not compute-bound. You read the full KV cache for every generated token. With Llama-3-70B at 32k context on an H100 (3.35 TB/s HBM), you upper-bound at roughly 3.35e12 / 10e9 = ~335 tokens/sec for a single stream - and that is before counting the weights you also have to read. Cache size directly caps both batch size (how many concurrent users fit in HBM) and per-stream throughput.
PagedAttention
In a classic allocator you reserve a contiguous slab for each sequence's max length. With variable-length traffic that wastes 60-80% of HBM to fragmentation and reservation. vLLM's PagedAttention borrows the OS virtual-memory trick: split the KV cache into fixed-size blocks (typically 16 tokens), keep a per-sequence block table that maps logical positions to physical blocks, and allocate blocks on demand. Waste drops to under 4% and you can serve 2-4x more concurrent requests on the same hardware (Kwon et al., SOSP 2023).
Prefix caching
If 10,000 users hit the same 2k-token system prompt, you should compute its KV once and share it. Prefix caching hashes block contents (and the prefix that led to them) so an incoming request can reuse any matching prefix already in HBM. The win is largest for chat with long system prompts, RAG with shared retrieved chunks, and few-shot prompts. vLLM, SGLang, and TensorRT-LLM all ship this; cache hit rates of 50-90% are common in production chat workloads.
Eviction
HBM is finite, so old blocks must be evicted when a new request arrives. The usual policy is LRU at the block level, with a few wrinkles:
- Blocks belonging to an in-flight generation cannot be evicted.
- Blocks pinned by a known-hot prefix (your system prompt) want a separate, longer-lived tier.
- Some systems offload cold blocks to CPU/NVMe rather than dropping them - cheap to keep, expensive to refetch.
When it falls down
- Very long context plus large batch. You run out of HBM before you run out of compute. Cache offloading, quantisation of the cache itself (KIVI, AWQ-KV at 4-bit), or MLA-style architectural changes are the levers.
- Decode-heavy workloads with low locality. If every request has a unique prompt, prefix caching gives you nothing; you are paying the bookkeeping overhead for free.
- Speculative decoding interactions. Speculation reads cache for both draft and verifier; the cache pressure goes up, not down.
Further reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention - the original vLLM paper with the page-table design and waste measurements.
- vLLM Automatic Prefix Caching design doc - hash scheme, block lifecycle, and LRU eviction as implemented.
- vLLM launch blog post - readable intro to PagedAttention with throughput numbers.