← Concept library

Inference Optimisation

KV Cache

Why decoder inference is quadratic without a KV cache and linear with one, and why managing that cache is now the dominant memory problem in LLM serving.

intermediate · 8 min read

Autoregressive decoders generate one token at a time. Naively, each new token requires re-running attention over every prior token, which makes generating a sequence of length n cost O(n^2) FLOPs and read O(n^2) of activations from HBM. The KV cache is the trivial-once-you-see-it fix: at each step you stash the per-layer K and V tensors for every token you have already produced, so the next step only computes K and V for the single new token and reads the rest. That collapses the per-step cost from O(n) to O(1) (in terms of attention work over the past) and the whole sequence to O(n).

The memory bill

The cache size for a single sequence is:

bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

The leading 2 is K + V. For fp16/bf16, dtype_bytes = 2. Some realistic numbers:

Model layers kv_heads head_dim bytes/token 8k context 128k context
Llama-3-8B 32 8 128 128 KiB 1.0 GiB 16 GiB
Llama-3-70B 80 8 128 320 KiB 2.5 GiB 40 GiB
Llama-3-405B 126 8 128 504 KiB 3.9 GiB 63 GiB
Mistral-7B (MHA) 32 32 128 512 KiB 4.0 GiB 64 GiB

Grouped-query attention (GQA) - where n_kv_heads is much smaller than n_query_heads - exists almost entirely to shrink this table. Multi-Query Attention (MQA, n_kv_heads = 1) is the extreme. DeepSeek's Multi-Head Latent Attention (MLA) goes further by storing a low-rank latent that K and V are projected from.

Why this dominates serving cost

On modern GPUs, attention during decode is memory-bandwidth-bound, not compute-bound. You read the full KV cache for every generated token. With Llama-3-70B at 32k context on an H100 (3.35 TB/s HBM), you upper-bound at roughly 3.35e12 / 10e9 = ~335 tokens/sec for a single stream - and that is before counting the weights you also have to read. Cache size directly caps both batch size (how many concurrent users fit in HBM) and per-stream throughput.

PagedAttention

In a classic allocator you reserve a contiguous slab for each sequence's max length. With variable-length traffic that wastes 60-80% of HBM to fragmentation and reservation. vLLM's PagedAttention borrows the OS virtual-memory trick: split the KV cache into fixed-size blocks (typically 16 tokens), keep a per-sequence block table that maps logical positions to physical blocks, and allocate blocks on demand. Waste drops to under 4% and you can serve 2-4x more concurrent requests on the same hardware (Kwon et al., SOSP 2023).

Prefix caching

If 10,000 users hit the same 2k-token system prompt, you should compute its KV once and share it. Prefix caching hashes block contents (and the prefix that led to them) so an incoming request can reuse any matching prefix already in HBM. The win is largest for chat with long system prompts, RAG with shared retrieved chunks, and few-shot prompts. vLLM, SGLang, and TensorRT-LLM all ship this; cache hit rates of 50-90% are common in production chat workloads.

Eviction

HBM is finite, so old blocks must be evicted when a new request arrives. The usual policy is LRU at the block level, with a few wrinkles:

  • Blocks belonging to an in-flight generation cannot be evicted.
  • Blocks pinned by a known-hot prefix (your system prompt) want a separate, longer-lived tier.
  • Some systems offload cold blocks to CPU/NVMe rather than dropping them - cheap to keep, expensive to refetch.

When it falls down

  • Very long context plus large batch. You run out of HBM before you run out of compute. Cache offloading, quantisation of the cache itself (KIVI, AWQ-KV at 4-bit), or MLA-style architectural changes are the levers.
  • Decode-heavy workloads with low locality. If every request has a unique prompt, prefix caching gives you nothing; you are paying the bookkeeping overhead for free.
  • Speculative decoding interactions. Speculation reads cache for both draft and verifier; the cache pressure goes up, not down.

Further reading