KV Cache

Autoregressive decoders generate one token at a time. Naively, each new token requires re-running attention over every prior token, which makes generating a sequence of length n cost O(n^2) FLOPs and read O(n^2) of activations from HBM. The KV cache is the trivial-once-you-see-it fix: at each step you stash the per-layer K and V tensors for every token you have already produced, so the next step only computes K and V for the single new token and reads the rest. That collapses the per-step cost from O(n) to O(1) (in terms of attention work over the past) and the whole sequence to O(n).

The memory bill

The cache size for a single sequence is:

bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

The leading 2 is K + V. For fp16/bf16, dtype_bytes = 2. Some realistic numbers:

Model	layers	kv_heads	head_dim	bytes/token	8k context	128k context
Llama-3-8B	32	8	128	128 KiB	1.0 GiB	16 GiB
Llama-3-70B	80	8	128	320 KiB	2.5 GiB	40 GiB
Llama-3-405B	126	8	128	504 KiB	3.9 GiB	63 GiB
Mistral-7B (MHA)	32	32	128	512 KiB	4.0 GiB	64 GiB

Grouped-query attention (GQA) - where n_kv_heads is much smaller than n_query_heads - exists almost entirely to shrink this table. Multi-Query Attention (MQA, n_kv_heads = 1) is the extreme. DeepSeek's Multi-Head Latent Attention (MLA) goes further by storing a low-rank latent that K and V are projected from.

Why this dominates serving cost

On modern GPUs, attention during decode is memory-bandwidth-bound, not compute-bound. You read the full KV cache for every generated token. With Llama-3-70B at 32k context on an H100 (3.35 TB/s HBM), you upper-bound at roughly 3.35e12 / 10e9 = ~335 tokens/sec for a single stream - and that is before counting the weights you also have to read. Cache size directly caps both batch size (how many concurrent users fit in HBM) and per-stream throughput.

PagedAttention

In a classic allocator you reserve a contiguous slab for each sequence's max length. With variable-length traffic that wastes 60-80% of HBM to fragmentation and reservation. vLLM's PagedAttention borrows the OS virtual-memory trick: split the KV cache into fixed-size blocks (typically 16 tokens), keep a per-sequence block table that maps logical positions to physical blocks, and allocate blocks on demand. Waste drops to under 4% and you can serve 2-4x more concurrent requests on the same hardware (Kwon et al., SOSP 2023).

Prefix caching

If 10,000 users hit the same 2k-token system prompt, you should compute its KV once and share it. Prefix caching hashes block contents (and the prefix that led to them) so an incoming request can reuse any matching prefix already in HBM. The win is largest for chat with long system prompts, RAG with shared retrieved chunks, and few-shot prompts. vLLM, SGLang, and TensorRT-LLM all ship this; cache hit rates of 50-90% are common in production chat workloads.

Eviction

HBM is finite, so old blocks must be evicted when a new request arrives. The usual policy is LRU at the block level, with a few wrinkles:

Blocks belonging to an in-flight generation cannot be evicted.
Blocks pinned by a known-hot prefix (your system prompt) want a separate, longer-lived tier.
Some systems offload cold blocks to CPU/NVMe rather than dropping them - cheap to keep, expensive to refetch.

When it falls down

Very long context plus large batch. You run out of HBM before you run out of compute. Cache offloading, quantisation of the cache itself (KIVI, AWQ-KV at 4-bit), or MLA-style architectural changes are the levers.
Decode-heavy workloads with low locality. If every request has a unique prompt, prefix caching gives you nothing; you are paying the bookkeeping overhead for free.
Speculative decoding interactions. Speculation reads cache for both draft and verifier; the cache pressure goes up, not down.