
Inference Optimisation

Paged Attention and the KV Memory Manager

How vLLM's PagedAttention treats the KV cache like OS virtual memory, cutting fragmentation waste from 60-80% to under 4% and roughly doubling to quadrupling serving throughput.

advanced · 8 min read · Premium

You know the KV cache is the memory bottleneck in LLM serving. What is less obvious is that, before 2023, most of that memory was never used. Kwon et al. measured production serving stacks (Orca-style systems built on FasterTransformer) losing 60-80% of KV cache memory, not to the cached keys and values themselves but to how the memory was allocated: contiguous slabs, sized for the worst case, that sat mostly empty. PagedAttention is the fix, and the framing is the whole trick. Stop treating the KV cache as one contiguous array per sequence; treat it as virtual memory, with fixed-size pages, a page table, and on-demand allocation. That single reframe is what let vLLM batch far more concurrent requests onto the same GPU and deliver roughly 2-4x the serving throughput.
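To make the virtual-memory analogy concrete, here is a minimal sketch of a per-sequence page table backed by a shared pool of fixed-size blocks. It is an illustration under stated assumptions, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, `append_token`, and `free_sequence` are hypothetical, the block size of 16 tokens is chosen only for the example, and the sketch tracks block IDs only, not the key and value tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


class OutOfBlocks(RuntimeError):
    """Raised when the physical pool has no free block left."""


@dataclass
class BlockAllocator:
    """Hypothetical free-list allocator over a fixed pool of physical blocks."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self) -> None:
        # Reverse order so that pop() hands out block 0 first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise OutOfBlocks("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """One request: a token count plus its logical-to-physical block table."""
    num_tokens: int = 0
    block_table: list = field(default_factory=list)


def append_token(seq: Sequence, allocator: BlockAllocator) -> tuple:
    """Reserve a slot for one new token; return (physical_block, slot_in_block)."""
    # A new physical block is claimed only when the current one is full,
    # which is the on-demand allocation that replaces worst-case reservation.
    if seq.num_tokens % BLOCK_SIZE == 0:
        seq.block_table.append(allocator.allocate())
    index = seq.num_tokens
    seq.num_tokens += 1
    return seq.block_table[index // BLOCK_SIZE], index % BLOCK_SIZE


def free_sequence(seq: Sequence, allocator: BlockAllocator) -> None:
    """Return every block of a finished sequence to the shared pool."""
    for block_id in seq.block_table:
        allocator.release(block_id)
    seq.block_table.clear()
    seq.num_tokens = 0
```

In this sketch a sequence's blocks need not be adjacent in the pool, and the only slack a sequence can hold is the unfilled tail of its last block. Blocks released by a finished sequence are immediately reusable by any other sequence.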

This concept assumes you already know what a KV cache holds and why it grows linearly with sequence length. The question here is narrower and more interesting: given a fixed pool of HBM, how do you hand it out to hundreds of concurrent, variable-length sequences without wasting most of it?
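A small back-of-the-envelope calculation shows why the allocation policy matters for that question. The sketch below compares the fraction of reserved token slots that actually hold tokens under two simplified policies: reserving a worst-case slab per sequence versus reserving whole fixed-size blocks on demand. The function names, sequence lengths, maximum length, and block size are all made up for illustration; they are not measurements from the paper, and the contiguous model ignores other sources of waste such as external fragmentation.

```python
import math


def contiguous_utilisation(lengths, max_len):
    """Used fraction when every sequence reserves max_len token slots up front."""
    return sum(lengths) / (len(lengths) * max_len)


def paged_utilisation(lengths, block_size):
    """Used fraction when each sequence reserves only whole blocks it has touched."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return sum(lengths) / reserved


# Hypothetical batch: four sequences of very different lengths.
lengths = [120, 300, 45, 800]

slab = contiguous_utilisation(lengths, max_len=2048)  # 1265 / 8192
paged = paged_utilisation(lengths, block_size=16)     # 1265 / 1280

print(f"contiguous: {slab:.1%} of reserved slots used")
print(f"paged:      {paged:.1%} of reserved slots used")
```

With these invented numbers the contiguous policy uses about 15% of what it reserves, while the paged policy uses about 99%, because each sequence can waste at most one partially filled block.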

Where a contiguous allocator bleeds memory
