Paged Attention and the KV Memory Manager

You know the KV cache is the memory bottleneck in LLM serving. What is less obvious is that, before 2023, most of the memory reserved for it was never used. Kwon et al. measured existing serving systems (FasterTransformer and Orca-style baselines) losing 60-80% of KV cache memory, not to the cached tokens themselves but to how the memory was allocated: contiguous slabs, sized for the worst case, that sat mostly empty. PagedAttention is the fix, and the framing is the whole trick. Stop treating the KV cache as an array; treat it as virtual memory, with pages, a page table, and on-demand allocation. That single reframe is what let vLLM batch far more concurrent requests onto the same GPU and reach 2-4x the throughput of those systems.
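To make the reframe concrete, here is a minimal sketch in plain Python, not vLLM's actual code or API. It compares worst-case contiguous reservation with on-demand block allocation. The names `contiguous_utilization` and `PagedKVAllocator`, the block size of 16 tokens, and the example sequence lengths are all assumptions chosen for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block; an assumed value for this sketch


def contiguous_utilization(actual_lens, max_len):
    """Fraction of reserved token slots that hold real tokens when every
    sequence reserves a contiguous slab of max_len slots up front."""
    reserved = len(actual_lens) * max_len
    return sum(actual_lens) / reserved


class PagedKVAllocator:
    """Toy page-table allocator (hypothetical, for illustration only).

    Physical memory is a pool of fixed-size blocks. Each sequence owns a
    block table mapping logical block index -> physical block id, and a new
    block is taken from the pool only when the previous one is full.
    """

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block on demand."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def lookup(self, seq_id, token_idx):
        """Translate a logical token index to (physical block, offset)."""
        if not 0 <= token_idx < self.seq_lens[seq_id]:
            raise IndexError(token_idx)
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def utilization(self):
        """Fraction of allocated slots that hold real tokens."""
        slots = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return sum(self.seq_lens.values()) / slots if slots else 0.0


if __name__ == "__main__":
    lens = [100, 250, 40, 600]  # made-up sequence lengths
    alloc = PagedKVAllocator(num_blocks=64)
    for seq_id, n in enumerate(lens):
        for _ in range(n):
            alloc.append_token(seq_id)
    print(f"contiguous (max_len=2048): {contiguous_utilization(lens, 2048):.1%}")
    print(f"paged (block_size=16):     {alloc.utilization():.1%}")
```

With these made-up lengths, the contiguous scheme fills roughly 12% of what it reserves, while the paged scheme wastes at most one partially filled block per sequence. The numbers are illustrative, not a reproduction of the paper's measurements.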

This concept assumes you already know what a KV cache holds and why it grows linearly with sequence length. The question here is narrower and more interesting: given a fixed pool of HBM, how do you hand it out to hundreds of concurrent, variable-length sequences without wasting most of it?

Where a contiguous allocator bleeds memory
