Paged Attention as a Memory Manager

Before PagedAttention, a 13B-parameter model running on an A100 80 GB card typically wasted between 60% and 80% of its KV-cache memory. The tokens were there; the GPU memory was available. The waste came from the same fragmentation that plagued early malloc implementations: blocks reserved up front for sequences that turned out shorter than expected, gaps between allocations that were too small to reuse, and no mechanism for two requests to share identical prefix cache entries. Fixing that was not an algorithmic problem. It was a memory-management problem, and the solution lifted the idea almost verbatim from operating-system virtual memory.

The KV-Cache Fragmentation Problem

During autoregressive decoding, every token's key and value projections must be retained for the full lifetime of the sequence. A naive implementation pre-allocates a contiguous slab of GPU memory large enough for the maximum sequence length the model supports. Three things go wrong.

Internal fragmentation. If the maximum context is 4096 tokens but the average request completes at 512 tokens, 87.5% of each reserved slab (3584 of 4096 token slots) is never touched.

External fragmentation. As requests finish and their slabs are freed, the free memory exists as scattered holes that cannot be combined to satisfy a new large request.

No prefix sharing. Two requests that start with the same system prompt independently store identical key-value tensors. There is no mechanism to deduplicate them.

The result is that throughput is limited not by compute but by the allocator's inability to pack live KV entries densely. Every wasted slot is memory that could have held another sequence in the batch.
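The first two failure modes can be put into numbers with a few lines of Python. This is only an illustrative sketch: the function names and the eight-slot free map are made up for the example, and a real allocator tracks byte ranges rather than a list of booleans.

```python
from itertools import groupby


def unused_fraction(max_len, actual_len):
    """Fraction of a max-length slab that the request never writes to."""
    return 1.0 - actual_len / max_len


def largest_free_run(free_map):
    """Length of the longest contiguous run of free slots (True = free)."""
    runs = (sum(1 for _ in group) for is_free, group in groupby(free_map) if is_free)
    return max(runs, default=0)


# Internal fragmentation: a 512-token request in a 4096-token slab.
print(unused_fraction(4096, 512))  # 0.875

# External fragmentation: four slots are free in total, but no two are adjacent,
# so a request that needs two contiguous slots cannot be placed.
free_map = [True, False, True, False, True, False, True, False]
print(sum(free_map), largest_free_run(free_map))  # 4 1
```

Half of the toy memory is free, yet the largest request it can serve is a single slot. That gap between total free space and usable free space is what the contiguous-slab design pays for.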

Virtual Memory as the Blueprint

The OS virtual memory system solved the same class of problem in 1962. Physical RAM is divided into fixed-size pages (typically 4 KB). Processes see a contiguous virtual address space. The kernel maintains a page table that maps virtual page numbers to physical page frames. Allocation is granular; the physical layout is invisible to the process; copy-on-write lets two processes share a physical page until one of them writes to it.

PagedAttention maps this design onto the KV cache with minimal translation:

OS concept	PagedAttention equivalent
Physical page frame	KV block (e.g. 16 tokens x head_dim x 2 x dtype bytes, per attention head and per layer)
Virtual page number	Logical block index in a sequence
Page table	Per-sequence block table stored on CPU
Copy-on-write	Forked sequences share physical KV blocks until they diverge
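To make the mapping concrete, here is a minimal sketch of the two operations the table implies: translating a token position through a block table, and copy-on-write when a forked sequence diverges. The names (Sequence, BlockPool, translate) and the block size of 16 are assumptions for illustration, not the API of any particular serving engine, and the sketch only tracks block ids rather than moving tensor data.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value)


@dataclass
class Sequence:
    # Index = logical block number, value = physical block id (the "page table").
    block_table: list = field(default_factory=list)


def translate(seq, token_pos):
    """Map a token position to (physical block id, offset inside the block)."""
    return seq.block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE


class BlockPool:
    """Hypothetical pool of physical KV blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}

    def alloc(self):
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def fork(self, parent):
        """Child shares every physical block with the parent; nothing is copied."""
        for block in parent.block_table:
            self.refs[block] += 1
        return Sequence(list(parent.block_table))

    def block_for_write(self, seq, logical):
        """Return a block that is safe to write, copying first if it is shared."""
        block = seq.block_table[logical]
        if self.refs[block] == 1:
            return block  # sole owner: write in place
        self.refs[block] -= 1
        new_block = self.alloc()  # a real system would also copy the KV data here
        seq.block_table[logical] = new_block
        return new_block


pool = BlockPool(num_blocks=4)
parent = Sequence([pool.alloc(), pool.alloc()])
child = pool.fork(parent)
pool.block_for_write(child, logical=1)  # child diverges in its last block
print(parent.block_table, child.block_table)
```

After the write, the two sequences still share the first physical block and differ only in the second, which mirrors how two processes share a page until one of them writes to it.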

The block size (number of tokens per KV block) is a compile-time constant. Typical values are 8, 16, or 32. Smaller blocks improve packing density; larger blocks improve GPU memory access coalescing. The tradeoff is hardware-specific and is usually tuned empirically.
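The packing side of this tradeoff is easy to estimate. The sketch below computes the per-block byte size and the unused slots left in each sequence's final block for the three typical block sizes. The sequence lengths, head_dim of 128, and 2-byte dtype are hypothetical values chosen for the example, and the coalescing side of the tradeoff is not modeled because it depends on the hardware.

```python
def kv_block_bytes(block_size, head_dim, dtype_bytes, num_heads=1, num_layers=1):
    """Bytes for one KV block: keys and values (the factor of 2) for block_size tokens."""
    return block_size * head_dim * 2 * dtype_bytes * num_heads * num_layers


def tail_waste(seq_len, block_size):
    """Unused token slots in the last, partially filled block of a sequence."""
    return -seq_len % block_size


lengths = [37, 512, 1000]  # hypothetical sequence lengths
for block_size in (8, 16, 32):
    wasted = sum(tail_waste(n, block_size) for n in lengths)
    size = kv_block_bytes(block_size, head_dim=128, dtype_bytes=2)
    print(block_size, size, wasted)
```

For these three lengths the wasted slots grow from 3 to 19 to 51 as the block size goes from 8 to 32, while the bytes moved per block double at each step. At most block_size - 1 slots are wasted per sequence, which is why the loss stays small compared with reserving a full-length slab.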

How the Kernel Executes Paged Attention

During the prefill phase, each new sequence is assigned logical block indices sequentially. The block manager allocates physical KV blocks on demand from a free-list and writes entries into the block table.
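The allocation path described above can be sketched as a small block manager. This is a simplified illustration under stated assumptions: the class name, method names, and the use of a deque as the free list are choices made for the example, and it omits everything a production engine needs beyond bookkeeping (GPU tensors, preemption, sharing).

```python
from collections import deque


class BlockManager:
    """Hypothetical block manager: a free list plus one block table per sequence."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_list = deque(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored so far

    def _take(self):
        if not self.free_list:
            raise MemoryError("no free KV blocks")
        return self.free_list.popleft()

    def prefill(self, seq_id, prompt_len):
        """Allocate just enough blocks to hold the prompt."""
        needed = -(-prompt_len // self.block_size)  # ceiling division
        if needed > len(self.free_list):
            raise MemoryError("not enough free KV blocks for this prompt")
        self.block_tables[seq_id] = [self._take() for _ in range(needed)]
        self.lengths[seq_id] = prompt_len

    def append_token(self, seq_id):
        """Allocate a new block only when the current last block is full."""
        if self.lengths[seq_id] % self.block_size == 0:
            self.block_tables[seq_id].append(self._take())
        self.lengths[seq_id] += 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_list.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]


mgr = BlockManager(num_blocks=8)
mgr.prefill("a", prompt_len=20)  # two blocks
mgr.prefill("b", prompt_len=16)  # one block, exactly full
mgr.append_token("b")            # token 17 triggers a second block
mgr.free("a")
mgr.prefill("c", prompt_len=80)  # five blocks, reusing the ones "a" released
print(mgr.block_tables)
```

Sequence "c" ends up with physical blocks that are not adjacent in memory, which is fine because the block table preserves their logical order. Freed blocks are reusable immediately, so the scattered holes of the contiguous design never form.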
