Paged Attention as a Memory Manager
PagedAttention borrows the OS virtual-memory paging model to eliminate KV-cache fragmentation, letting a single GPU serve far more concurrent requests than contiguous allocation allows.
Before PagedAttention, a 13B-parameter model running on an A100 80 GB card typically wasted between 60% and 80% of its KV-cache memory. The requests were waiting and the GPU memory was physically there; it simply could not be handed out. Most of the waste came from the same fragmentation that plagued early malloc implementations: blocks reserved upfront for sequences that turned out shorter than expected (internal fragmentation) and gaps between allocations that were too small to reuse (external fragmentation). On top of that, there was no mechanism for two requests to share identical prefix cache entries. Fixing that was not an algorithmic problem. It was a memory-management problem, and the solution lifted the idea almost verbatim from operating-system virtual memory.
The KV-Cache Fragmentation Problem
During autoregressive decoding, every token's key and value projections must be retained for the full lifetime of the sequence. A naive implementation pre-allocates a contiguous slab of GPU memory large enough for the maximum sequence length the model supports. Three things go wrong.
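To put rough numbers on the contiguous pre-allocation described above, here is a minimal sketch of the arithmetic. The model shape (40 layers, hidden size 5120, fp16 values) is an assumption chosen to resemble a 13B-class model, and the maximum length and per-request lengths are invented for illustration; none of these figures come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for this illustration."""
    num_layers: int
    hidden_size: int
    bytes_per_value: int  # 2 for fp16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Each layer stores one key vector and one value vector per token.
    return 2 * shape.num_layers * shape.hidden_size * shape.bytes_per_value


def reserved_vs_used(shape: ModelShape, max_seq_len: int, actual_lens: list[int]) -> tuple[int, int]:
    """Bytes reserved by naive max-length slabs vs. bytes holding real tokens."""
    per_token = kv_bytes_per_token(shape)
    reserved = per_token * max_seq_len * len(actual_lens)
    used = per_token * sum(actual_lens)
    return reserved, used


# Assumed 13B-class shape; not taken from the article.
shape = ModelShape(num_layers=40, hidden_size=5120, bytes_per_value=2)

# Four hypothetical requests, each given a 2048-token slab upfront.
reserved, used = reserved_vs_used(shape, max_seq_len=2048, actual_lens=[200, 512, 1024, 90])
wasted_fraction = 1 - used / reserved

print(f"KV cache per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")
print(f"Reserved: {reserved / 2**30:.2f} GiB, used: {used / 2**30:.2f} GiB")
print(f"Wasted: {wasted_fraction:.1%}")
```

Under these assumed numbers each token costs 800 KiB of cache, so four max-length slabs reserve 6.25 GiB while the tokens actually generated occupy well under a quarter of it. The exact fraction depends entirely on how far real sequence lengths fall below the reserved maximum.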