PagedAttention and Continuous Batching: How vLLM Stopped Wasting Your GPU

In 2023 a group at UC Berkeley measured something embarrassing about the way large language models were being served. On systems that were considered the best available, between 60% and 80% of the memory reserved for the key-value cache was never actually used (Kwon et al., 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, arXiv:2309.06180). The GPUs were not slow. The model weights were not too large. The memory was simply being thrown away by an allocation strategy borrowed, without much thought, from the way you would store a fixed-size tensor.

The fix did not come from a new attention variant or a bigger chip. It came from an idea that operating systems have used since the 1960s: stop demanding contiguous memory, page it instead. That single change, packaged as PagedAttention and paired with a scheduler that batches at the granularity of a single token step, is why a serving stack today can push two to four times the throughput of its 2022 predecessor at the same latency.

Why this matters: The cost of running an LLM in production is dominated by how many requests you can pack onto one GPU at once. Memory fragmentation, not compute, is usually what caps that number. PagedAttention attacks the memory problem directly, and continuous batching turns the freed memory into concurrent requests.

TL;DR

LLM serving is memory-bound during decoding, not compute-bound. The key-value (KV) cache for in-flight requests, not the model weights, is what limits how many requests fit on a GPU.
Naive serving reserves a contiguous slab sized for each request's maximum possible length. Most of that slab is never used, wasting 60-80% of KV-cache memory to internal and external fragmentation (Kwon et al., 2023).
PagedAttention splits the KV cache into fixed-size blocks (commonly 16 tokens) and lets a request's logical token sequence map to physically scattered blocks, exactly like OS virtual memory pages map to physical frames.
Continuous batching (iteration-level scheduling), introduced by Orca, lets the scheduler add and evict requests between individual decode steps rather than waiting for an entire batch to finish (Yu et al., 2022, Orca, OSDI 2022).
Together they raise serving throughput 2-4x over Orca and FasterTransformer at matched latency, with larger gains on longer sequences and on decoding strategies that share prefixes (Kwon et al., 2023).
The cost is real: a custom attention kernel, a block manager, and a scheduler that can preempt and recompute. The payoff is near-zero memory waste and copy-on-write prefix sharing.

At a Glance

flowchart LR
  R[Incoming requests] --> S[Iteration-level scheduler]
  S --> B[Block manager]
  B --> P[Paged KV cache in GPU blocks]
  P --> K[PagedAttention kernel]
  K --> O[Next-token logits]
  O --> S
  class R blue
  class S purple
  class B purple
  class P slate
  class K purple
  class O teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The scheduler picks which requests run each step, the block manager hands out physical KV blocks on demand, and the PagedAttention kernel reads keys and values from blocks that need not be contiguous. The loop runs once per generated token.

[IMAGE: Side-by-side GPU memory bar chart, one showing a 13B model with weights, activations, and a large striped "wasted KV reservation" region; the other showing the same model with a densely packed paged KV region and a much larger active batch]

Before vLLM

To see why paging mattered, you have to look at what an LLM actually stores while it generates text. A Transformer decoder caches the key and value vectors for every token it has already processed, so that each new token can attend to the full history without recomputing it. This KV cache grows by one slot per layer per token, every single step, and it is freed only when the request finishes.

[IMAGE: Diagram of a Transformer decoder generating tokens, with the KV cache shown as a growing stack of key/value slots, one row added per layer per generated token, annotated "freed only when the request ends"]

The size is not a rounding error. For a 13-billion-parameter model in half precision, the KV cache consumes roughly 800 KB for every token in a sequence (derived from 2 tensors x 40 layers x 5120 hidden dimensions x 2 bytes, the OPT-13B configuration). A single 2,048-token request therefore needs about 1.6 GB of cache. On an 40 GB A100 where the weights already occupy roughly 26 GB, you are fighting over the remaining 14 GB, and at 1.6 GB per long request you can hold fewer than ten of them.

The pre-2023 serving systems made this worse than it had to be. They allocated the KV cache the way you allocate a NumPy array: one contiguous block, sized to the request's maximum possible length. A request that might generate up to 2,048 tokens got a 2,048-token slab the moment it arrived, even if it would stop after 30 tokens. The unused remainder, called internal fragmentation, was dead memory for the request's whole lifetime. When requests finished and freed their slabs, the gaps they left rarely matched the size of the next request, producing external fragmentation. Measured across real workloads, only 20-40% of the reserved KV memory held actual token state (Kwon et al., 2023).

timeline
  title Evolution of LLM serving systems
  2017 : Transformer introduced (Vaswani et al.)
  2020 : Request-level batching; pad to longest in batch
  2022 : Orca adds iteration-level scheduling (OSDI)
  2023 : vLLM adds PagedAttention (SOSP); 2-4x throughput
  2024 : Paging and continuous batching become the default across serving stacks

There was a second, independent inefficiency on the scheduling side. Early systems batched at the request level: gather N requests, run them together until all N finish, then start the next batch. Because generations have wildly different lengths, the whole batch ran at the speed of its slowest member, and a freshly arrived request waited for the entire batch to drain before it could even start. The GPU sat idle holding finished sequences hostage to one straggler.

Orca attacked that scheduling problem first. Its insight, iteration-level scheduling, was to return control to the scheduler after every single token step instead of every request (Yu et al., 2022). A finished request leaves the batch immediately; a new one joins at the next step. Orca reported throughput up to 36.9x higher than a strong baseline at the same latency for its target workloads. But Orca still inherited the contiguous-reservation memory model, so it could not pack the batch as densely as the freed scheduling slots would allow. The memory ceiling remained.

How PagedAttention Actually Works

The core move is a refusal: PagedAttention refuses to store a request's KV cache in one contiguous span. Operating systems solved the analogous problem decades ago. A process sees a flat virtual address space, but the OS maps it onto physically scattered page frames through a page table, allocating frames only when the process touches them. PagedAttention applies exactly this structure to the KV cache.

Blocks, not slabs

The KV cache is divided into fixed-size blocks, each holding the keys and values for a fixed number of tokens (16 is the common default). A request no longer owns a contiguous region. Instead it owns a block table, a small array that maps the request's logical block numbers to physical block numbers somewhere in GPU memory. Logical block 0 might live at physical block 1,038; logical block 1 at physical block 7; they need not be adjacent.

Blocks are allocated lazily. A request that has generated 30 tokens holds two blocks (32 slots, 30 used). The most it ever wastes is the unfilled tail of its last block, at most 15 token-slots regardless of how long the request might eventually become. Internal fragmentation drops from "the entire unused tail of a 2,048-slot reservation" to "part of one block." External fragmentation disappears entirely, because every free block is the same size and interchangeable.

The attention kernel has to follow the pointers

Standard attention assumes keys and values sit in contiguous memory so it can stream them with a single strided read. Once the cache is paged, that assumption breaks. The contribution that makes the scheme work is a custom CUDA kernel that, for each query, walks the request's block table, gathers the relevant blocks, and computes attention over them as if they were contiguous.

For a query vector \(q\) at the current position attending over cached keys \(K\) and values \(V\), the kernel computes the familiar

\[\text{Attention}(q, K, V) = \text{softmax}\!\left(\frac{q K^\top}{\sqrt{d_k}}\right) V\]

but it materializes \(K\) and \(V\) block by block, indexing through the block table rather than a single base pointer plus offset. The arithmetic is unchanged; only the memory access pattern is paged. The cost is a slightly more complex kernel and an extra indirection per block, which in practice is cheap relative to the bandwidth saved by fitting a larger batch.

Because blocks are addressed indirectly, two requests can point their block tables at the same physical block. This is the second payoff. When you sample several completions from one prompt (parallel sampling, or beam search), the prompt's KV blocks are identical across all branches. PagedAttention stores them once and lets every branch reference them, with a copy-on-write step that clones a block only when a branch first writes into it, exactly as a forking process shares pages until one side mutates them. For beam search and parallel sampling, where prior systems duplicated the entire prompt cache per beam, this cuts memory for the shared prefix to a single copy.

flowchart TD
  subgraph Logical["Request logical view"]
    L0[Logical block 0] --> L1[Logical block 1] --> L2[Logical block 2]
  end
  BT[Block table] --> Logical
  subgraph Physical["Physical GPU blocks"]
    P7[Phys 7]
    P1038[Phys 1038]
    P12[Phys 12]
  end
  L0 -. maps to .-> P1038
  L1 -. maps to .-> P7
  L2 -. maps to .-> P12
  class L0,L1,L2 blue
  class BT purple
  class P7,P1038,P12 slate
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The logical sequence is dense and ordered; the physical blocks are scattered. The block table is the page table that hides the scattering from the model.

[IMAGE: Two-panel analogy figure, left panel an OS process with a page table mapping virtual pages to scattered physical frames, right panel an LLM request with a block table mapping logical KV blocks to scattered GPU blocks, drawn to emphasize the structural identity]

Continuous batching, on top

Paging frees memory, but something has to spend it. That something is continuous batching, the production form of Orca's iteration-level scheduling. At each decode step the scheduler looks at the pool of free blocks and the queue of waiting requests, and admits as many as will fit. When a request emits its end-of-sequence token, its blocks return to the free pool that same step and a waiting request can claim them immediately.

When demand exceeds memory, the scheduler preempts. It can evict a running request's blocks, either by swapping them to CPU memory or by discarding them and recomputing the cache later from the prompt. Eviction is all-or-nothing per request to keep the policy simple, and preempted requests are restored when blocks free up. This is the mechanism that lets the system run at the edge of its memory budget without crashing when a burst arrives.

Seeing It in Motion

The decode loop is a negotiation between the scheduler and the block manager, repeated once per token. The sequence below shows one step where a new request is admitted and a finished one is reclaimed.

sequenceDiagram
  participant C as Client
  participant Sc as Scheduler
  participant BM as Block manager
  participant K as PagedAttention kernel
  C->>Sc: New request (prompt)
  Sc->>BM: Request blocks for prompt
  BM-->>Sc: Allocated physical blocks
  Sc->>K: Run one decode step for the batch
  K-->>Sc: Next tokens for all active requests
  Sc->>BM: Free blocks of finished requests
  Sc->>C: Stream generated tokens
  Note over Sc,BM: Loop repeats every token

Notice that admission, execution, and reclamation all happen inside a single iteration. No request waits for the batch to drain. The contrast with request-level batching is the difference between a bus that leaves only when full and empties completely before the next boarding, and a turnstile that admits and releases people continuously.

The state of a single request over its life is small but worth making explicit, because preemption makes it a real state machine rather than a straight line.

stateDiagram-v2
  [*] --> Waiting
  Waiting --> Running: blocks available
  Running --> Running: emit token, maybe grow a block
  Running --> Preempted: memory pressure
  Preempted --> Waiting: blocks reclaimed
  Running --> Finished: end-of-sequence
  Finished --> [*]

A request can bounce between Running and Preempted several times under load. Each preemption either swaps its blocks to host memory or drops them for later recomputation, a tradeoff between PCIe bandwidth and redundant compute that the scheduler resolves by policy.

By the Numbers

The headline result is that vLLM, the system built on PagedAttention, serves 2-4x more requests per second than Orca and FasterTransformer at the same latency, with the gap widening for longer sequences, larger models, and shared-prefix decoding (Kwon et al., 2023). The memory accounting underneath that number is the more instructive figure.

Quantity	Naive contiguous serving	PagedAttention	Source
KV-cache memory actually used	20-40%	over 96%	Kwon et al., 2023
Memory wasted to fragmentation	60-80%	under 4%	Kwon et al., 2023
Internal fragmentation per request	up to full unused reservation	at most one partial block (under 16 tokens)	derived from block model
Throughput vs. Orca / FasterTransformer	1x baseline	2-4x at matched latency	Kwon et al., 2023
Continuous vs. static batching (OPT-13B)	1x	up to 23x	Anyscale, 2023

The 23x figure from Anyscale's independent benchmark deserves a caveat: it isolates the scheduling change on a specific workload (OPT-13B, a particular request-length distribution) and represents an upper bound, not a number you should expect on every deployment (Anyscale, 2023, How continuous batching enables 23x throughput). The vLLM paper's 2-4x is the more conservative, apples-to-apples comparison because it holds latency fixed and compares against systems that already had strong batching.

The complexity story is simple and unchanged by paging. Decoding is \(O(n)\) attention reads per token over a sequence of length \(n\), and the KV cache grows \(O(n)\) per request. Paging does not change the asymptotics; it changes the constant factor on memory waste, which is what binds in practice.

Method	Per-token attention cost	KV memory per request	Memory waste
Contiguous reservation	\(O(n)\)	\(O(L_{max})\) reserved up front	high, \(O(L_{max} - n)\)
Paged blocks	\(O(n)\)	\(O(n)\) grown on demand	bounded by one block

Here \(n\) is the current length and \(L_{max}\) the maximum the request might reach. The waste term is the whole game.

[IMAGE: Line chart of achievable batch size versus average sequence length, two curves; contiguous reservation collapses quickly while paged stays high, with a shaded gap labeled "concurrent requests gained"]

A Concrete Example

Walk one GPU through a realistic minute. Take an A100 with 40 GB, serving OPT-13B in fp16. The weights take roughly 26 GB, leaving about 14 GB for the KV cache. At 800 KB per token, that budget is about 17,500 token-slots. Use a block size of 16 tokens, so the pool holds about 1,090 physical blocks.

Three requests arrive:

Request	Prompt tokens	Will generate	Naive reservation	Paged blocks at step 40
A	500	1,500	2,000 slots reserved	34 blocks (540 tokens)
B	1,800	200	2,000 slots reserved	116 blocks (1,840 tokens)
C	50	30	2,000 slots reserved	6 blocks (80 tokens)

Under naive contiguous serving, all three reserve 2,000 slots up front: 6,000 slots committed against a 17,500-slot budget, so at most eight such requests fit and the GPU runs eight-wide. Yet at step 40, the requests are actually holding 540, 1,840, and 80 tokens of real state. Request C, which will stop in moments, is sitting on a 2,000-slot reservation it will use 80 slots of. That is internal fragmentation you can see.

Under PagedAttention, the same three requests hold 156 blocks (2,496 token-slots) of real, in-use cache, growing one block at a time only as they generate. The remaining roughly 15,000 slots stay in the free pool, available for new arrivals. Instead of eight concurrent requests the GPU now sustains several dozen, because each one consumes what it uses rather than what it might use.

[IMAGE: Stacked timeline of the GPU's free-block pool over 70 decode steps, showing blocks consumed as requests A, B, C grow and a sharp return of blocks to the pool when C finishes at step 70]

Now request C finishes at step 70. Its 6 blocks return to the pool that same step, and a queued request D claims them on step 71 without waiting for A or B. If a burst of arrivals exhausts the pool, the scheduler preempts request A, swaps its 40-odd blocks to host memory, serves the burst, then restores A when blocks free. No request is dropped; the system degrades by delaying, not failing.

The arithmetic that makes this concrete: a single shared 500-token prompt sampled four ways costs 2,000 token-slots under naive duplication but only 500 shared slots plus the divergent tails under copy-on-write block sharing, a 4x reduction on the prompt portion for that decoding pattern.

Where It Breaks

Paging is not free, and the places it strains are worth naming.

The custom attention kernel is the first. PagedAttention needs a hand-written kernel that gathers non-contiguous blocks, and that kernel must be maintained for every hardware target and every attention variant (grouped-query attention, sliding-window attention, and so on). When a new attention shape appears, the paged kernel has to catch up. A system that stored the cache contiguously could often lean on vendor libraries; a paged system owns more of its own low-level code.

[IMAGE: U-shaped tradeoff curve, x-axis block size from 4 to 64 tokens, two overlaid curves for wasted-tail memory (falling) and kernel inefficiency from table lookups (rising), with the combined cost minimized near 16]

Block size is a genuine tradeoff, not a free parameter. Small blocks (say 8 tokens) minimize the wasted tail but multiply the number of block-table lookups and reduce the contiguity the kernel can exploit, hurting kernel efficiency. Large blocks (say 32) read more efficiently but waste more on the partial tail and make sharing coarser. The common choice of 16 is a compromise, and the right value shifts with sequence-length distribution and hardware.

Preemption has a sharp edge under sustained overload. When memory is exhausted and the scheduler must evict, it pays either PCIe bandwidth (swap to host) or redundant FLOPs (recompute the cache from the prompt). Under a heavy, bursty load with long prompts, recomputation can dominate and throughput sags precisely when you need it most. Paging removes the memory ceiling but moves the failure mode into the scheduler's eviction policy.

Finally, the gains are workload-dependent. If your requests are uniformly short and arrive at a steady rate, contiguous reservation wastes little and the 2-4x shrinks. The win is largest exactly where the old model was worst: variable-length generations, long contexts, and prefix-sharing decode strategies. A benchmark that does not include those will undersell paging; a vendor benchmark that maximizes them will oversell it.

Alternative Designs

PagedAttention is now the default, but it is not the only way to manage KV memory, and a 2024 line of work argues paging traded one complexity for another.

Approach	Strengths	Weaknesses	Best when
Contiguous reservation (pre-2023)	Simple kernels, vendor libraries usable	60-80% memory waste, low batch size	Short, uniform requests; legacy stacks
PagedAttention (vLLM)	Near-zero waste, prefix sharing, high batch size	Custom kernel, block-size tuning, eviction policy	Variable-length, long-context, shared-prefix serving
vAttention	Keeps virtual contiguity via CUDA virtual memory; uses unmodified attention kernels	Newer, relies on low-level driver features	Wanting paging's memory wins without a custom kernel
RadixAttention (SGLang)	Automatic prefix sharing via a radix tree over the cache	Adds a tree to maintain; tuned for shared-prefix workloads	Agent and few-shot workloads with heavy prefix reuse

The vAttention work makes a pointed argument: PagedAttention forces a non-contiguous kernel only because it manages paging at the application level, when the GPU's own virtual-memory machinery can provide virtual contiguity over physical fragments and let you keep the standard, faster attention kernel (Prabhu et al., 2024, vAttention, arXiv:2405.04437). RadixAttention, in SGLang, pushes the prefix-sharing idea further by indexing cached prefixes in a radix tree so reuse is automatic across requests rather than only within a sampling group (Zheng et al., 2023, SGLang / RadixAttention, arXiv:2312.07104). Neither erases paging's core insight; both refine where and how the indirection lives.

How It Is Used in Practice

vLLM moved from research artifact to default serving layer with unusual speed. Its source is open (github.com/vllm-project/vllm), and the PagedAttention-plus-continuous-batching pattern was absorbed by the broader ecosystem, including Hugging Face's text-generation-inference and NVIDIA's TensorRT-LLM, which adopted paged KV cache management of their own. When a hosted inference provider quotes you a per-million-token price, the economics behind that price almost certainly assume paged memory and continuous batching, because without them the achievable batch size, and therefore the cost per token, would be several times worse.

The operational considerations are the ones the failure modes predict. Operators tune block size and the fraction of GPU memory handed to the KV pool against their real traffic. They watch the preemption rate as a saturation signal: a rising rate means the memory budget is the bottleneck and it is time to add a replica or shorten max lengths. They lean on prefix sharing deliberately in agent and retrieval workloads, where many requests share a long system prompt, because that is where copy-on-write blocks turn into the largest savings. The same mechanism that looks like an OS trick in a paper becomes, in production, the dial that sets the cloud bill.

[IMAGE: Architecture diagram of a production serving deployment, showing a load balancer fanning to multiple vLLM replicas, each with a GPU holding model weights and a paged KV pool, annotated with the preemption-to-host path]

Insights Worth Remembering

LLM decoding is bound by memory, not arithmetic. The question "how many requests fit?" is usually "how much KV cache fits?", and answering it well is worth more than a faster matmul.
The expensive resource is not the model weights, which are loaded once and shared; it is the per-request KV cache, which is private, grows every step, and is freed only at the end.
Fragmentation, not capacity, was the real enemy. The GPUs of 2022 had enough memory; the allocation strategy squandered it.
Borrowing the operating-system idea of paging worked because the KV cache has the same shape as a process address space: a logically dense sequence that does not need to be physically contiguous.
Two orthogonal ideas compound here. Continuous batching frees scheduling slots; paging frees the memory to fill them. Either alone leaves throughput on the table.
Indirection that enables sharing is the quiet win. Once blocks are addressed through a table, copy-on-write prefix sharing falls out almost for free, and it is decisive for beam search and shared-prompt workloads.
The headline multiplier is workload-shaped. Quote 2-4x for mixed realistic traffic and treat any single large number as an upper bound for a specific benchmark.

Open Questions

The settled part is the memory model: paging the KV cache is now standard, and the measured throughput gains over contiguous reservation are well replicated. What remains open sits at the edges.

Whether application-level paging or hardware-level virtual memory is the right home for the indirection is genuinely unresolved. vAttention's claim that you can keep a standard kernel by leaning on CUDA virtual memory is promising, but the comparison depends on driver behavior and has not yet displaced PagedAttention in practice (Prabhu et al., 2024). Expect this to be contested as serving stacks mature.

Eviction policy is an active design space rather than a solved one. The choice between swapping blocks to host memory and recomputing them is currently made by simple heuristics; a policy that predicts request length or accounts for the cost of recomputation could plausibly do better, but no approach has clearly won. It is an open question how much a smarter scheduler is worth on top of paging.

The interaction with longer contexts is the largest unknown. As context windows stretch toward and past a million tokens, the KV cache for a single request can exceed a GPU's memory entirely, and paging blocks within one device is no longer enough. Distributed KV caches across devices, hierarchical caches spilling to host and disk, and compression or eviction of old tokens are all being explored, and which combination wins is not yet decided. The 2023 insight solved fragmentation on one GPU; the next round is about a cache too large for any single one.

Sources and Further Reading