Inference Optimisation

vLLM and Continuous Batching

Why static batching wastes most of your GPU on variable-length workloads, and how iteration-level scheduling combined with PagedAttention raises throughput by an order of magnitude.

A serving stack that processes requests one at a time leaves an H100 more than 90% idle, because a single decode step is bound by memory bandwidth rather than compute. The obvious fix is to batch, but LLM requests do not look like classic ML inference: prompts vary from 10 to 100,000 tokens, generations vary from 1 to 4,000 tokens, and requests arrive at random times. Static batching - waiting for a fixed window and padding everything to the longest sequence - throws most of that batching win away. Continuous batching is the scheduling change that recovers it, and PagedAttention is the memory-management change that makes continuous batching practical.

Why static batching wastes the GPU

In static batching you collect N requests, pad inputs to the longest prompt, and run them through the model together. Two problems:

  1. Padding waste. A batch of one 4k prompt and seven 200-token prompts is roughly 83% pad tokens (about 5,400 real tokens in 32,000 padded positions), and the model spends compute on every one of them.
  2. Tail blocking. The whole batch finishes when the longest generation finishes. A 4000-token completion holds up seven 50-token completions for the duration.
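
Both effects can be put into numbers with a few lines of arithmetic. The sketch below counts wasted token positions, which is a simpler proxy than attention FLOPs; the function names and the example batch are illustrative assumptions, not part of any serving library.

```python
def padding_waste(prompt_lens):
    """Fraction of positions in a padded prefill batch that hold pad tokens."""
    padded_slots = max(prompt_lens) * len(prompt_lens)
    return 1 - sum(prompt_lens) / padded_slots


def tail_blocking_waste(gen_lens):
    """Fraction of decode slots occupied by sequences that already finished."""
    total_slots = max(gen_lens) * len(gen_lens)
    return 1 - sum(gen_lens) / total_slots


# Example batch in the spirit of the two problems above (assumed numbers).
prompts = [4000] + [200] * 7      # one long prompt, seven short ones
generations = [4000] + [50] * 7   # one long completion, seven short ones

print(f"padding waste:       {padding_waste(prompts):.1%}")
print(f"tail-blocking waste: {tail_blocking_waste(generations):.1%}")
```

Either number alone means most of the batch is doing no useful work; in a real static batch the two losses compound.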

In Anyscale's 2023 benchmark, static batching reached as little as roughly 1/20th of the throughput that continuous batching achieved on the same hardware.

Continuous (in-flight) batching

The Orca paper (Yu et al., OSDI 2022) introduced iteration-level scheduling: instead of batching at the request level, batch at the decoding-step level. Each iteration the scheduler:

  1. Picks all currently in-flight sequences.
  2. For each one, runs exactly one decode step.
  3. Removes any sequence that emitted EOS or hit max length.
  4. Admits new sequences from the queue if HBM has room.

A request that finishes early frees its slot immediately. A new request joins the batch on the next iteration without waiting. Throughput rises because the GPU stays busy and padding disappears.
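
The loop above can be sketched as a toy simulation. This is a minimal model under stated assumptions, not vLLM's or Orca's scheduler: a fixed slot count (`max_batch`) stands in for "HBM has room", every iteration is treated as costing the same, and prefill is ignored. All names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(gen_lens, max_batch):
    """Count decode iterations under iteration-level scheduling.

    gen_lens: tokens each queued request will generate (each >= 1).
    max_batch: slot limit, a stand-in for available KV-cache memory.
    """
    queue = deque(gen_lens)
    running = []  # tokens still to generate for each in-flight sequence

    def admit():
        while queue and len(running) < max_batch:
            running.append(queue.popleft())

    admit()
    steps = 0
    while running:
        # Steps 1-2: one decode step for every in-flight sequence.
        for i in range(len(running)):
            running[i] -= 1
        steps += 1
        # Step 3: retire sequences that are done.
        running[:] = [r for r in running if r > 0]
        # Step 4: fill freed slots from the queue.
        admit()
    return steps


def static_batching_steps(gen_lens, max_batch):
    """Each fixed batch runs until its longest generation finishes."""
    return sum(
        max(gen_lens[i:i + max_batch])
        for i in range(0, len(gen_lens), max_batch)
    )


workload = [400, 50, 50, 50] * 4  # assumed mix of long and short completions
print("static:    ", static_batching_steps(workload, max_batch=4))
print("continuous:", continuous_batching_steps(workload, max_batch=4))
```

On this toy workload the static schedule needs 1,600 iterations and the continuous one 700, because short requests stop waiting behind the long completion that shares their batch.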

The complication is that prefill (processing the initial prompt) and decode (generating one token) have very different shapes: prefill pushes the whole prompt through the model in one pass, while decode handles a single token per sequence. vLLM's scheduler interleaves them; SGLang and TensorRT-LLM use "chunked prefill" (also available in vLLM), where long prompts are sliced and processed alongside ongoing decodes so that no single prefill blocks the batch.
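
A rough sketch of the chunking idea, assuming a per-step token budget and a decode-first policy (both are assumptions of this example; real schedulers differ in detail, and the function names are hypothetical):

```python
def plan_step(num_decoding, prefill_remaining, token_budget):
    """Split one iteration's token budget between decodes and a prefill chunk.

    num_decoding: sequences in the decode phase (one token each).
    prefill_remaining: prompt tokens of a waiting request not yet processed.
    Returns (decode_tokens, prefill_chunk).
    """
    decode_tokens = min(num_decoding, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk


def steps_to_prefill(prompt_len, num_decoding, token_budget):
    """Iterations needed to finish a long prompt while decodes keep running."""
    remaining, steps = prompt_len, 0
    while remaining > 0:
        _, chunk = plan_step(num_decoding, remaining, token_budget)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        remaining -= chunk
        steps += 1
    return steps


# A 100,000-token prompt arriving while 32 sequences are decoding.
print(steps_to_prefill(100_000, num_decoding=32, token_budget=2048))
```

The long prompt is spread over many iterations, and every one of those iterations still advances the 32 running decodes by a token.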

PagedAttention, the enabler

Continuous batching is only useful if you can actually fit many concurrent sequences in HBM. With a contiguous-allocation KV cache, you reserve max_seq_len per sequence and waste 60-80% of the KV-cache memory. PagedAttention splits the cache into 16-token blocks and allocates them on demand, dropping waste below 4%. Kwon et al. report 2-4x higher throughput in the vLLM paper against prior systems such as FasterTransformer and Orca, and Anyscale reports up to roughly 20x against static batching baselines.
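
The difference between the two allocation strategies can be sketched in a few lines. This is a toy model, not vLLM's block manager: the class and function names are hypothetical, and only the 16-token block size comes from the text above.

```python
import math

BLOCK_TOKENS = 16  # block size quoted above


def contiguous_waste(seq_lens, max_seq_len):
    """Waste when every sequence reserves max_seq_len tokens up front."""
    return 1 - sum(seq_lens) / (max_seq_len * len(seq_lens))


def paged_waste(seq_lens, block_tokens=BLOCK_TOKENS):
    """Waste when only the last block of each sequence is partly empty."""
    allocated = sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)
    return 1 - sum(seq_lens) / allocated


class BlockAllocator:
    """Hands out fixed-size KV blocks on demand from a shared free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> physical block ids, in logical order
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_TOKENS == 0:  # first token, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


lens = [500, 37, 1200, 90]  # assumed current lengths of four sequences
print(f"contiguous: {contiguous_waste(lens, max_seq_len=2048):.1%} wasted")
print(f"paged:      {paged_waste(lens):.1%} wasted")
```

Because blocks are returned to the free list the moment a sequence finishes, the scheduler can admit a new request as soon as enough blocks are free, rather than when a whole max_seq_len slot opens up.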

Throughput vs latency

Continuous batching trades single-request latency for system throughput:

  • TTFT (time to first token) improves vs static batching because requests no longer wait for a batch window to fill.
  • TPOT (time per output token) can get worse under heavy load because each decode step processes more sequences. The weights are read once per step regardless of batch size; the extra cost is the larger KV-cache read.
  • Throughput (tokens/sec across the system) improves dramatically.
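
A back-of-envelope model makes the trade-off concrete. It assumes decode is purely bandwidth-bound and ignores compute, kernel overheads and prefill; the model size, KV-cache size per token and bandwidth figure are illustrative assumptions (roughly a 7B FP16 model on an H100-class GPU), not measurements.

```python
def decode_step_seconds(batch, ctx_len, weight_bytes, kv_bytes_per_token, hbm_bytes_per_s):
    """Bandwidth-only estimate of one decode step.

    Weights are read once per step; each sequence's KV cache is read in full.
    """
    kv_bytes = batch * ctx_len * kv_bytes_per_token
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s


WEIGHTS = 14e9                     # assumed: 7B parameters at 2 bytes each
KV_PER_TOKEN = 2 * 32 * 4096 * 2   # assumed: K and V, 32 layers, hidden 4096, FP16
HBM_BW = 3.35e12                   # assumed: about 3.35 TB/s

for batch in (1, 8, 32):
    tpot = decode_step_seconds(batch, 2048, WEIGHTS, KV_PER_TOKEN, HBM_BW)
    print(f"batch={batch:2d}  TPOT ~{tpot * 1e3:5.1f} ms  throughput ~{batch / tpot:6.0f} tok/s")
```

Under these assumptions each step gets slower as the batch grows, but far less than proportionally, so tokens per second across the system keeps rising while per-request TPOT degrades.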

The right knobs:

  • max_num_seqs - cap on concurrent sequences. Too high pushes TPOT past your SLO.
  • max_num_batched_tokens - cap on tokens processed per step. Limits prefill bursts.
  • gpu_memory_utilization - fraction of HBM vLLM owns. Lower it if other processes share the GPU.
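
As a starting point, the three knobs can be passed as keyword arguments to vLLM's offline `LLM` class. The values and the model name below are placeholders chosen for illustration, not recommendations; the import is deferred so the settings can be inspected on a machine without vLLM or a GPU.

```python
# Illustrative values only; tune them against your own TTFT and TPOT targets.
ENGINE_KWARGS = {
    "max_num_seqs": 64,              # cap on concurrent sequences
    "max_num_batched_tokens": 8192,  # cap on tokens processed per step
    "gpu_memory_utilization": 0.85,  # fraction of HBM vLLM may claim
}


def build_engine(model="facebook/opt-125m"):
    """Create a vLLM engine with the settings above.

    The model name is a small placeholder; substitute the model you serve.
    """
    from vllm import LLM  # requires vLLM and a supported accelerator

    return LLM(model=model, **ENGINE_KWARGS)
```

A sensible tuning loop is to raise max_num_seqs until p99 TPOT approaches your SLO, then use max_num_batched_tokens to limit how much prefill work any single step can absorb.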

Picking a serving stack

Stack | Sweet spot | Strength | Weakness
vLLM | High-throughput batch serving on NVIDIA/AMD | Best general-purpose throughput; ships PagedAttention and prefix caching | Single-stream latency is not its priority
TensorRT-LLM | NVIDIA-only production serving | Best raw throughput on H100/H200, FP8 kernels | Build complexity, vendor lock-in, slower model support
SGLang | Structured generation, agentic workloads | RadixAttention prefix sharing, fast constrained decoding | Younger project, smaller deploy footprint
llama.cpp | Local, CPU, Apple Silicon, edge | GGUF quantisation, Metal kernels, runs anywhere | Single-stream focus, weaker multi-user throughput
Hugging Face TGI | Easy HF model deployment | Good defaults, broad model support | Throughput trails vLLM and TensorRT-LLM

Rules of thumb: vLLM if you want one decision that works for most chat/RAG workloads. TensorRT-LLM if you have an NVIDIA-only fleet and an engineer to run the build. SGLang if your workload is heavy on structured output or shared prefixes. llama.cpp if it has to run on a Mac or a $200 SBC.

When it falls down

  • Single-user, latency-critical serving. Continuous batching's benefit comes from concurrency. If only one request is ever in flight at a time, you are paying scheduler overhead for nothing.
  • Many different models in one process. vLLM serves one model per instance; running 10 small models means 10 processes, each loading its own weights.
  • Streaming with strict per-token latency SLOs. Under load, TPOT variance is higher than with a one-request-at-a-time server. Measure p99.

Further reading