vLLM and Continuous Batching
Why static batching wastes most of your GPU on variable-length workloads, and how iteration-level scheduling combined with PagedAttention raises throughput by an order of magnitude.
A serving stack that processes requests one at a time leaves an H100 90%+ idle - the GPU is bandwidth-bound on a single decode step. The obvious fix is to batch, but LLM requests do not look like classic ML inference: prompts vary from 10 to 100,000 tokens, generations vary from 1 to 4,000 tokens, and requests arrive at random times. Static batching - waiting for a fixed window, padding everything to the longest sequence - throws most of that batching win away. Continuous batching is the scheduling change that recovers it, and PagedAttention is the memory-management change that makes continuous batching practical.
Why static batching wastes the GPU
In static batching you collect N requests, pad inputs to the longest prompt, and run them through the model together. Two problems:
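- Padding waste. A batch of one 4k prompt and seven 200-token prompts, padded to 4k, spends about 83% of its token slots on pad tokens, and roughly 87% of its attention FLOPs if attention cost is taken as quadratic in sequence length.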
- Tail blocking. The whole batch finishes when the longest generation finishes. A 4000-token completion holds up seven 50-token completions for the duration.
On a typical chat workload (Anyscale 2023) static batching achieves around 1/20th of the throughput continuous batching does on the same hardware.
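The padding share for any batch can be estimated with a few lines of arithmetic. The sketch below assumes attention cost scales with the square of sequence length and ignores the linear layers; the function name and the example batch are illustrative, not taken from any library.

```python
def padding_waste(prompt_lens):
    """Return (token_waste, attention_waste) for a batch padded to its longest prompt."""
    longest = max(prompt_lens)
    padded_tokens = longest * len(prompt_lens)
    token_waste = 1 - sum(prompt_lens) / padded_tokens

    # Assumption: self-attention cost grows with the square of sequence length,
    # so compare sum(n^2) over real lengths against batch_size * longest^2.
    real_attn = sum(n * n for n in prompt_lens)
    padded_attn = len(prompt_lens) * longest * longest
    attn_waste = 1 - real_attn / padded_attn
    return token_waste, attn_waste


# One 4k prompt plus seven 200-token prompts, as in the example above.
batch = [4096] + [200] * 7
token_waste, attn_waste = padding_waste(batch)
print(f"padded token slots wasted: {token_waste:.1%}")
print(f"attention FLOPs wasted:    {attn_waste:.1%}")
```

Swapping in a sample of real prompt lengths from your own traffic gives a quick sense of how much a static batcher would throw away before any benchmarking.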
Continuous (in-flight) batching
The Orca paper (Yu et al., OSDI 2022) introduced iteration-level scheduling: instead of batching at the request level, batch at the decoding-step level. Each iteration the scheduler:
- Picks all currently in-flight sequences.
- For each one, runs exactly one decode step.
- Removes any sequence that emitted EOS or hit max length.
- Admits new sequences from the queue if HBM has room.
A request that finishes early frees its slot immediately. A new request joins the batch on the next iteration without waiting. Throughput rises because the GPU stays busy and padding disappears.
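The loop described above can be sketched as a toy simulation. This is not vLLM's scheduler: it counts decode iterations only, ignores prefill, treats every step as equally expensive, and uses a fixed slot count as a stand-in for "HBM has room". All names are hypothetical.

```python
from collections import deque


def static_batching_steps(gen_lens, batch_size):
    """Each fixed batch runs until its longest generation finishes."""
    steps = 0
    for i in range(0, len(gen_lens), batch_size):
        steps += max(gen_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(gen_lens, max_num_seqs):
    """Iteration-level scheduling: admit, decode one token each, evict finished."""
    queue = deque(gen_lens)  # tokens still to generate, per waiting request
    running = []             # tokens still to generate, per in-flight sequence
    steps = 0
    while queue or running:
        # Admit waiting requests while a slot is free. Doing this at the top of
        # the loop is equivalent to admitting at the end of the previous step.
        while queue and len(running) < max_num_seqs:
            running.append(queue.popleft())
        # Exactly one decode step for every in-flight sequence.
        running = [r - 1 for r in running]
        steps += 1
        # Drop sequences that just finished (EOS or max length).
        running = [r for r in running if r > 0]
    return steps


# Hypothetical workload: one long completion per seven short ones.
gen_lens = [4000, 50, 50, 50, 50, 50, 50, 50] * 4
static_steps = static_batching_steps(gen_lens, batch_size=8)
continuous_steps = continuous_batching_steps(gen_lens, max_num_seqs=8)
print(static_steps, continuous_steps)
```

In the static case every batch is held for the full 4000 steps of its longest member; in the continuous case the short requests drain through the free slots while the long ones keep decoding.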
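The complication is that prefill (processing the initial prompt) and decode (generating one token) have very different shapes: prefill pushes the whole prompt through the model in one compute-heavy pass, while decode handles a single token per sequence and is bandwidth-bound. Schedulers therefore interleave the two. vLLM, SGLang and TensorRT-LLM all support "chunked prefill", where long prompts are sliced and processed alongside ongoing decodes so that no single prefill blocks the batch.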
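A minimal sketch of the chunking idea, assuming a fixed per-step token budget and a constant number of decoding sequences (real schedulers re-plan every step as sequences come and go). The function and field names are hypothetical.

```python
def plan_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Split one long prefill into chunks that share each step with ongoing decodes."""
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for at least one prefill token")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        # Each decoding sequence consumes one token of the budget;
        # whatever is left over goes to the next slice of the prompt.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append({"decode_tokens": num_decoding, "prefill_tokens": chunk})
        remaining -= chunk
    return plan


plan = plan_chunked_prefill(prompt_len=10_000, num_decoding=32, token_budget=2_048)
for step, work in enumerate(plan):
    print(step, work)
```

The point of the budget is that the 32 decoding sequences still emit a token on every step, instead of stalling while the 10,000-token prompt is processed in one pass.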
PagedAttention, the enabler
Continuous batching is only useful if you can actually fit many concurrent sequences in HBM. With a contiguous-allocation KV cache, you reserve max_seq_len per sequence and waste 60-80% of the KV-cache memory on slots that are never filled. PagedAttention splits the cache into 16-token blocks and allocates them on demand, so the only waste is the unfilled tail of each sequence's last block, which drops it below 4%. The combined effect (continuous batching + PagedAttention) is the 2-4x throughput Kwon et al. report in the vLLM paper, and the 20x+ throughput Anyscale reports against static batching baselines.
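The difference between the two allocation strategies comes down to simple arithmetic. The sketch below counts KV slots rather than bytes and uses made-up sequence lengths; it illustrates the accounting, not vLLM's actual block manager.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, as in the text


def contiguous_waste(seq_lens, max_seq_len):
    """Unused fraction of KV slots when every sequence reserves max_seq_len up front."""
    reserved = max_seq_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, block_size=BLOCK_SIZE):
    """Unused fraction of KV slots when blocks are allocated on demand."""
    # Only the last block of each sequence can be partially filled.
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated


# Hypothetical current lengths of eight in-flight sequences.
seq_lens = [130, 517, 45, 1200, 300, 77, 912, 260]
waste_contiguous = contiguous_waste(seq_lens, max_seq_len=2048)
waste_paged = paged_waste(seq_lens)
print(f"contiguous: {waste_contiguous:.1%} wasted")
print(f"paged:      {waste_paged:.1%} wasted")
```

With paging, each sequence wastes at most 15 slots regardless of how long it could have grown, which is why the freed memory can be spent on more concurrent sequences.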
Throughput vs latency
Continuous batching trades single-request latency for system throughput:
- TTFT (time to first token) improves vs static batching because requests no longer wait for a batch window to fill.
- TPOT (time per output token) can get worse under heavy load because each decode step processes more sequences. The weights are read once per step regardless of batch size; the extra cost is the larger KV-cache read.
- Throughput (tokens/sec across the system) improves dramatically.
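As a concrete reading of these metrics, the sketch below derives TTFT and TPOT for one request from client-side timestamps. It assumes TPOT is the mean gap between consecutive tokens after the first; other tools may define it slightly differently, and the function name is hypothetical.

```python
def ttft_and_tpot(request_sent, token_times):
    """Return (TTFT, mean TPOT) in seconds for one streamed response."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean inter-token gap, excluding the wait for the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timestamps (seconds) for a five-token response.
ttft, tpot = ttft_and_tpot(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms")
```

System throughput is then total generated tokens divided by wall-clock time across all requests, which is why it can rise even while per-request TPOT gets worse.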
The right knobs:
- max_num_seqs - cap on concurrent sequences. Too high pushes TPOT past your SLO.
- max_num_batched_tokens - cap on tokens processed per step. Limits prefill bursts.
- gpu_memory_utilization - fraction of HBM vLLM owns. Lower it if other processes share the GPU.
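One way to keep these three settings explicit is to build the server command from code. The sketch below assembles a `vllm serve` invocation using the command-line spellings of the knobs; the model name and the values are placeholders, not recommendations, and the flags should be checked against the vLLM version you have installed.

```python
import shlex


def vllm_serve_args(model, max_num_seqs, max_num_batched_tokens, gpu_memory_utilization):
    """Build an argument list for `vllm serve` with the three scheduling knobs set."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# "my-org/my-chat-model" is a placeholder; the numbers are illustrative starting points.
args = vllm_serve_args("my-org/my-chat-model", 64, 8192, 0.85)
command = shlex.join(args)
print(command)
```

The resulting list can be passed to `subprocess.run(args)` on a machine with vLLM installed. Tune max_num_seqs against your TPOT SLO first, then adjust the token cap.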
Picking a serving stack
| Stack | Sweet spot | Strength | Weakness |
|---|---|---|---|
| vLLM | High-throughput batch serving on NVIDIA/AMD | Best general-purpose throughput; ships PagedAttention and prefix caching | Single-stream latency is not its priority |
| TensorRT-LLM | NVIDIA-only production serving | Best raw throughput on H100/H200, FP8 kernels | Build complexity, vendor lock-in, slower model support |
| SGLang | Structured generation, agentic workloads | RadixAttention prefix sharing, fast constrained decoding | Younger project, fewer production deployments |
| llama.cpp | Local, CPU, Apple Silicon, edge | GGUF quantisation, Metal kernels, runs anywhere | Single-stream focus, weaker multi-user throughput |
| Hugging Face TGI | Easy HF model deployment | Good defaults, broad model support | Throughput trails vLLM and TensorRT-LLM |
Rules of thumb: vLLM if you want one decision that works for most chat/RAG workloads. TensorRT-LLM if you have an NVIDIA-only fleet and an engineer to run the build. SGLang if your workload is heavy on structured output or shared prefixes. llama.cpp if it has to run on a Mac or a $200 SBC.
When it falls down
- Single-user, latency-critical serving. Continuous batching's benefit comes from concurrency. If you only ever have one request in flight, you are paying scheduler overhead for nothing.
- Very heterogeneous models in one process. vLLM serves one model per instance; running 10 small models means 10 processes and 10 lots of weight-load.
- Streaming with strict per-token latency SLOs. Under load, TPOT variance is higher than with a one-request-at-a-time server. Measure p99.
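For the p99 measurement in the last point, a nearest-rank percentile over recorded inter-token gaps is enough to start with. The samples below are invented to show how a small number of slow steps can dominate the tail while the median stays flat.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical inter-token gaps in milliseconds: mostly steady, a few stalls.
inter_token_ms = [20] * 97 + [25, 180, 400]
p50 = percentile(inter_token_ms, 50)
p99 = percentile(inter_token_ms, 99)
print(f"p50 = {p50} ms, p99 = {p99} ms")
```

A mean or median over the same samples would look healthy, which is why a per-token SLO should be checked against the tail.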
Further reading
- Efficient Memory Management for Large Language Model Serving with PagedAttention - the vLLM paper, with both PagedAttention and the scheduler.
- vLLM launch blog - the practitioner version with throughput plots vs HF and TGI.
- How continuous batching enables 23x throughput in LLM inference (Anyscale) - clear walkthrough of static vs continuous batching with measured numbers.
- vllm-project/vllm on GitHub - reference implementation and the place to read the scheduler code.