Serving Many LoRA Adapters

A company fine-tunes Llama-3 70B for 500 enterprise clients, one adapter per customer domain. Each adapter is roughly 50 MB; the base model is 140 GB in BF16. Naively serving each adapter on its own GPU instance would require 500 x 140 GB = 70 TB of VRAM, which is absurd. The practical question is: how do you share one copy of the base model weights across all 500 adapters while routing each incoming request to the right fine-tuned variant, with throughput close to single-model serving?
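The arithmetic behind that comparison can be checked in a few lines. This is a rough back-of-envelope sketch that uses only the figures quoted above; it assumes decimal units (1 GB = 1000 MB) and ignores KV cache and activation memory, and the function names are made up for illustration.

```python
# Back-of-envelope VRAM comparison using the figures quoted above.
# Assumes decimal units (1 GB = 1000 MB); ignores KV cache and activations.
NUM_ADAPTERS = 500
BASE_MODEL_GB = 140      # Llama-3 70B in BF16
ADAPTER_MB = 50          # approximate size of one LoRA adapter


def dedicated_instances_gb(num_adapters: int, base_gb: float) -> float:
    """One full copy of the base model per adapter."""
    return num_adapters * base_gb


def shared_base_gb(num_adapters: int, base_gb: float, adapter_mb: float) -> float:
    """One shared base model plus every adapter's weights."""
    return base_gb + num_adapters * adapter_mb / 1000


naive = dedicated_instances_gb(NUM_ADAPTERS, BASE_MODEL_GB)
shared = shared_base_gb(NUM_ADAPTERS, BASE_MODEL_GB, ADAPTER_MB)
print(f"dedicated: {naive / 1000:.0f} TB, shared: {shared:.0f} GB")
```

Even with every adapter resident at once, the shared layout needs about 165 GB of weights rather than 70 TB. The adapters themselves are a small fraction of the total, which is why sharing the base model is the whole game.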

That question is what the LoRA serving literature addresses, and the answer is more subtle than it first looks.

Why Naive Batching Breaks

Standard batched inference works because all requests in a batch share identical weight tensors: every matrix multiply is the same operation applied to different input vectors, so a single GEMM kernel covers the whole batch. The moment you introduce per-request adapter weights, that assumption collapses.

Recall the LoRA parameterisation for a weight matrix W:

output = x W^T  +  x A B
                   ^^^^^
                   adapter contribution

where x is a (1 x d) input row, W is the frozen (k x d) base weight, A (d x r) and B (r x k) are the low-rank matrices, and r is the rank (typically 4-64). Different requests in the same batch have different A and B. A naive loop over requests (one GEMM per request) serialises the adapter contributions and destroys GPU utilisation. You need a kernel that can compute the adapter contributions for all requests in a heterogeneous batch in one pass.
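To make the shapes concrete, here is a minimal sketch of the LoRA forward pass for a single request. It assumes x is a (1 x d) row vector and W is stored as (k x d), so that the adapter term is x A B with the A (d x r) and B (r x k) shapes given above. It uses plain Python lists so it runs without dependencies; the helper names are hypothetical, and a real system would use tensor operations on the GPU.

```python
# Minimal LoRA forward pass for one request, in plain Python lists.
# Assumed shapes: x is (1 x d), W is (k x d), A is (d x r), B is (r x k).

def matmul(p, q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lora_forward(x, W, A, B):
    base = matmul(x, transpose(W))     # (1 x k): identical weights for every request
    delta = matmul(matmul(x, A), B)    # (1 x k): two thin products through rank r
    return [[b + d for b, d in zip(base[0], delta[0])]]


# Toy sizes: d = 3, k = 2, r = 1.
x = [[1.0, 2.0, 3.0]]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0], [0.0], [1.0]]
B = [[2.0, -1.0]]

y = lora_forward(x, W, A, B)

# Sanity check: merging the adapter into W gives the same output.
AB_T = transpose(matmul(A, B))                                  # (k x d)
W_merged = [[w + u for w, u in zip(wr, ur)] for wr, ur in zip(W, AB_T)]
y_merged = matmul(x, transpose(W_merged))
```

Merging the adapter into W reproduces the same output, but it produces a separate full-size weight matrix per adapter, which is exactly what multi-adapter serving is trying to avoid. Keeping the x A B path separate lets the base GEMM stay shared across the batch.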

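The second problem is memory. If you hold all adapter weights in GPU HBM simultaneously, even at rank 16 across all transformer layers, large adapter counts quickly eat into VRAM: the 500 adapters from the opening example, at roughly 50 MB each, already take about 25 GB that would otherwise be available for the KV cache. The system must therefore page adapters between CPU DRAM and GPU HBM based on which adapters are currently scheduled.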
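A toy version of that paging logic is sketched below. The least-recently-used eviction policy, the class name `AdapterCache`, and the `host_store` dictionary are all assumptions made for illustration; the text above does not say which policy any particular system uses, and a real implementation would issue asynchronous host-to-device copies rather than dictionary assignments.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache standing in for the GPU-resident adapter pool."""

    def __init__(self, capacity: int):
        self.capacity = capacity        # max adapters resident on the GPU
        self.resident = OrderedDict()   # adapter_id -> weights, least recent first
        self.loads = 0                  # number of simulated host-to-device copies

    def fetch(self, adapter_id: str, host_store: dict):
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)    # mark as most recently used
            return self.resident[adapter_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)        # evict least recently used
        # Stand-in for copying the adapter from CPU DRAM to GPU HBM.
        self.resident[adapter_id] = host_store[adapter_id]
        self.loads += 1
        return self.resident[adapter_id]


# Hypothetical host-side store holding all 500 adapters.
host_store = {f"client-{i}": f"weights-{i}" for i in range(500)}
cache = AdapterCache(capacity=2)
for adapter_id in ["client-0", "client-1", "client-0", "client-2", "client-1"]:
    cache.fetch(adapter_id, host_store)
```

In this trace the second request for `client-0` is a hit, while `client-1` is evicted to make room for `client-2` and then has to be loaded again. A scheduler that knows which adapters are queued can reduce that kind of thrashing by batching requests for resident adapters first.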

Punica: The Segmented Gather-Scatter Kernel

Punica (Chen et al., 2023) introduced the key kernel primitive: Segmented Gather Matrix-Vector Multiply (SGMV). The idea is compact enough to state precisely.

Given a batch of N requests, group them by adapter identity so that the requests for each adapter form one contiguous segment. The result the kernel has to produce is the same as this per-request loop:

# reference semantics of the adapter contribution pass (NumPy-style notation)
for i in range(N):  # one iteration per request in the batch
    y[i] += x[i] @ A[adapter_id[i]] @ B[adapter_id[i]]

The SGMV kernel executes this in a single GPU launch by treating the adapter index as a gather index into a packed tensor of A and B matrices. All requests sharing the same adapter_id hit the same memory region, enabling cache reuse. Different adapter groups are processed in parallel by separate groups of threads within the same kernel launch.
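The grouping and indexing can be illustrated on the CPU. The sketch below is not the CUDA kernel and makes no claim about how Punica lays out its threads. It only shows, under toy assumptions, that sorting requests into per-adapter segments and scattering the results back gives the same answer as the per-request loop. All function and variable names are hypothetical, and it returns the adapter contribution alone rather than accumulating into y.

```python
# Reference semantics of a segmented adapter pass, in plain Python.
# Shapes: each x[i] is a length-d list, A[a] is (d x r), B[a] is (r x k).

def vecmat(v, m):
    """Row vector (list) times matrix (list of rows)."""
    return [sum(a * b for a, b in zip(v, col)) for col in zip(*m)]


def per_request(x, adapter_id, A, B):
    """The naive loop: one pair of small products per request."""
    return [vecmat(vecmat(x[i], A[adapter_id[i]]), B[adapter_id[i]])
            for i in range(len(x))]


def segmented(x, adapter_id, A, B):
    # Sort request indices so requests sharing an adapter are contiguous.
    order = sorted(range(len(x)), key=lambda i: adapter_id[i])
    # Segment starts: positions in `order` where the adapter changes.
    starts = [p for p in range(len(order))
              if p == 0 or adapter_id[order[p]] != adapter_id[order[p - 1]]]
    bounds = starts + [len(order)]
    y = [None] * len(x)
    for s, e in zip(bounds, bounds[1:]):      # one segment per adapter
        a = adapter_id[order[s]]              # every request in the segment uses A[a], B[a]
        for p in range(s, e):                 # a GPU would process these in parallel
            i = order[p]
            y[i] = vecmat(vecmat(x[i], A[a]), B[a])   # scatter back to the original slot
    return y


# Toy batch: d = 2, r = 1, k = 2, two adapters, four requests.
A = {"a": [[1.0], [0.0]], "b": [[0.0], [1.0]]}
B = {"a": [[1.0, 1.0]], "b": [[2.0, 0.0]]}
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
adapter_id = ["b", "a", "b", "a"]

y_seg = segmented(x, adapter_id, A, B)
y_ref = per_request(x, adapter_id, A, B)
```

Within a segment, every request reads the same A and B, which is where the memory reuse described above comes from. The segment boundaries are the only per-batch metadata the pass needs besides the inputs themselves.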

The result: a GPU holds exactly one copy of the base model. The adapter matrices for the currently scheduled requests are loaded into a side buffer, and the SGMV kernel stitches their contributions onto the base model's output in one shot. Punica reported 12x higher throughput versus naive vLLM serving of multiple LoRA models on the same hardware.
