Adapter Modules

Fine-tuning BERT for ten downstream tasks means storing ten copies of a 110-million-parameter model. That is 440 GB of checkpoints for a single research experiment. Adapter modules were invented to make that absurd. The original proposal by Houlsby et al. (2019) showed that you can match full fine-tuning on the GLUE benchmark within 0.4% by training only 3.6% of the parameters per task. Everything else stays frozen.

What an adapter module is

An adapter is a small sub-network inserted into each transformer layer. The standard Houlsby design adds two adapters per layer: one after the multi-head attention projection and one after the feed-forward sublayer. Each adapter is a three-operation sequence:

A down-projection from the hidden dimension d to a bottleneck dimension r (where r << d).
A non-linearity (typically GELU or ReLU).
An up-projection back to d.

A residual connection wraps the whole unit so that, at initialisation, the adapter is the identity function. This means you can insert adapters into a pretrained model and immediately begin training without destabilising the pretrained representations.

# Pseudocode for a single adapter forward pass
def adapter(x, W_down, W_up):
    h = gelu(x @ W_down)   # shape: [batch, seq, r]
    return x + h @ W_up    # shape: [batch, seq, d]

If d = 768 and r = 64, the adapter adds only 2 * 768 * 64 + 64 = 98,432 parameters, versus the ~7 million in the surrounding transformer block. The ratio is roughly 1.4%.

Mathematically, the adapted output for a given sublayer f is:

y = f(x) + W_up * GeLU(W_down * f(x))

where W_down ∈ R^{d×r} and W_up ∈ R^{r×d}. Only W_down and W_up are updated during training; all of f's parameters are frozen.

Why bottleneck adapters work

Pretrained transformers already encode rich, general-purpose representations. Fine-tuning on a small downstream dataset mostly adjusts the output distribution, not the core representational capacity. Adapters exploit this: the frozen layers supply the heavy lifting; the adapters nudge the activations toward the target task.

The bottleneck structure acts as a form of regularisation. Because information must pass through a low-dimensional gate before being added back, the adapter cannot overfit by memorising arbitrary perturbations in the high-dimensional residual stream. This is especially valuable on small datasets (thousands of examples) where full fine-tuning risks degrading pretrained structure.

A useful comparison across adapter placement strategies:

Strategy	Adapters per layer	Trainable params (BERT-base)	GLUE avg vs full FT
Houlsby (Series, 2-per-layer)	2	~3.6%	-0.4%
Pfeiffer (Series, 1-per-layer)	1	~1.8%	-0.4%
Parallel adapters	1	~1.8%	comparable
LoRA (rank 8)	-	~0.3%	comparable

The Pfeiffer variant, which inserts a single adapter only after the feed-forward sublayer, achieves nearly identical performance at half the parameter overhead.

One of the more practical advantages of the bottleneck design is composability. Because the pretrained backbone is identical across tasks, you can swap adapter weights at inference time without reloading the base model. AdapterHub (adapterhub.ml) distributes pretrained adapters for hundreds of tasks; each checkpoint is roughly 1 MB, compared to hundreds of MB for a full model.

AdapterFusion (Pfeiffer et al., 2020) extends this further. It trains a small attention mechanism over a collection of pretrained adapters, learning to weight their contributions per token rather than picking one. This avoids catastrophic forgetting across tasks, because no single adapter's weights are modified during the fusion stage.

A two-stage workflow looks like this:

Stage 1. Train an independent adapter for each source task. Freeze everything else.
Stage 2. Load all adapters frozen into the backbone, add a trainable AdapterFusion layer, and train only the fusion weights on the target task.

The result can outperform both single-task adapters and full multi-task fine-tuning on compositional transfer scenarios.

Adapters vs. LoRA: where they diverge

LoRA (Hu et al., 2021) is often described as a competing PEFT method, but the two solve slightly different problems. LoRA reparametrises weight updates as low-rank matrices added directly to the frozen weight matrices; adapters insert new modules into the forward pass.

The critical operational difference is inference latency. An adapter adds a forward pass through two linear layers on every token at every layer. This is a sequential bottleneck: the compute cannot be fused into the existing weight matrices. LoRA's low-rank updates, by contrast, can be merged into the original weights at serving time (W' = W + BA), leaving no inference overhead at all.

For a 7B-parameter model served at high throughput, a 5-10% per-layer latency penalty from adapters accumulates into a meaningful cost. For offline evaluation or low-traffic deployments, the gap is rarely decisive.

AdapterDrop (Rücklé et al., 2020) addresses the latency problem by dropping adapters from the lower transformer layers entirely. These layers encode mostly syntactic structure, which transfers well without adaptation. Dropping adapters from the bottom 11 of 12 BERT layers recovers roughly 70% of the inference speed penalty while retaining most of the task performance.

When it falls down

Very small bottleneck rank. Setting r too low (for example, r = 2 on a complex classification task) underfits badly. Unlike LoRA, where rank interacts with weight geometry in a well-studied way, the right r for adapters must be validated empirically and is sensitive to task diversity.

Multi-task interference during adapter training. Training a single adapter on multiple tasks simultaneously often hurts; the adapter bottleneck is too narrow to capture the variation. The recommended pattern is one adapter per task, with composition handled at the AdapterFusion stage.

Serving infrastructure. Most production inference frameworks (vLLM, TensorRT-LLM, SGLang) optimise the attention and FFN kernels heavily. Inserting non-standard bottleneck layers breaks kernel fusion and prevents the use of fused multi-head attention implementations. Deploying adapters at scale often requires custom CUDA extensions or accepting the latency penalty, whereas LoRA weights can be absorbed into standard matrix multiplication.

Catastrophic forgetting in the backbone. Adapters only protect the frozen backbone from forgetting; if you accidentally leave the backbone unfrozen (a common mistake when using libraries that default to full fine-tuning), you lose the composability property entirely.

Generative tasks. Most adapter literature benchmarks on discriminative tasks (classification, NER, question answering). For open-ended generation, the feedback from adapter weights to the frozen backbone's generative distribution is weaker, and quality gaps versus full fine-tuning are larger and harder to close by tuning r alone.

What an adapter module is

Why bottleneck adapters work

Composing and sharing adapters

Adapters vs. LoRA: where they diverge

When it falls down

Further reading