Composing and Stacking Adapters

Suppose you have already fine-tuned a 7B model with one LoRA adapter for Spanish translation and a second for legal summarisation. A third client needs a Spanish-language legal summariser. The naive answer is to train a third adapter from scratch. The more interesting answer is to ask whether the two existing adapters can be combined into something useful without touching the base model again.

That question is what adapter composition is about. It is harder than it sounds: adding two sets of adapter weights is mechanically easy, but the effects of independently trained LoRA or bottleneck adapters on the model's behaviour are not simply additive, and the ways they can interfere are subtle.

What "composition" actually means

The term covers at least three distinct operations, and conflating them leads to confusion:

Operation	What happens	Typical use case
Linear interpolation	Weighted sum of two adapter delta-weights	Blending styles or tasks
Sequential stacking	Adapter A output feeds into Adapter B	Modular skill chaining
Attention-based fusion	Gating network selects across adapters per layer	Multi-task with learned routing

Each carries different assumptions about whether the adapters were trained on compatible objectives, whether their rank subspaces overlap, and whether the base model can tolerate the combined perturbation.

Linear interpolation: the simplest case

For LoRA, the adapter's effect on a weight matrix W is:

Delta W = (alpha / r) * B @ A

If you have two adapters with deltas Delta_1 and Delta_2, the linearly interpolated model uses:

W_eff = W + lambda_1 * Delta_1 + lambda_2 * Delta_2

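The PEFT library's add_weighted_adapter method exposes this operation through combination_type="linear". Two caveats apply. First, the linear mode requires both adapters to have the same rank and, as implemented at the time of writing, combines the A and B factors rather than the full deltas, so the result approximates the sum above instead of reproducing it exactly. Second, the output is a new adapter, not a modified base weight: the combined delta only lands in the same weight tensor, with no adapter scaffolding left at inference time, once that adapter is merged into the base model (for example with merge_and_unload).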
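The interpolation formula can be checked on toy matrices. The following is a minimal NumPy sketch, not PEFT's code: the helper names lora_delta and interpolate, the matrix sizes, and the 0.5/0.5 weights are all assumptions made for illustration. It also contrasts summing the deltas with averaging the A and B factors, which is a different operation because of cross terms.

```python
import numpy as np

def lora_delta(A, B, alpha):
    """Return Delta W = (alpha / r) * B @ A for A: (r, d_in), B: (d_out, r)."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def interpolate(W, deltas, lambdas):
    """Return W_eff = W + sum_i lambda_i * Delta_i without modifying W."""
    W_eff = W.copy()
    for lam, delta in zip(lambdas, deltas):
        W_eff += lam * delta
    return W_eff

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4.0  # toy sizes, chosen for illustration

W = rng.normal(size=(d_out, d_in))
A1, B1 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
A2, B2 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))

delta_1 = lora_delta(A1, B1, alpha)
delta_2 = lora_delta(A2, B2, alpha)
W_eff = interpolate(W, [delta_1, delta_2], [0.5, 0.5])

# Averaging the factors is a different operation: the product of the
# averaged factors contains cross terms B1 @ A2 and B2 @ A1.
factor_avg = lora_delta(0.5 * (A1 + A2), 0.5 * (B1 + B2), alpha)
print("factor average equals delta average:",
      np.allclose(W_eff - W, factor_avg))
```

Each delta has rank at most r, while their weighted sum can have rank up to 2r, which is why a combined adapter needs either a larger rank or some form of compression.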

When does this work? When the two tasks are genuinely compatible: the adapters push the model in related directions, and their effects approximately superpose. When does it fail? When the adapters have learned conflicting representations, for instance when the language adapter pushes certain attention heads into one subspace while the domain adapter pushes the same heads into another. The combined model may then regress on both tasks. Interference is more likely when both adapters target the same modules at high rank.
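One rough way to probe whether two adapters write into the same subspace of a given module is to compare the column spaces of their B matrices using principal angles. This is a hypothetical diagnostic, not something the PEFT library provides; the function name subspace_overlap and the score (mean squared cosine of the principal angles) are assumptions for illustration. A high score shows only that the two adapters modify the same output directions, not that they conflict.

```python
import numpy as np

def subspace_overlap(B1, B2):
    """Mean squared cosine of the principal angles between col(B1) and col(B2).

    B1: (d_out, r1), B2: (d_out, r2), both assumed to have full column rank.
    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    Q1, _ = np.linalg.qr(B1)  # orthonormal basis of col(B1)
    Q2, _ = np.linalg.qr(B2)  # orthonormal basis of col(B2)
    # Singular values of Q1^T Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
d_out, r = 16, 4  # toy sizes, chosen for illustration

B_same = rng.normal(size=(d_out, r))
B_left = np.eye(d_out)[:, :r]          # spans coordinates 0..3
B_right = np.eye(d_out)[:, r:2 * r]    # spans coordinates 4..7

overlap_identical = subspace_overlap(B_same, B_same)
overlap_orthogonal = subspace_overlap(B_left, B_right)
overlap_random = subspace_overlap(B_same, rng.normal(size=(d_out, r)))
```

In practice such a score would be computed per target module, so that modules with high overlap can be inspected or given lower interpolation weights.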

A more involved variant is the svd combination type: the weighted delta matrices are summed, then re-factorised to a target rank via a truncated SVD. This re-compresses the combined delta but discards the components below the rank cutoff. The cat type instead concatenates the B and A matrices, so the combined rank is the sum of the input ranks (double, for two adapters of equal rank); this preserves the summed delta exactly but may cause OOM at high ranks. The dare_linear and dare_ties methods add a pruning step first: a random fraction of the delta entries is dropped and the survivors are rescaled before merging, which reduces interference in practice (see the DARE paper, arxiv.org/abs/2311.03099).
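The trade-offs between these strategies show up on small matrices. The sketch below is a NumPy illustration under stated assumptions, not PEFT's implementation: the function names combine_cat, combine_svd, and dare_prune are hypothetical, the alpha / r scaling is assumed to be folded into the factors already, and the DARE step follows the drop-and-rescale rule from the paper (random drop at rate p, survivors scaled by 1 / (1 - p)).

```python
import numpy as np

def combine_cat(A1, B1, A2, B2, w1=1.0, w2=1.0):
    """Concatenate factors; the result has rank r1 + r2 and satisfies
    B @ A == w1 * B1 @ A1 + w2 * B2 @ A2 exactly."""
    A = np.concatenate([w1 * A1, w2 * A2], axis=0)  # (r1 + r2, d_in)
    B = np.concatenate([B1, B2], axis=1)            # (d_out, r1 + r2)
    return A, B

def combine_svd(delta_sum, r):
    """Re-factorise a summed delta to rank r with a truncated SVD."""
    U, S, Vt = np.linalg.svd(delta_sum, full_matrices=False)
    B = U[:, :r] * S[:r]  # fold singular values into B: (d_out, r)
    A = Vt[:r, :]         # (r, d_in)
    return A, B

def dare_prune(delta, drop_rate, rng):
    """Drop entries at random with probability drop_rate, rescale the rest."""
    keep = rng.random(delta.shape) >= drop_rate
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2  # toy sizes, chosen for illustration
A1, B1 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
A2, B2 = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
delta_sum = B1 @ A1 + B2 @ A2

A_cat, B_cat = combine_cat(A1, B1, A2, B2)
A_svd, B_svd = combine_svd(delta_sum, r)

cat_error = np.linalg.norm(delta_sum - B_cat @ A_cat)  # exact up to rounding
svd_error = np.linalg.norm(delta_sum - B_svd @ A_svd)  # energy lost below the cutoff
pruned = dare_prune(B1 @ A1, drop_rate=0.5, rng=rng)
```

The concatenated adapter is exact but twice as large, while the SVD version keeps rank r at the cost of an error equal to the norm of the discarded singular values.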

Sequential stacking

Bottleneck adapters (Houlsby et al., 2019) insert small feed-forward modules inside transformer layers.
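As a rough illustration of what such a module looks like and what stacking means, here is a minimal NumPy sketch. It assumes the usual bottleneck shape (down-projection, nonlinearity, up-projection, residual connection); the class name BottleneckAdapter, the helper stack, the ReLU activation, the omission of biases, and the zero-initialised up-projection are simplifications chosen for this example, not details taken from the text above.

```python
import numpy as np

class BottleneckAdapter:
    """h -> h + relu(h @ W_down) @ W_up, with d_bottleneck << d_model."""

    def __init__(self, d_model, d_bottleneck, rng):
        self.W_down = rng.normal(size=(d_model, d_bottleneck))
        # Zero up-projection: an untrained adapter is the identity map.
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        hidden = np.maximum(h @ self.W_down, 0.0)  # ReLU (assumed activation)
        return h + hidden @ self.W_up              # residual connection

def stack(adapters, h):
    """Sequential stacking: each adapter consumes the previous one's output."""
    for adapter in adapters:
        h = adapter(h)
    return h

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 4  # toy sizes, chosen for illustration
h = rng.normal(size=(3, d_model))  # three token representations

adapter_a = BottleneckAdapter(d_model, d_bottleneck, rng)
adapter_b = BottleneckAdapter(d_model, d_bottleneck, rng)
untrained_out = stack([adapter_a, adapter_b], h)  # identity at initialisation

# Stand in for training by giving both adapters non-zero up-projections.
adapter_a.W_up = rng.normal(size=(d_bottleneck, d_model))
adapter_b.W_up = rng.normal(size=(d_bottleneck, d_model))
out_ab = stack([adapter_a, adapter_b], h)
out_ba = stack([adapter_b, adapter_a], h)
```

Because each adapter is nonlinear and sees the other's output, out_ab and out_ba generally differ, so the stacking order is part of the design, unlike in linear interpolation.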
