Applied LLMs
Merging LoRA into Base Weights
After LoRA training, the low-rank adapter matrices can be folded directly into the frozen base weights, eliminating inference overhead and adapter management complexity while producing a standard dense model.
Serving a LoRA-adapted model without merging introduces a two-path forward pass: every token must traverse both the frozen base weight W and the low-rank branch BA, and the two results are then summed. That extra branch is not free. At decode time, where each layer is already dominated by a single matrix-vector product, it can cost roughly 5-15% wall-clock overhead depending on rank, kernel implementation, and hardware. Merging eliminates that overhead entirely by folding the two paths into one.
What the merge computes
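LoRA trains two low-rank matrices, A (shape r x d_in) and B (shape d_out x r), alongside a frozen weight W (shape d_out x d_in). A is randomly initialised and B is initialised to zero, so BA = 0 at the start of training and the adapted model starts out identical to the base. After training, the effective weight the model has learnt is: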
W_merged = W + (alpha / r) * B @ A
alpha is the scaling hyperparameter (lora_alpha in LoraConfig); dividing by r keeps the magnitude of the update roughly comparable across ranks, so other hyperparameters need less retuning when r changes. The resulting W_merged has exactly the same shape as the original frozen weight. No structural change to the model is needed: replace the old tensor in place and remove the adapter scaffolding.
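The equivalence between the two-path forward pass and the merged weight can be checked numerically. The following is a minimal NumPy sketch, not PEFT code: it assumes the PyTorch `nn.Linear` convention in which W has shape (d_out, d_in), so B is (d_out, r) and A is (r, d_in), and the toy sizes and random values are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 6, 4, 2, 16    # toy sizes, illustrative only

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in))         # LoRA down-projection
B = rng.normal(size=(d_out, r))        # LoRA up-projection (non-zero, as after training)
scale = alpha / r

x = rng.normal(size=d_in)              # one input activation

# Unmerged: base path plus low-rank path, summed
y_two_path = W @ x + scale * (B @ (A @ x))

# Merged: a single dense weight with the same shape as W
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_two_path, y_merged))
```

The two outputs agree up to floating-point rounding, which is why the merged model needs no adapter branch at inference time.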
In code (using HuggingFace PEFT):
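from peft import PeftModel
from transformers import AutoModelForCausalLM
# base_model is a standard Transformers model (a causal LM is assumed here);
# load the same checkpoint the adapter was trained against
base_model = AutoModelForCausalLM.from_pretrained("path/to/base_model")
peft_model = PeftModel.from_pretrained(base_model, "path/to/lora_adapter")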
# Folds BA into every target module and strips the adapter objects
merged_model = peft_model.merge_and_unload()
# merged_model is now a plain Transformers model; save it like any other
merged_model.save_pretrained("path/to/merged_model")
merge_and_unload() is a single call that (a) computes W + scale * BA for every LoRA-targeted linear layer and (b) returns a plain base-model instance with no PEFT overhead. If you want to keep the ability to unmerge later (for adapter swapping in production), use merge_adapter() instead; this mutates the weights in-place but retains the LoRA parameter objects so unmerge_adapter() can reconstruct W.
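The difference between a reversible merge and a merge-and-unload can be illustrated with a toy layer. This is a sketch of the idea only: `ToyLoraLinear` is a hypothetical class written for this example and is not how PEFT implements `merge_adapter()` or `unmerge_adapter()`.

```python
import numpy as np

class ToyLoraLinear:
    """Toy stand-in for a LoRA-wrapped linear layer (hypothetical, not PEFT's class)."""

    def __init__(self, W, A, B, scale):
        self.W = W.copy()          # base weight, mutated in place by merge/unmerge
        self.A, self.B, self.scale = A, B, scale
        self.merged = False

    def delta(self):
        return self.scale * (self.B @ self.A)

    def merge(self):
        # Fold the adapter into W but keep A and B around
        if not self.merged:
            self.W += self.delta()
            self.merged = True

    def unmerge(self):
        # Subtract the same delta to recover the base weight
        if self.merged:
            self.W -= self.delta()
            self.merged = False

    def forward(self, x):
        y = self.W @ x
        if not self.merged:
            y = y + self.scale * (self.B @ (self.A @ x))
        return y

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
A = rng.normal(size=(2, 6))
B = rng.normal(size=(4, 2))
x = rng.normal(size=6)

layer = ToyLoraLinear(W, A, B, scale=2.0)
y_unmerged = layer.forward(x)   # two-path output
layer.merge()
y_merged = layer.forward(x)     # single-path output, same values
layer.unmerge()                 # back to the base weight, up to rounding
```

Because unmerging subtracts the delta rather than restoring a saved copy, the recovered weight matches the original only up to floating-point rounding, which matters more in low-precision dtypes.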
The scaling constant and why it matters
A common source of silent bugs is confusing lora_alpha with the actual scale factor. The effective multiplier applied to BA is alpha / r, not alpha alone. A configuration of r=8, alpha=16 gives a scale of 2.0; changing to r=16, alpha=16 halves it to 1.0 with no change to alpha. Two adapters trained with identical alpha but different ranks therefore need different scales at merge time, and applying one adapter's scale to the other produces wrong merged weights.
| r | alpha | scale = alpha/r |
|---|---|---|
| 4 | 16 | 4.0 |
| 8 | 16 | 2.0 |
| 16 | 16 | 1.0 |
| 16 | 32 | 2.0 |
When writing custom merge code, always retrieve the scale from the adapter config rather than hard-coding it.
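One way to follow that advice is to derive the scale from the saved adapter configuration. The sketch below assumes the config has been parsed into a dict with the keys `r`, `lora_alpha`, and `use_rslora`, as PEFT writes them to `adapter_config.json`; the helper name `lora_scale` is hypothetical, and per-module overrides such as rank or alpha patterns are deliberately ignored.

```python
import json
import math

def lora_scale(config: dict) -> float:
    """Return the multiplier applied to B @ A for one adapter config."""
    r = config["r"]
    alpha = config["lora_alpha"]
    if config.get("use_rslora", False):
        # Rank-stabilised LoRA scales by alpha / sqrt(r)
        return alpha / math.sqrt(r)
    return alpha / r

# Stand-in for the parsed contents of an adapter_config.json file
config = json.loads('{"r": 8, "lora_alpha": 16, "use_rslora": false}')
scale = lora_scale(config)
```

Reading the values from the config keeps custom merge code correct when the rank or alpha changes between training runs.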
Merging multiple adapters and task arithmetic
Because the merge is just addition in weight space, you can combine more than one LoRA. Given adapters trained on tasks 1 and 2 (both starting from the same base), define the delta weights:
delta_1 = scale_1 * B_1 @ A_1
delta_2 = scale_2 * B_2 @ A_2
W_combined = W + delta_1 + delta_2
This is the core idea behind task arithmetic (Ilharco et al., 2023, arXiv:2212.04089): a fine-tuned model minus its base yields a "task vector", and task vectors can be added, subtracted, or interpolated. It also underlies DARE (Yu et al., 2023, arXiv:2311.03099), which randomly drops a fraction of the delta parameters and rescales the rest before merging, using the observation that delta values in LLM fine-tunes are typically very small (often within 0.002 of zero) and highly redundant.
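A minimal NumPy sketch of adding two adapter deltas to a shared base follows. It assumes the `nn.Linear` shape convention (W is d_out x d_in); the sizes, ranks, and the interpolation weights `lam1` and `lam2` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 12, 10
W = rng.normal(size=(d_out, d_in))           # shared base weight

# Adapter 1: r=2, alpha=4 -> scale 2.0
A1, B1, scale1 = rng.normal(size=(2, d_in)), rng.normal(size=(d_out, 2)), 4 / 2
# Adapter 2: r=4, alpha=4 -> scale 1.0
A2, B2, scale2 = rng.normal(size=(4, d_in)), rng.normal(size=(d_out, 4)), 4 / 4

delta1 = scale1 * (B1 @ A1)
delta2 = scale2 * (B2 @ A2)

lam1, lam2 = 1.0, 0.5                        # hypothetical interpolation weights
W_combined = W + lam1 * delta1 + lam2 * delta2

# Subtracting the base recovers the combined task vector
task_vector = W_combined - W
combined_rank = np.linalg.matrix_rank(task_vector)
```

The combined delta has rank at most r_1 + r_2, so summing adapters in weight space does not require the adapters to share a rank, although each must be scaled with its own alpha / r.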
PEFT supports multi-adapter combination directly via add_weighted_adapter(), with combination_type options including "linear" (weighted sum), "ties" (sign-consensus selection), "dare_linear", and "dare_ties" for sparsified variants.
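The drop-and-rescale step behind the DARE variants can be sketched in a few lines. This is an illustration on a dense, synthetic delta matrix under the assumption of an independent random mask per entry; it is not PEFT's implementation, and the function name `dare_sparsify` is hypothetical.

```python
import numpy as np

def dare_sparsify(delta, drop_rate, rng):
    """Randomly zero a fraction `drop_rate` of entries and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep = rng.random(delta.shape) >= drop_rate
    # Dividing by (1 - drop_rate) keeps the expected value of each entry unchanged
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)

rng = np.random.default_rng(0)
delta = rng.normal(scale=0.002, size=(256, 256))   # synthetic small deltas
sparse = dare_sparsify(delta, drop_rate=0.9, rng=rng)
kept_fraction = np.count_nonzero(sparse) / sparse.size
```

With a drop rate of 0.9, roughly one entry in ten survives and each survivor is multiplied by ten, so the sparsified delta matches the original in expectation.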
Practical considerations before you merge
Precision. If the base model was loaded in bfloat16 and the LoRA matrices trained in float32 (common when using mixed-precision), the accumulation W + scale * BA should be done in float32 before casting down. Doing the accumulation in bfloat16 can introduce rounding error large enough to shift the model's output distribution, particularly for small-rank adapters where the delta is already tiny.
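The effect of rounding order can be seen in a small experiment. NumPy has no native bfloat16, so the sketch below emulates it by rounding float32 values to the nearest value with a 16-bit pattern; `round_to_bf16` is a hypothetical helper, the weights are synthetic, and path (b) is only a rough stand-in for a genuinely low-precision merge.

```python
import numpy as np

def round_to_bf16(x):
    """Round to the nearest bfloat16 value (ties to even), returned as float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Add a rounding bias to the 16 low bits that bfloat16 discards
    bias = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
d_in, d_out, r, scale = 64, 64, 8, np.float32(2.0)
W = round_to_bf16(rng.normal(scale=0.05, size=(d_out, d_in)))   # base stored in bf16
A = rng.normal(scale=0.02, size=(r, d_in)).astype(np.float32)   # adapter in float32
B = rng.normal(scale=0.02, size=(d_out, r)).astype(np.float32)

# High-precision reference for measuring error
reference = W.astype(np.float64) + 2.0 * (B.astype(np.float64) @ A.astype(np.float64))

# (a) accumulate in float32, cast down once at the end
merged_fp32_accum = round_to_bf16(W + scale * (B @ A))

# (b) round the factors and the delta to bf16 before the addition
delta_bf16 = round_to_bf16(scale * round_to_bf16(round_to_bf16(B) @ round_to_bf16(A)))
merged_bf16_accum = round_to_bf16(W + delta_bf16)

err_fp32 = np.abs(merged_fp32_accum - reference).mean()
err_bf16 = np.abs(merged_bf16_accum - reference).mean()
print(err_fp32, err_bf16)
```

Path (a) rounds once, so each merged entry lands on the bfloat16 value closest to the float32 sum; path (b) rounds several times, and each extra rounding can push an entry to a neighbouring value.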
Quantised bases (QLoRA). When the frozen weights are quantised (e.g., 4-bit NF4 via bitsandbytes), the adapter matrices live in float16/bfloat16 outside the quantised kernel. Merging requires dequantising W first, performing the addition in float16 or higher, and then re-quantising if you still want a quantised final model. What merge_and_unload() does with a quantised layer depends on the PEFT version: recent releases dequantise, add the delta, and re-quantise in place for bitsandbytes layers, with a warning that rounding may change generations. If you want a half-precision merged model instead, a common approach is to reload the base in float16/bfloat16, attach the adapter, and merge there. Either way, check the dtype of the merged weights rather than assuming it.
Multiple loaded adapters. If you have loaded several adapters under different names and only some are active, merge_and_unload() by default merges only the active one(s); its adapter_names argument lets you list the adapters to merge explicitly. Confirm which adapters are active before calling it.
When it falls down
Rank-precision mismatch with QLoRA. Merging a QLoRA adapter into its quantised base and then immediately re-quantising compounds two sources of error: the quantisation error already present in the base the adapter was trained against, and the rounding introduced when the merged weight is quantised again. For very small ranks (r=4 or below), the delta can also be smaller than the spacing between representable bfloat16 values at the magnitude of the base weight, in which case it is rounded away and the fine-tuning signal is effectively cancelled. LoftQ-style initialisations reduce this by minimising quantisation error from the start.
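The bfloat16 part of this argument can be made concrete with a short worked bound. It assumes round-to-nearest and a base weight w that is itself a bfloat16 value lying strictly inside a binade; the value 0.05 is an illustrative magnitude, not a measurement.

```latex
% bfloat16 keeps 8 significant bits (1 implicit + 7 stored)
\[
\operatorname{ulp}(w) = 2^{k-7} \quad \text{for } 2^{k} < |w| < 2^{k+1},
\qquad
|\Delta| < \tfrac{1}{2}\operatorname{ulp}(w)
\;\Longrightarrow\;
\operatorname{bf16}(w + \Delta) = w .
\]
\[
w \approx 0.05:\quad 2^{-5} < 0.05 < 2^{-4},\quad
\operatorname{ulp}(w) = 2^{-12} \approx 2.4 \times 10^{-4},\quad
\tfrac{1}{2}\operatorname{ulp}(w) \approx 1.2 \times 10^{-4}.
\]
```

Under these assumptions, any delta entry smaller in magnitude than about 1.2e-4 leaves a weight near 0.05 unchanged once the merged value is cast back to bfloat16.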
Adapter recycling is gone. Once merged and unloaded, the adapter can no longer be detached, swapped, or combined with another adapter from the merged checkpoint alone; you have to go back to the original base weights and the saved adapter files, or re-train if those were not kept. For production systems that need to serve dozens of task-specific variants from a single base, merging every adapter is the wrong strategy; adapter serving (keeping adapters separate, loading them on demand) is the correct one.
Task interference on dissimilar tasks. Adding two task vectors works well when the tasks are compatible (e.g., two instruction-following datasets in the same language). When tasks place conflicting demands on the same weight directions, the combined model degrades on both. TIES and DARE mitigate this by zeroing out small or conflicting deltas, but they cannot eliminate interference entirely.
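How sign-consensus merging suppresses conflicting deltas can be sketched as follows. This is a simplified TIES-style procedure (trim, elect a sign, average the agreeing entries) written for illustration; it is not PEFT's implementation, and the function name `ties_merge` and the `density` default are hypothetical.

```python
import numpy as np

def ties_merge(deltas, density=0.2):
    """Simplified TIES-style merge of same-shape delta arrays."""
    stacked = np.stack(deltas)                        # (n_tasks, ...)
    # 1. Trim: keep only the largest-magnitude `density` fraction per task
    trimmed = np.zeros_like(stacked)
    for i, d in enumerate(stacked):
        k = max(1, int(round(density * d.size)))
        threshold = np.sort(np.abs(d), axis=None)[-k]
        trimmed[i] = np.where(np.abs(d) >= threshold, d, 0.0)
    # 2. Elect one sign per parameter from the summed trimmed values
    elected = np.sign(trimmed.sum(axis=0))
    # 3. Average only the non-zero entries that agree with the elected sign
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = agree.sum(axis=0)
    total = np.where(agree, trimmed, 0.0).sum(axis=0)
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)

delta_1 = np.array([1.0, -0.5, 0.1, 0.05])
delta_2 = np.array([-0.2, 0.8, 0.6, 0.01])
merged = ties_merge([delta_1, delta_2], density=0.5)
print(merged)
```

In this toy case the second parameter has deltas of opposite sign (-0.5 and 0.8); the larger total magnitude wins the sign election and the disagreeing value is dropped rather than averaged in, which is the mechanism that reduces, but does not remove, interference.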
The base model version must match. Merging is only valid when the LoRA adapter was trained against the exact same base model checkpoint. Even a precision difference (bfloat16 base at training vs. float16 base at merge time) can produce a subtly wrong merged weight. Always record the base model SHA or commit hash, along with the dtype it was loaded in, alongside adapter checkpoints.
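One way to record and later verify the base checkpoint is to hash its weight files and store the result next to the adapter. The sketch below uses only the standard library; the file name `base_provenance.json` and the helpers `write_provenance` and `base_matches` are hypothetical conventions for this example, not part of PEFT or Transformers.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def _hash_files(paths):
    return {Path(p).name: file_sha256(p) for p in paths}

def write_provenance(adapter_dir, base_weight_files, base_dtype):
    """Store base-file hashes and the training dtype next to the adapter."""
    record = {"base_dtype": base_dtype, "base_files": _hash_files(base_weight_files)}
    out = Path(adapter_dir) / "base_provenance.json"   # hypothetical file name
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record

def base_matches(adapter_dir, base_weight_files, base_dtype):
    """Return True only if the base files and dtype match what was recorded."""
    recorded = json.loads((Path(adapter_dir) / "base_provenance.json").read_text())
    return (recorded["base_dtype"] == base_dtype
            and recorded["base_files"] == _hash_files(base_weight_files))
```

Calling `base_matches(...)` before a merge and refusing to proceed on a mismatch turns a silent quality regression into an explicit failure.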
Scale bugs silently degrade quality. If the adapter was trained with use_rslora=True (rank-stabilised LoRA, which uses alpha / sqrt(r) as the scale), but the merge code applies alpha / r, every merged delta will be too small by a factor of sqrt(r), for example 4x at r=16. The model will still produce fluent output but with degraded task performance. This is hard to catch without evaluation.
Further reading
- Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models" - arXiv:2106.09685. The original paper; Section 4 details why the merge incurs zero inference overhead.
- HuggingFace PEFT documentation on LoRA, including merge_and_unload, merge_adapter, and add_weighted_adapter: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
- Ilharco et al. (2023), "Editing Models with Task Arithmetic" - arXiv:2212.04089. Foundational work on delta-weight arithmetic; directly motivates multi-adapter merging.
- Yu et al. (2023), "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE) - arXiv:2311.03099. Shows that 90-99% of delta parameters can be dropped before merging with negligible quality loss.