Model Merging: Linear and SLERP

You can train two models - one specialised for reasoning and one for creative writing - and merge them into a single checkpoint that costs nothing extra at inference time. No additional training, no architectural changes, no ensembling overhead. That is the practical promise of model merging, and it is now routinely used to build competitive open-weight models on Hugging Face.

This concept covers the two most widely used interpolation strategies: plain linear (weighted average) and SLERP (Spherical Linear Interpolation). Both operate entirely in weight space, require no labelled data, and produce a checkpoint identical in shape to either parent.

The geometry of fine-tuned weight space

A pre-trained model occupies a point theta_base in a very high-dimensional weight space (billions of dimensions for a 7B model). Fine-tuning nudges that point toward a region of the manifold that is good for some target behaviour. The key empirical observation, formalised in the "Task Arithmetic" paper (Ilharco et al., ICLR 2023), is that the delta between fine-tuned and base weights, called a task vector, is a meaningful direction:

tau = theta_ft - theta_base

Adding or scaling task vectors tends to combine capabilities in a surprisingly linear way, at least when the fine-tuned models share the same base checkpoint and were not driven too far from it.
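The task-vector arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming each checkpoint is a plain dict mapping tensor names to NumPy arrays; the helper names (`task_vector`, `apply_task_vectors`), the `scale` parameter and the toy values are hypothetical and not taken from the paper's code.

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    """Return tau = theta_ft - theta_base, computed tensor by tensor."""
    return {name: theta_ft[name] - theta_base[name] for name in theta_base}

def apply_task_vectors(theta_base, taus, scale=1.0):
    """Add the scaled sum of several task vectors back onto the base weights."""
    merged = {}
    for name, base in theta_base.items():
        merged[name] = base + scale * sum(tau[name] for tau in taus)
    return merged

# Toy "checkpoints" with a single 2x2 weight matrix each (hypothetical values).
theta_base = {"w": np.zeros((2, 2))}
theta_reasoning = {"w": np.array([[1.0, 0.0], [0.0, 0.0]])}
theta_writing = {"w": np.array([[0.0, 0.0], [0.0, 2.0]])}

taus = [
    task_vector(theta_reasoning, theta_base),
    task_vector(theta_writing, theta_base),
]
theta_multi = apply_task_vectors(theta_base, taus, scale=0.5)
```

With `scale=0.5` and two task vectors, the result equals the plain average of the two fine-tuned models, which is one way to see why task arithmetic and linear merging are closely related when both models share a base.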

This linearity is not guaranteed by theory; it is an empirical regularity, usually attributed to modern pre-training leaving the loss landscape locally quite flat around theta_base. Models fine-tuned from the same base therefore often sit in the same loss basin, connected by low-loss paths. That shared basin is the precondition that makes weight averaging sensible.

Linear merging (weighted average)

Given two fine-tuned models A and B, the linear merge is:

theta_merged = alpha * theta_A + (1 - alpha) * theta_B

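where alpha in [0, 1] controls the blend. When alpha = 0.5 this is a simple arithmetic mean, the two-model case of the uniform "model soup" from Wortsman et al. (2022), the paper that popularised averaging the weights of multiple fine-tuning runs that differ only in their hyperparameters. Unlike ensembling, which averages predictions and runs every model at inference time, a soup averages weights and runs one model.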

For a full layer-by-layer merge you apply this independently to every weight tensor: attention projections, MLP weights, layer norms, embeddings. Because the merged tensor has exactly the same shape as either parent, the model can load and run immediately.
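As a rough sketch of the per-tensor merge described above, the following assumes both checkpoints are dicts of NumPy arrays with identical keys and shapes (a stand-in for a framework state dict). The function name `linear_merge` and the tensor names are hypothetical, chosen only for illustration.

```python
import numpy as np

def linear_merge(state_a, state_b, alpha=0.5):
    """Compute alpha * A + (1 - alpha) * B independently for every tensor."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must contain the same tensor names")
    merged = {}
    for name, w_a in state_a.items():
        w_b = state_b[name]
        if w_a.shape != w_b.shape:
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = alpha * w_a + (1.0 - alpha) * w_b
    return merged

# Toy checkpoints: one projection matrix and one layer-norm weight each.
state_a = {"attn.q_proj": np.ones((2, 2)), "ln.weight": np.array([1.0, 1.0])}
state_b = {"attn.q_proj": 3.0 * np.ones((2, 2)), "ln.weight": np.array([2.0, 2.0])}

merged = linear_merge(state_a, state_b, alpha=0.25)
```

The key and shape checks are a design choice rather than a requirement of the formula: they fail early when the two parents do not share an architecture, which is the case where a linear merge is not defined.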

Practical properties:

Property	Notes
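Speed	O(n) in parameter count; the arithmetic is fast even on CPU, and disk I/O usually dominates for multi-billion-parameter checkpoints
Memory	Holding both models at once needs roughly 2x the memory of one (RAM on CPU, VRAM on GPU); merging tensor by tensor lowers the peak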
Number of models	Generalises to k models with k mixing coefficients summing to 1
Requires base model	No; can merge two fine-tunes directly
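
The k-model generalisation in the table can be sketched as a weighted sum over checkpoints. This is an illustrative sketch under the same assumption of dict-of-NumPy-array checkpoints; `merge_k` is a hypothetical helper name.

```python
import numpy as np

def merge_k(states, coeffs):
    """Weighted average of k checkpoints; coefficients must sum to 1."""
    if len(states) != len(coeffs):
        raise ValueError("need exactly one coefficient per checkpoint")
    if not np.isclose(sum(coeffs), 1.0):
        raise ValueError("mixing coefficients must sum to 1")
    return {
        name: sum(c * state[name] for c, state in zip(coeffs, states))
        for name in states[0]
    }

# Three toy checkpoints holding a single two-element weight vector each.
states = [
    {"w": np.array([1.0, 0.0])},
    {"w": np.array([0.0, 1.0])},
    {"w": np.array([1.0, 1.0])},
]
merged_k = merge_k(states, [0.5, 0.25, 0.25])
```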

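Searching alpha is cheap: each candidate needs only a merge and an evaluation, with no training, so you can sweep 10-20 values on a small held-out set in minutes. Wortsman et al. reported that averaging models fine-tuned with different hyperparameter configurations often matches or beats the best individual model on CLIP and ViT benchmarks, most reliably with their "greedy soup" variant, which adds a model to the average only if it improves held-out accuracy.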
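
A sweep over alpha might look like the sketch below. It assumes a caller-supplied `evaluate` function that returns a higher-is-better score on held-out data; here that function is replaced by a toy stand-in (negative squared distance to a made-up target), so `sweep_alpha`, `toy_evaluate` and all values are hypothetical.

```python
import numpy as np

def sweep_alpha(state_a, state_b, evaluate, num_points=11):
    """Try evenly spaced alphas in [0, 1] and return the best (alpha, score)."""
    best_alpha, best_score = None, -np.inf
    for alpha in np.linspace(0.0, 1.0, num_points):
        merged = {
            name: alpha * state_a[name] + (1.0 - alpha) * state_b[name]
            for name in state_a
        }
        score = evaluate(merged)
        if score > best_score:
            best_alpha, best_score = float(alpha), score
    return best_alpha, best_score

# Toy stand-in for a held-out evaluation: closeness to a made-up target.
target = np.array([0.7, 0.7])

def toy_evaluate(state):
    return -float(np.sum((state["w"] - target) ** 2))

state_a = {"w": np.array([0.0, 0.0])}
state_b = {"w": np.array([1.0, 1.0])}
best_alpha, best_score = sweep_alpha(state_a, state_b, toy_evaluate)
```

In a real setting `evaluate` would load the merged weights into the model and compute a validation metric; picking alpha on the same data used for final reporting would bias the result, so a separate held-out split is the safer assumption.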
