Model Soups
Model soups average the weights of multiple independently fine-tuned checkpoints to produce a single model that often outperforms every individual checkpoint without increasing inference cost.
Averaging numbers is easy; averaging neural network weights is supposed to destroy everything. Two models trained from scratch with different random seeds typically sit in different loss basins, and naive interpolation should land you on the high-loss ridge between them. Yet Wortsman et al. (ICML 2022) showed that fine-tuned models do not behave this way: if they all start from the same pretrained checkpoint, their weights can be averaged, and the resulting "soup" routinely beats every individual ingredient on both in-distribution accuracy and out-of-distribution robustness.
The ViT-G soup they demonstrated reached 90.94% top-1 accuracy on ImageNet - state of the art at the time - with no inference or memory cost beyond that of a single model. The ingredients came from the hyperparameter sweep that would have been run anyway. That result upended the conventional wisdom that you must pick one checkpoint and discard the rest.
Why weight averaging works here
The key is shared initialisation. Standard gradient descent from two independent random initialisations leads to solutions in different modes of the loss landscape - models whose weights cannot be linearly interpolated without crossing a high-loss barrier, even though curved low-loss paths between them may exist (Fort et al., 2019 examined this for ensembles). Fine-tuning from a large, converged pretrained checkpoint is different: every run starts in the same basin. Hyperparameter variation (learning rate, augmentation strength, label smoothing) moves each run to a slightly different low-loss flat region, but those regions are close enough that the straight line between any two of them stays low in loss.
This property is called linear mode connectivity: two solutions are linearly mode-connected when the loss along the linear interpolation between them does not rise materially above the endpoints. Large pretrained models exhibit it reliably after fine-tuning; small models trained from scratch generally do not.
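One way to make the definition concrete is to measure the loss along the straight line between two checkpoints. The sketch below is illustrative only: weights are plain Python lists standing in for tensors, `loss_fn` is a hypothetical callable you would supply, and the barrier is defined against the worse endpoint, which is one of several conventions in use.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]  # toy stand-in for a state dict of tensors


def interpolate(theta_a: Weights, theta_b: Weights, t: float) -> Weights:
    """Return (1 - t) * theta_a + t * theta_b, parameter by parameter."""
    return {
        name: [(1 - t) * a + t * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def loss_barrier(
    loss_fn: Callable[[Weights], float],
    theta_a: Weights,
    theta_b: Weights,
    steps: int = 11,
) -> float:
    """Highest loss on the straight line minus the worse endpoint loss.

    A value near zero suggests the two solutions are linearly mode-connected.
    """
    path = [loss_fn(interpolate(theta_a, theta_b, i / (steps - 1))) for i in range(steps)]
    return max(path) - max(path[0], path[-1])


# Toy losses (assumptions for illustration): one basin versus two basins.
def single_basin(theta: Weights) -> float:
    return sum(w * w for w in theta["w"])


def double_well(theta: Weights) -> float:
    return sum((w * w - 1) ** 2 for w in theta["w"])


a, b = {"w": [-1.0]}, {"w": [1.0]}
print(loss_barrier(single_basin, a, b))  # no rise above the endpoints
print(loss_barrier(double_well, a, b))   # a ridge sits between the two minima
```

In practice `loss_fn` would run the interpolated weights over a validation set, so each point on the path costs one evaluation pass.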
The practical consequence is that you can think of fine-tuned checkpoints as points in a flat plateau of the loss landscape. Their centroid - the arithmetic mean of the weight tensors - is also in that plateau, and it benefits from variance reduction. Each hyperparameter run is a noisy estimate of the optimal direction; averaging suppresses the noise.
The greedy soup algorithm
Averaging all checkpoints indiscriminately can hurt performance if some ingredients are poor. The original paper proposes a simple greedy construction:
- Rank all candidate checkpoints by validation accuracy.
- Start with the best single checkpoint as the current soup.
- For each remaining checkpoint (in descending order of solo accuracy), tentatively add it to the soup by updating the running average.
- Keep the update only if the new soup's validation accuracy is at least as good as the previous soup.
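ranked = sorted(checkpoints, key=val_acc, reverse=True)  # best solo model first
ingredients = [ranked[0]]
soup = ranked[0]
for ckpt in ranked[1:]:
    candidate = mean(ingredients + [ckpt])  # uniform mean over all kept ingredients
    if val_acc(candidate) >= val_acc(soup):
        ingredients.append(ckpt)
        soup = candidate
This O(N) scan over N checkpoints costs one validation-set evaluation per candidate and avoids committing to checkpoints that introduce interference. Note that the candidate is the uniform mean of every ingredient kept so far, not the mean of the previous soup and the new checkpoint, which would weight later ingredients more heavily. The final model is a single set of weights - no ensembling at inference time.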
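The greedy procedure can be sketched end to end as follows. This is a minimal illustration, not the paper's code: weights are plain lists rather than tensors, and `val_acc` is a hypothetical scoring callable (here a toy function that rewards closeness to an assumed optimum).

```python
from typing import Callable, Dict, List, Sequence

Weights = Dict[str, List[float]]  # toy stand-in for a state dict of tensors


def average(models: Sequence[Weights]) -> Weights:
    """Uniform arithmetic mean of the given weight dicts."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def greedy_soup(checkpoints: Sequence[Weights],
                val_acc: Callable[[Weights], float]) -> Weights:
    """Add checkpoints in descending solo score, keeping each one only if
    the averaged soup scores at least as well as the current soup."""
    ranked = sorted(checkpoints, key=val_acc, reverse=True)
    ingredients = [ranked[0]]
    best = val_acc(ranked[0])
    for ckpt in ranked[1:]:
        score = val_acc(average(ingredients + [ckpt]))
        if score >= best:
            ingredients.append(ckpt)
            best = score
    return average(ingredients)


# Toy validation score (an assumption): higher is better, optimum at [1, 1].
def toy_val_acc(theta: Weights) -> float:
    return -sum((w - 1.0) ** 2 for w in theta["w"])


candidates = [{"w": [1.2, 0.9]}, {"w": [0.7, 1.1]}, {"w": [3.0, 3.0]}]
soup = greedy_soup(candidates, toy_val_acc)
print(soup)  # the outlier [3.0, 3.0] is rejected; the other two are averaged
```

Caching the score of the current soup, as `best` does here, avoids re-evaluating it on every iteration.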
Connecting to task arithmetic and alignment merging
The model-soup insight seeded a broader research programme on weight-space editing. Ilharco et al. (ICLR 2023) formalised "task vectors" - the delta between a fine-tuned model and its pretrained base - and showed these deltas can be composed arithmetically: adding task vectors combines capabilities, negating a vector selectively suppresses a skill.
In the alignment context this matters concretely. Suppose you run DPO on a helpfulness dataset and separately run SFT on a coding dataset. Both fine-tunes start from the same base model. Rather than picking one or iteratively fine-tuning, you can:
theta_merged = theta_base
+ alpha * (theta_helpfulness - theta_base)
+ beta * (theta_coding - theta_base)
The scalars alpha and beta control how much of each "ability" to inject. This is a linear combination in weight space (an interpolation only when the coefficients are constrained accordingly), and it works because both fine-tunes are in the same linear-mode-connected neighbourhood of the base.
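The formula above translates directly into code. The sketch below assumes weights are plain Python lists keyed by parameter name; the function names `task_vector` and `apply_task_vectors` are made up for illustration and do not come from any library.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # toy stand-in for a state dict of tensors


def task_vector(theta_ft: Weights, theta_base: Weights) -> Weights:
    """Delta between a fine-tuned model and its pretrained base."""
    return {
        name: [f - b for f, b in zip(theta_ft[name], theta_base[name])]
        for name in theta_base
    }


def apply_task_vectors(theta_base: Weights,
                       vectors: Sequence[Weights],
                       scales: Sequence[float]) -> Weights:
    """theta_base + sum_i scales[i] * vectors[i]."""
    merged = {name: list(vals) for name, vals in theta_base.items()}
    for vec, scale in zip(vectors, scales):
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], vec[name])]
    return merged


theta_base = {"w": [1.0, 2.0]}
theta_helpfulness = {"w": [1.5, 2.0]}  # toy values
theta_coding = {"w": [1.0, 3.0]}       # toy values

tv_help = task_vector(theta_helpfulness, theta_base)
tv_code = task_vector(theta_coding, theta_base)

theta_merged = apply_task_vectors(theta_base, [tv_help, tv_code], [1.0, 0.5])
print(theta_merged)

# A negative scale subtracts a task vector, the "negation" case.
print(apply_task_vectors(theta_base, [tv_help], [-1.0]))
```

The scales are usually tuned on held-out data, since values that are too large push the merged model out of the low-loss region.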
Two methods address interference between task vectors when the deltas conflict at the same parameter positions:
- TIES-Merging (Yadav et al., NeurIPS 2023): trims each task vector to its largest-magnitude entries (small changes are likely noise), elects a sign for each parameter according to which direction carries the larger total magnitude across models, then averages only the values that agree with the elected sign.
- DARE (Yu et al., ICML 2024): randomly drops a large fraction (90-99%) of the delta parameters before merging, then rescales the survivors by 1/(1 - p), where p is the drop rate, to preserve the expected value of each delta. This works because the deltas produced by supervised fine-tuning are small and highly redundant, so most of them can be removed with little effect on behaviour.
Both methods improve over uniform averaging on multi-task merging benchmarks, particularly when merging many models simultaneously.
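Both procedures are short enough to sketch on flat lists of deltas. The code below is a simplified reading of the two papers, not their reference implementations: the function names, the `density` parameter, and the handling of ties at the trim threshold are assumptions made for illustration.

```python
import random
from typing import List


def dare(delta: List[float], drop_rate: float, rng: random.Random) -> List[float]:
    """Drop each delta with probability drop_rate; rescale survivors by
    1 / (1 - drop_rate) so the expected value of each entry is unchanged."""
    keep = 1.0 - drop_rate
    return [d / keep if rng.random() >= drop_rate else 0.0 for d in delta]


def ties_merge(deltas: List[List[float]], density: float = 0.2) -> List[float]:
    """Trim, elect signs, then average only the sign-consistent values."""
    # 1. Trim: keep roughly the top `density` fraction of entries by magnitude
    #    in each task vector (entries tied with the threshold are also kept).
    trimmed = []
    for delta in deltas:
        k = max(1, int(round(density * len(delta))))
        threshold = sorted((abs(d) for d in delta), reverse=True)[k - 1]
        trimmed.append([d if abs(d) >= threshold else 0.0 for d in delta])

    merged = []
    for column in zip(*trimmed):
        # 2. Elect sign: the sign of the summed values, i.e. the direction
        #    that carries the larger total magnitude across models.
        total = sum(column)
        if total == 0.0:
            merged.append(0.0)
            continue
        # 3. Disjoint mean: average only non-zero values that match that sign.
        agreeing = [v for v in column if v != 0.0 and (v > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


toy_deltas = [
    [0.5, 0.01, -0.3, 0.0],
    [0.4, -0.02, 0.6, 0.0],
    [-0.1, 0.03, 0.5, 0.0],
]
print(ties_merge(toy_deltas, density=0.5))
print(dare([0.2, -0.1, 0.3, 0.05], drop_rate=0.5, rng=random.Random(0)))
```

In both cases the output is still a delta: it is scaled and added back to the shared base weights to obtain the merged model.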
WiSE-FT: interpolating for robustness
A practically important special case arose before the "soups" paper itself: WiSE-FT (Wortsman et al., CVPR 2022). Fine-tuning a CLIP model on a target dataset improves in-distribution accuracy but degrades out-of-distribution (OOD) robustness - the model forgets the general representations that made the zero-shot model robust. WiSE-FT simply interpolates between the zero-shot weights and the fine-tuned weights:
theta_wise = (1 - alpha) * theta_zero_shot + alpha * theta_fine_tuned
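Choosing alpha strictly between 0 and 1 trades off the two endpoints. The paper reports that, on ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points over prior work while also raising ImageNet accuracy by 1.6 percentage points. The gain is measured against standard fine-tuning approaches, not against the zero-shot model, whose in-distribution accuracy is far lower. An intermediate alpha can therefore match or beat both endpoints on the joint objective, which makes it worth sweeping alpha rather than assuming either endpoint is best.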
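A sweep over alpha is cheap to set up because each setting needs only an interpolation and an evaluation. The sketch below assumes weights are plain lists and that `metrics` maps names to hypothetical evaluation callables (for instance one in-distribution and one out-of-distribution score); none of these names come from the WiSE-FT codebase.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]  # toy stand-in for a state dict of tensors


def wise_ft(theta_zero_shot: Weights, theta_fine_tuned: Weights, alpha: float) -> Weights:
    """(1 - alpha) * theta_zero_shot + alpha * theta_fine_tuned."""
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(theta_zero_shot[name], theta_fine_tuned[name])]
        for name in theta_zero_shot
    }


def sweep_alpha(
    theta_zero_shot: Weights,
    theta_fine_tuned: Weights,
    metrics: Dict[str, Callable[[Weights], float]],
    alphas: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> List[Tuple[float, Dict[str, float]]]:
    """Evaluate every metric at every mixing coefficient."""
    rows = []
    for alpha in alphas:
        merged = wise_ft(theta_zero_shot, theta_fine_tuned, alpha)
        rows.append((alpha, {name: fn(merged) for name, fn in metrics.items()}))
    return rows


# Toy metrics (assumptions): each rewards a different coordinate being near 2.
toy_metrics = {
    "in_dist": lambda th: -(th["w"][0] - 2.0) ** 2,
    "ood": lambda th: -(th["w"][1] - 2.0) ** 2,
}
zero_shot = {"w": [0.0, 2.0]}
fine_tuned = {"w": [2.0, 0.0]}
for alpha, scores in sweep_alpha(zero_shot, fine_tuned, toy_metrics):
    print(alpha, scores)
```

Plotting the two scores against each other for each alpha gives the trade-off curve from which a mixing coefficient can be chosen.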
This interpolation is now a standard recipe when fine-tuning large vision-language models: never discard the pretrained weights entirely.
| Method | Key operation | Main benefit |
|---|---|---|
| Uniform soup | Mean of all checkpoints | Variance reduction |
| Greedy soup | Conditional mean, accuracy-gated | Avoids bad ingredients |
| WiSE-FT | Interpolate zero-shot + fine-tuned | Preserves OOD robustness |
| Task arithmetic | Sum of scaled task vectors | Multi-ability composition |
| TIES-Merging | Trim + sign-resolve + merge | Reduces parameter interference |
| DARE | Random drop + rescale deltas | Removes redundant deltas before merging |
When it falls down
Models must share the same architecture and pretrained base. You cannot average a LLaMA-3-8B checkpoint with a Mistral-7B checkpoint: their vocabularies differ, so the embedding and output matrices have different shapes, and even where tensor shapes happen to match, the values come from unrelated training runs and live in incommensurable spaces. Even two LLaMA-3-8B runs are only safely mergeable if they started from the same pretrained checkpoint; diverged bases break linear mode connectivity.
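A cheap guard before any merge is to compare parameter names and shapes. The sketch below works on name-to-shape mappings; the parameter names are illustrative rather than taken from a real checkpoint, and passing the check is necessary but not sufficient, since it cannot tell whether two models share a pretrained base.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]  # parameter name -> tensor shape


def merge_incompatibilities(shapes_a: Shapes, shapes_b: Shapes) -> List[str]:
    """List the structural reasons two checkpoints cannot be averaged."""
    problems = []
    for name in sorted(set(shapes_a) | set(shapes_b)):
        if name not in shapes_a or name not in shapes_b:
            problems.append(f"{name}: present in only one model")
        elif shapes_a[name] != shapes_b[name]:
            problems.append(f"{name}: shape {shapes_a[name]} vs {shapes_b[name]}")
    return problems


# Illustrative shapes: different vocabulary sizes, identical MLP projection.
model_a = {
    "embed_tokens.weight": (128256, 4096),
    "layers.0.mlp.up_proj.weight": (14336, 4096),
}
model_b = {
    "embed_tokens.weight": (32000, 4096),
    "layers.0.mlp.up_proj.weight": (14336, 4096),
}
for problem in merge_incompatibilities(model_a, model_b):
    print(problem)
```

An empty result only rules out shape errors; whether the merge is meaningful still depends on the shared-base condition described above.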
Long, divergent fine-tuning breaks connectivity. The linear mode connectivity assumption weakens as fine-tuning continues. A checkpoint trained for many epochs with a high learning rate on a small, narrow dataset may have moved far enough from the pretrained basin that averaging with other runs increases loss. PEFT methods (LoRA, DoRA) partially mitigate this: because most of the original weights are frozen, the "effective distance" between fine-tuned models remains small.
Validation leakage is a real risk. Greedy soup selection uses a validation set to decide which checkpoints to include. If the validation set is the same one used to tune hyperparameters, you may select a soup that is overfit to that partition. The gains may not generalise to the held-out test set or deployment distribution.
Scaling to very large models is expensive. Each candidate ingredient requires a full evaluation pass over the validation set. For a 70B-parameter model, even a handful of evaluation passes is non-trivial. The weight tensors themselves are large: holding N candidate checkpoints at once requires N times the VRAM or disk space during the selection phase.
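The memory side of this cost can be reduced for uniform averaging by streaming checkpoints and updating a running mean, so only the mean and one checkpoint are resident at a time. This is a sketch under stated assumptions: weights are plain lists, and `load_lazily` is a hypothetical generator standing in for reading checkpoints from disk one by one.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Weights = Dict[str, List[float]]  # toy stand-in for a state dict of tensors


def running_mean(checkpoint_stream: Iterable[Weights]) -> Weights:
    """Uniform average computed incrementally: mean += (x - mean) / count."""
    mean: Optional[Weights] = None
    count = 0
    for ckpt in checkpoint_stream:
        count += 1
        if mean is None:
            mean = {name: list(vals) for name, vals in ckpt.items()}
        else:
            for name, vals in ckpt.items():
                mean[name] = [m + (v - m) / count for m, v in zip(mean[name], vals)]
    if mean is None:
        raise ValueError("no checkpoints were provided")
    return mean


def load_lazily() -> Iterator[Weights]:
    """Hypothetical loader: yields one checkpoint at a time."""
    for values in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
        yield {"w": values}


print(running_mean(load_lazily()))
```

Greedy selection can use the same idea by keeping the running mean of accepted ingredients plus the single candidate under test, although every candidate still needs its own evaluation pass.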
Sign conflicts grow with N. When merging many models, parameter signs increasingly disagree. TIES and DARE help, but no current method fully recovers the performance of individually fine-tuned models on all tasks simultaneously when N is large. The merged model is a compromise, and the compromise worsens as tasks diverge semantically.
Further reading
- Wortsman et al., "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time," ICML 2022: https://arxiv.org/abs/2203.05482
- Ilharco et al., "Editing Models with Task Arithmetic," ICLR 2023: https://arxiv.org/abs/2212.04089
- Yadav et al., "TIES-Merging: Resolving Interference When Merging Models," NeurIPS 2023: https://arxiv.org/abs/2306.01708
- Yu et al., "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE), ICML 2024: https://arxiv.org/abs/2311.03099