Catastrophic Forgetting in Fine-Tuning

Fine-tune a GPT-style model on a single-domain dataset for a few epochs and you will likely ship a model that scores higher on that domain's benchmark while quietly forgetting how to perform arithmetic, answer general trivia, or even follow instruction formats it handled effortlessly before. The benchmark looks great; the production complaints arrive later.

This phenomenon is called catastrophic forgetting, and it is not a bug in any one training library. It is a structural property of how gradient descent updates shared weights.

What Actually Happens to the Weights

A pre-trained LLM encodes a vast number of capabilities in the values of its weight matrices. Each gradient step during fine-tuning moves weights to reduce loss on the new task. Because the same weights serve both old and new knowledge, a step that helps the new task can hurt the old one. Neither standard stochastic gradient descent (SGD) nor Adam has any mechanism to preserve previous capabilities unless you explicitly encode that constraint.

The interference is not random. Weights that are most important for the target task tend to be updated most aggressively. If those same weights were also load-bearing for a prior capability (say, general commonsense reasoning), that capability degrades proportionally.

A useful framing: think of the loss landscape as a high-dimensional surface with many minima. Pre-training finds a minimum that satisfies many tasks simultaneously. Full fine-tuning on one task moves the parameters toward a minimum of the fine-tuning objective. That point can be close to the starting point in parameter space and still lie outside the low-loss region for other tasks.

Formally, if \(\theta^*_\text{pre}\) is the pre-trained minimum and \(\theta^*_\text{ft}\) is the point where fine-tuning stops, the distance \(\|\theta^*_\text{ft} - \theta^*_\text{pre}\|\) tends to grow with learning rate, dataset size, and number of epochs. There is no free lunch: the further you travel in parameter space, the more prior knowledge you can overwrite.
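The effect can be reproduced on a toy problem. The sketch below is an illustration under strong assumptions, not a model of an LLM: two made-up linear regression tasks share one weight vector, "pre-training" fits task A, and "fine-tuning" runs plain gradient descent on task B only. All names (`gradient_descent`, `true_a`, `true_b`, the step counts) are hypothetical.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model y ~ X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))

def gradient_descent(theta, X, y, lr, steps):
    """Full-batch gradient descent on the MSE loss; returns a new vector."""
    theta = theta.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)
dim = 8
# Two synthetic tasks that want different solutions from the same weights.
true_a = rng.normal(size=dim)
true_b = rng.normal(size=dim)
X_a = rng.normal(size=(256, dim))
X_b = rng.normal(size=(256, dim))
y_a, y_b = X_a @ true_a, X_b @ true_b

# "Pre-training": fit task A from scratch.
theta_pre = gradient_descent(np.zeros(dim), X_a, y_a, lr=0.05, steps=500)

# "Fine-tuning": train on task B only, for increasing numbers of steps.
rows = []
for steps in (0, 5, 20, 100):
    theta_ft = gradient_descent(theta_pre, X_b, y_b, lr=0.05, steps=steps)
    distance = float(np.linalg.norm(theta_ft - theta_pre))
    rows.append((steps, distance, mse(theta_ft, X_a, y_a), mse(theta_ft, X_b, y_b)))

for steps, distance, loss_a, loss_b in rows:
    print(f"steps={steps:3d}  distance={distance:.3f}  "
          f"task_A_loss={loss_a:.3f}  task_B_loss={loss_b:.3f}")
```

In this setup the distance from `theta_pre` grows with the number of fine-tuning steps, task B loss falls, and task A loss rises, because nothing in the update rule refers to task A.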

Why the Problem Is Worse at Scale

Counterintuitively, empirical work on models from 1B to 7B parameters has found that larger models sometimes exhibit more severe forgetting during continual instruction tuning, not less (Luo et al., 2023). One hypothesis is that larger models develop more specialised weight circuits during pre-training, so any given weight is more likely to be load-bearing for multiple capabilities. Disrupting those circuits during fine-tuning has a broader blast radius.

A second compounding factor is dataset imbalance. Fine-tuning corpora are typically orders of magnitude smaller and narrower than pre-training corpora. A few thousand domain-specific examples will push loss down sharply on that domain; the gradient signal from everything else is simply absent, so those capabilities drift.

There is also an architectural dimension. Decoder-only models appear to retain knowledge better across sequential fine-tuning than encoder-decoder models in some settings, though this result is architecture- and task-dependent enough that it should not be treated as a universal rule.

The Three Main Mitigation Strategies

1. Regularisation-based methods

Elastic Weight Consolidation (EWC) adds a quadratic penalty to the fine-tuning loss that resists movement of weights that were important for previous tasks:

\[\mathcal{L}_\text{EWC}(\theta) = \mathcal{L}_\text{task}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta^*_{\text{pre},i})^2\]

Here \(F_i\) is the diagonal Fisher information estimate for weight \(i\), a cheap proxy for how much that weight contributed to prior task performance (Kirkpatrick et al., 2017). Weights with high Fisher values get penalised heavily for moving; weights with low Fisher values are left free to adapt. EWC was demonstrated on sequential MNIST variants and Atari games, but scaling the Fisher diagonal to billion-parameter LLMs is expensive and the approximation quality degrades.
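The penalty term can be written in a few lines of NumPy. This is a minimal sketch, assuming the parameters are flattened into one vector and that per-example log-likelihood gradients on prior-task data are already available; the function names and the two-weight example values are hypothetical.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Estimate the diagonal Fisher as the mean squared per-example gradient.

    per_example_grads has shape (n_examples, n_params) and is assumed to hold
    log-likelihood gradients on prior-task data, evaluated at theta_pre.
    """
    return np.mean(per_example_grads ** 2, axis=0)

def ewc_penalty(theta, theta_pre, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_pre_i)^2, as in the formula above."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_pre) ** 2))

def ewc_penalty_grad(theta, theta_pre, fisher, lam):
    """Gradient of the penalty, to be added to the task-loss gradient."""
    return lam * fisher * (theta - theta_pre)

# Made-up example: weight 0 mattered for the prior task, weight 1 barely did.
grads = np.array([[2.0, 0.1], [-2.0, -0.1]])
fisher = diagonal_fisher(grads)
theta_pre = np.array([1.0, 1.0])

move_important = ewc_penalty(np.array([1.5, 1.0]), theta_pre, fisher, lam=2.0)
move_unimportant = ewc_penalty(np.array([1.0, 1.5]), theta_pre, fisher, lam=2.0)
print(move_important, move_unimportant)
```

Moving the high-Fisher weight by 0.5 costs far more than moving the low-Fisher weight by the same amount, which is the behaviour the penalty is designed to produce.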

2. Replay and data mixing

A conceptually simpler approach: mix a fraction of pre-training-style data back into the fine-tuning corpus. If 10-20% of each training batch is sampled from a general-purpose dataset, the gradient signal continuously reinforces prior capabilities. This is sometimes called "experience replay" in the continual learning literature, and "data mixing" in practice.

The tradeoff is data management complexity and potential contamination of the fine-tuning signal. If your general-purpose replay data overlaps with the test sets you use to measure retention of prior capabilities, you can inflate those metrics without genuine retention.
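One way to build such batches is sketched below, using only the standard library. It assumes both datasets fit in memory as lists; `mixed_batches` and its parameters are hypothetical names, and a real pipeline would stream from disk and deduplicate the replay data against evaluation sets.

```python
import random

def mixed_batches(task_data, replay_data, batch_size, replay_fraction, seed=0):
    """Yield batches in which a fixed share of examples comes from replay_data.

    Task examples are visited once per pass in shuffled order; replay examples
    are sampled with replacement. A trailing partial batch is dropped.
    """
    n_replay = round(batch_size * replay_fraction)
    n_task = batch_size - n_replay
    if n_task < 1 or n_replay < 0:
        raise ValueError("replay_fraction must leave room for task examples")
    rng = random.Random(seed)
    order = list(task_data)
    rng.shuffle(order)
    for start in range(0, len(order) - n_task + 1, n_task):
        batch = order[start:start + n_task] + rng.choices(replay_data, k=n_replay)
        rng.shuffle(batch)  # avoid a fixed task/replay ordering inside the batch
        yield batch

# Made-up data: 32 domain examples, 100 general-purpose examples.
task_data = [("task", i) for i in range(32)]
replay_data = [("replay", i) for i in range(100)]
batches = list(mixed_batches(task_data, replay_data, batch_size=8, replay_fraction=0.25))
print(len(batches), [source for source, _ in batches[0]])
```

Fixing the replay share per batch, instead of concatenating the two datasets, keeps the ratio stable even when the replay corpus is much larger than the fine-tuning set.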

3. Parameter-efficient fine-tuning (PEFT)

This is the most practically relevant mitigation for LLM practitioners. Methods like LoRA, adapters, and prefix tuning keep the pre-trained weights frozen and route adaptation through a small number of trainable parameters. Since the base weights do not move, they cannot be overwritten, and as long as the added parameters are kept separate, removing them recovers the original model exactly.
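For LoRA specifically, the mechanism looks like this in NumPy. The sketch assumes a single linear layer with no bias, forward pass only; the `LoRALinear` class and its method names are hypothetical and do not reproduce any particular library's API. The initialisation (Gaussian `A`, zero `B`, scale `alpha / rank`) follows the original LoRA formulation.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                      # trainable, zero at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus low-rank path; W itself is never modified.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def effective_weight(self):
        """The weight the layer behaves as if it had: W + scale * B @ A."""
        return self.weight + self.scale * self.B @ self.A

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
layer = LoRALinear(W, rank=2, alpha=4.0)
x = rng.normal(size=(3, 4))

# With B initialised to zero, the adapted layer matches the base layer exactly.
print(np.allclose(layer.forward(x), x @ W.T))
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

Note that the stored `weight` stays fixed while `effective_weight()` changes as `A` and `B` are trained, so the layer's behaviour can still shift even though nothing in the base model is overwritten.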

The empirical evidence is striking: Biderman et al. (2024, "LoRA Learns Less and Forgets Less") showed that LoRA fine-tuning substantially reduces forgetting of base model capabilities compared to full fine-tuning, at the cost of somewhat lower peak performance on the fine-tuning task itself. The title captures the tradeoff cleanly.

Approach	Forgetting risk	Peak task performance	Resource cost
Full fine-tuning	High	Highest	High
EWC	Medium	Medium	Medium (Fisher storage)
Replay/data mixing	Low-medium	High	Moderate (data pipeline)
LoRA / adapters	Low	Moderate-high	Low

PEFT methods are not magic; LoRA, for example, reduces forgetting by constraining the weight update to a low-rank subspace. Whether that subspace is expressive enough for your task is an empirical question answered by task performance, not by theory alone.

The Alignment Safety Dimension

Catastrophic forgetting has a particularly sharp edge in safety-critical fine-tuning. RLHF-trained models encode refusal and alignment behaviours in their weights. Domain fine-tuning on a narrow corpus can partially overwrite those behaviours even if the fine-tuning dataset contains no adversarial examples. The mechanism is the same: gradient steps on the task objective move weights that also encode safety constraints.

This motivates the use of PEFT for downstream fine-tuning of aligned models: freezing the base weights is one way to preserve alignment properties while still adapting to a new domain. It is not a complete defence, but it substantially reduces accidental degradation.

When It Falls Down

PEFT does not fully eliminate forgetting. At sufficiently high ranks or with long enough training, even LoRA can shift the effective weights (base plus adapter) far enough to degrade base capabilities. The "less forgetting" finding holds in typical low-rank, moderate-epoch settings.

EWC does not scale well. Computing and storing a per-weight Fisher diagonal for a 7B-parameter model requires memory proportional to model size, and the diagonal approximation of the true Fisher matrix becomes increasingly crude as model scale grows. Kronecker-factored approximations (K-FAC, KFAC-EWC) improve this but add engineering complexity.

Replay quality matters enormously. If the replay dataset is not representative of the pre-training distribution, you may prevent forgetting of the replayed tasks while still losing capabilities not covered by replay. Constructing a high-quality replay corpus for a frontier model is practically very difficult since pre-training data is often proprietary or too large to store.

Evaluation is unreliable. Most forgetting benchmarks measure a handful of capabilities (MMLU, HellaSwag, GSM8K). A model can score identically on these while having lost capabilities not covered by the benchmark. Forgetting is often invisible until a user finds the edge case in production.
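A small regression check makes the coverage gap explicit. The sketch below assumes you already have benchmark scores from before and after fine-tuning; `forgetting_report`, the `tolerance` parameter, and the scores in the example are made up for illustration.

```python
def forgetting_report(before, after, tolerance=0.0):
    """Compare per-benchmark scores before and after fine-tuning.

    Returns a dict mapping each benchmark in `before` to its score delta and a
    regression flag, or to None when the benchmark was not re-measured.
    """
    report = {}
    for name in sorted(before):
        if name not in after:
            report[name] = None  # unmeasured is not the same as retained
            continue
        delta = after[name] - before[name]
        report[name] = {"delta": delta, "regressed": delta < -tolerance}
    return report

# Illustrative scores only, not real results.
before = {"MMLU": 62.5, "GSM8K": 40.0, "HellaSwag": 75.0}
after = {"MMLU": 62.0, "GSM8K": 31.5}
report = forgetting_report(before, after, tolerance=1.0)
for name, entry in report.items():
    print(name, entry)
```

Returning `None` for benchmarks that were not re-run keeps missing measurements visible. A report like this still says nothing about capabilities that no benchmark in `before` covers.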

Task similarity affects severity. Fine-tuning on a task similar to pre-training induces less forgetting than fine-tuning on an out-of-distribution domain. The relationship is not linear and is hard to predict without empirical measurement.