Model Collapse from Recursive Training

By the time GPT-4-class models had been deployed publicly, a substantial fraction of new text appearing on the internet was already generated by models trained on earlier internet text. Feed that new web corpus back into the next training run, and you have a closed loop. Shumailov et al. (2023) named the resulting degradation model collapse: "use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear."

The word "irreversible" is doing real work there. This is not a temporary calibration error you can anneal away; it is a structural loss of distributional knowledge that compounds with each generation.

What collapse actually destroys

A language model does not store facts as key-value entries. It encodes a probability distribution over sequences. The high-probability region, broadly, is fluent, common-sense text. The low-probability tails hold rare languages, specialist vocabulary, unusual but valid sentence constructions, minority viewpoints, and long-tail factual knowledge.

When you sample from generation \(n\) and use those samples as training data for generation \(n+1\), you introduce a systematic bias: you over-represent whatever the model was already confident about, and under-represent everything near the tails. The next model learns from this biased corpus, becoming even more confident about the centre and even less aware of the periphery. Repeat.
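
The feedback loop above can be sketched with a toy categorical model. This is a minimal illustration, not the setup used by Shumailov et al.: the function name `recursive_resample`, the Zipf-like 1,000-token vocabulary, and the sample size are all assumptions, and the "model" is just a maximum-likelihood refit of token frequencies with no smoothing. Under those assumptions, a token that draws zero counts in one generation has probability zero in every later one, so the support can only shrink.

```python
import numpy as np

def recursive_resample(probs, sample_size, generations, seed=0):
    """Refit a categorical distribution to its own samples, generation after generation.

    Returns the number of tokens with non-zero probability after each generation
    (index 0 is the original distribution).
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(probs, dtype=float)
    support_sizes = [int(np.count_nonzero(p))]
    for _ in range(generations):
        counts = rng.multinomial(sample_size, p)  # sample a synthetic corpus
        p = counts / counts.sum()                 # maximum-likelihood refit, no smoothing
        support_sizes.append(int(np.count_nonzero(p)))
    return support_sizes

# Hypothetical Zipf-like vocabulary of 1,000 tokens (an assumption for illustration).
ranks = np.arange(1, 1001)
zipf = (1.0 / ranks) / np.sum(1.0 / ranks)

sizes = recursive_resample(zipf, sample_size=5000, generations=10)
print(sizes)  # surviving vocabulary size per generation
```

A real language model generalises and smooths, so rare tokens do not vanish this abruptly, but the toy shows the mechanism: the rarest items are the first to drop out, and nothing in the loop brings them back.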

Formally, let \(p_0\) be the true data distribution and \(\hat{p}_n\) the model's distribution at generation \(n\). Each generation trains on samples from \(\hat{p}_{n-1}\), introducing two compounding error sources:

Approximation error - the model at generation \(n-1\) is itself an imperfect estimate of \(p_0\).
Sampling error - drawing a finite sample from \(\hat{p}_{n-1}\) adds variance, which is worst in the tails.
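
A short worked estimate shows why sampling error bites hardest in the tails. Assume, for illustration only, that the next generation's corpus consists of \(M\) independent draws from \(\hat{p}_{n-1}\), and let \(q\) be the probability that distribution assigns to some rare event; both symbols are introduced here and are not part of the original analysis. The chance that the event never appears in the corpus is

\[
\Pr[\text{event absent}] = (1 - q)^{M} \approx e^{-qM}.
\]

With \(q = 10^{-4}\) and \(M = 10^{4}\) this is about \(e^{-1} \approx 0.37\), so roughly 37% of such events leave no trace in the training data. For \(q = 10^{-2}\) the same calculation gives \(e^{-100}\), which is negligible.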

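The result is that the variance of \(\hat{p}_n\) shrinks in expectation with each generation. In a Gaussian toy model, the estimated variance after \(k\) generations is, to first order, \(\hat{\sigma}^2_k \approx \hat{\sigma}^2_0 - k \cdot \delta\), where \(\delta > 0\) depends on the ratio of synthetic to real samples. The linear form holds only while \(k \cdot \delta \ll \hat{\sigma}^2_0\); beyond that the decay flattens out rather than crossing zero, since a variance cannot be negative. The distribution collapses inward. This is not a metaphor; it is what the maths shows.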
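
The Gaussian toy model is easy to simulate. The sketch below assumes every generation is trained only on synthetic samples and refit by maximum likelihood (NumPy's default `ddof=0` variance); the function name `gaussian_collapse` and all parameter values are illustrative choices, not taken from the paper. Under these assumptions each refit multiplies the expected variance by \(1 - 1/M\) for sample size \(M\), so \(\mathbb{E}[\hat{\sigma}^2_k] = \hat{\sigma}^2_0 (1 - 1/M)^k \approx \hat{\sigma}^2_0 - k \, \hat{\sigma}^2_0 / M\) for small \(k/M\), which corresponds to \(\delta \approx \hat{\sigma}^2_0 / M\) in this particular setting.

```python
import numpy as np

def gaussian_collapse(sigma0=1.0, sample_size=100, generations=50, runs=2000, seed=0):
    """Average fitted variance per generation when each Gaussian is refit
    to samples drawn from the previous generation's fit (fully synthetic)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(runs)                 # one independent chain per run
    var = np.full(runs, sigma0 ** 2)
    mean_var = [var.mean()]
    for _ in range(generations):
        # Draw sample_size points per run from the current fitted Gaussian.
        x = rng.normal(mu[:, None], np.sqrt(var)[:, None], size=(runs, sample_size))
        mu = x.mean(axis=1)
        var = x.var(axis=1)             # ddof=0: maximum-likelihood estimate
        mean_var.append(var.mean())
    return np.array(mean_var)

trajectory = gaussian_collapse()
expected = (1 - 1 / 100) ** np.arange(51)   # predicted decay for sample_size=100

for k in (0, 10, 25, 50):
    print(k, round(float(trajectory[k]), 3), round(float(expected[k]), 3))
```

Averaging over many independent chains matters here: any single chain wanders noisily, and the shrinkage only shows up cleanly in the mean.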

The three-phase phenomenology

Empirical studies across VAEs, Gaussian mixture models, and LLMs (Shumailov et al., 2023) consistently show three qualitative phases:

Phase	Symptom	What is lost
Early (gen 1-3)	Subtle stylistic homogenisation; less lexical diversity	Low-frequency tokens and constructions
Middle (gen 4-10)	Factual drift; hallucination of "median" facts	Rare but true information
Late (gen 10+)	Output degenerates toward repetitive, incoherent text	Most of the tail; model is effectively broken

The boundaries between phases depend on the fraction of synthetic data in each training run, the model's expressivity, and whether any fresh real data is mixed in, so the generation ranges in the table are indicative rather than fixed. With 100% synthetic replacement, collapse can reach the late phase in as few as five generations for smaller models.
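
The effect of mixing in fresh real data can be explored with the same kind of Gaussian toy. This is a sketch under stated assumptions, not a reproduction of any published experiment: the true distribution is taken to be a standard normal, each generation pools `synthetic_fraction` of its samples from the previous fit with the remainder drawn fresh from the true distribution, and the function name `final_variance` and all parameter values are hypothetical.

```python
import numpy as np

def final_variance(synthetic_fraction, sample_size=100, generations=50, runs=1000, seed=0):
    """Mean fitted variance after `generations` rounds of refitting a Gaussian
    to a pool of synthetic samples (from the previous fit) and real samples."""
    rng = np.random.default_rng(seed)
    n_syn = int(round(synthetic_fraction * sample_size))
    n_real = sample_size - n_syn
    mu = np.zeros(runs)
    var = np.ones(runs)
    for _ in range(generations):
        synthetic = rng.normal(mu[:, None], np.sqrt(var)[:, None], size=(runs, n_syn))
        real = rng.normal(0.0, 1.0, size=(runs, n_real))   # fresh draws from the true N(0, 1)
        pooled = np.concatenate([synthetic, real], axis=1)
        mu = pooled.mean(axis=1)
        var = pooled.var(axis=1)                           # maximum-likelihood refit
    return float(var.mean())

for fraction in (1.0, 0.9, 0.5):
    print(fraction, round(final_variance(fraction), 3))
```

In this toy, the real samples pull the fitted variance back toward the true value every generation, so the decay seen with full synthetic replacement does not accumulate in the same way. How far that carries over to large language models is an empirical question this sketch does not settle.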
