How Much Data Is Enough? The Chinchilla Correction
May 31, 2026 · 3 min read
Between 2020 and 2022, a strange consensus took hold: to make a language model better, make it bigger. The intuition had real evidence behind it. The Kaplan et al. scaling-law paper showed smooth, predictable improvements in loss as model size, data, and compute increased, and suggested that, given more compute, most of it should go into parameters (Kaplan et al., 2020, arXiv:2001.08361).
Then a team at DeepMind asked a sharper question: for a fixed compute budget, what is the optimal split between model size and training data? Their answer reshaped the field.
The compute-optimal frontier
The study, now widely known as "Chinchilla," trained over 400 models across a range of sizes and data quantities to map the trade-off precisely (Hoffmann et al., 2022, Training Compute-Optimal Large Language Models, arXiv:2203.15556). Its central finding was blunt: the largest models of the era were significantly undertrained. For compute-optimal training, model size and training tokens should scale in roughly equal proportion, rather than parameters growing much faster than data.
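The finding can be written compactly. The sketch below assumes the common approximation from the scaling-law literature that training a dense transformer costs about 6 FLOPs per parameter per token; the symbols N (parameters), D (training tokens), and C (compute budget) are introduced here for illustration and do not appear elsewhere in this post.

```latex
C \approx 6\,N\,D, \qquad
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
D_{\mathrm{opt}}(C) \propto C^{b}, \qquad
a \approx b \approx 0.5
```

With both exponents near 0.5, doubling the compute budget means growing N and D by about a factor of \sqrt{2} each. The Kaplan-era fits, by contrast, put the exponent on N well above the exponent on D, which is why parameters got most of the budget.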
To prove the point, they trained Chinchilla, a 70-billion-parameter model, on far more data than the 280-billion-parameter Gopher, using the same compute budget. The smaller, better-fed model won across a broad evaluation suite.
| Model | Parameters | Training tokens (approx.) | Compute |
|---|---|---|---|
| Gopher | 280B | 300B | baseline |
| Chinchilla | 70B | 1.4T | about the same as Gopher |
A quarter of the parameters, more than four times the data (1.4T versus 300B tokens), the same compute, better results. The headline was not "small is beautiful." It was "balance is efficient."
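The "same compute" claim can be sanity-checked with back-of-the-envelope arithmetic. This is a minimal sketch that takes the parameter and token counts from the table above and assumes training cost is roughly 6 FLOPs per parameter per token; that factor is an approximation, not a number reported in this post, and the function names are illustrative.

```python
# Parameter and token counts from the table above.
MODELS = {
    "Gopher": {"params": 280e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9, "tokens": 1.4e12},
}


def training_flops(params, tokens):
    """Approximate training cost, ASSUMING ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def compare(models):
    """Return approximate FLOPs and tokens-per-parameter for each model."""
    return {
        name: {
            "flops": training_flops(m["params"], m["tokens"]),
            "tokens_per_param": tokens_per_param(m["params"], m["tokens"]),
        }
        for name, m in models.items()
    }


if __name__ == "__main__":
    for name, row in compare(MODELS).items():
        print(f"{name}: {row['flops']:.2e} FLOPs, "
              f"{row['tokens_per_param']:.1f} tokens per parameter")
```

Under this approximation the two budgets come out within about 20% of each other, while the tokens-per-parameter ratio moves from about 1 for Gopher to about 20 for Chinchilla.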
Why this mattered beyond the leaderboard
The Chinchilla correction had consequences that compounded for years:
- Inference economics. A smaller compute-optimal model is cheaper to serve, not just to train. Since a model is trained once but queried many times after deployment, this shifted the entire cost calculus toward smaller, data-rich models.
- Data became the bottleneck. If optimal training demands tokens roughly in proportion to parameters, then the supply of high-quality text becomes a first-class constraint. This reframed data curation, deduplication, and filtering as central research problems rather than preprocessing chores.
- A new default recipe. Subsequent open model families were trained on token counts far exceeding the old Kaplan-era ratios, often deliberately past the compute-optimal point to make inference cheaper still.
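The inference argument above can be made concrete with a rough lifetime-cost model. This is a sketch under stated assumptions, not a costing method from the Chinchilla paper: it assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per served token, and the function names are hypothetical.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Training plus serving cost, ASSUMING 6*N*D for training and 2*N per served token."""
    train = 6.0 * params * train_tokens
    serve = 2.0 * params * served_tokens
    return train + serve


def break_even_served_tokens(small_params, small_tokens, large_params, large_tokens):
    """Served-token volume beyond which the smaller model is cheaper overall.

    Solves 6*Ns*Ds + 2*Ns*T = 6*Nl*Dl + 2*Nl*T for T. Returns 0.0 when the
    smaller model is already cheaper before any token is served.
    """
    if large_params <= small_params:
        raise ValueError("large_params must exceed small_params")
    extra_training = 6.0 * (small_params * small_tokens - large_params * large_tokens)
    saving_per_token = 2.0 * (large_params - small_params)
    return max(0.0, extra_training / saving_per_token)


if __name__ == "__main__":
    # Chinchilla-like versus Gopher-like configurations from the table above.
    t = break_even_served_tokens(70e9, 1.4e12, 280e9, 300e9)
    print(f"break-even after about {t:.2e} served tokens")
```

With the table's figures and these assumed constants, the crossover lands at roughly 200 billion served tokens; past that volume, the smaller model's lower per-token cost outweighs its slightly higher training bill. Real serving costs depend on hardware, batching, and sequence length, so the number is only indicative.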
The limits of the law
Scaling laws are empirical regularities, not laws of nature, and the Chinchilla result carries its own caveats:
- The optimal ratio was derived under specific assumptions about architecture, data quality, and the loss being measured. Change the data distribution and the frontier moves.
- "Compute-optimal for training" is not "optimal for deployment." Teams routinely train past the Chinchilla point, accepting diminishing training returns in exchange for a smaller, cheaper-to-run model.
- Loss is not capability. Lower pretraining loss correlates with better downstream behaviour, but the mapping is noisy, especially for emergent or reasoning-heavy tasks.
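The first two caveats can be explored with the parametric loss form fitted in Hoffmann et al. (2022), L(N, D) = E + A / N^alpha + B / D^beta. The sketch below assumes the fitted constants as commonly quoted from that paper; treat them as illustrative and check them against the paper before relying on them, since a different dataset or architecture would yield a different fit. The 6 FLOPs per parameter per token factor and the helper names are likewise assumptions.

```python
# Fitted constants as commonly quoted from Hoffmann et al. (2022).
# ASSUMED here for illustration; they are specific to that paper's data and setup.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(params, tokens):
    """Parametric loss estimate: irreducible term + model-size term + data term."""
    return E + A / params**ALPHA + B / tokens**BETA


def tokens_for_budget(flops, params):
    """Tokens affordable at a fixed budget, ASSUMING cost of ~6 FLOPs per param per token."""
    return flops / (6.0 * params)


def sweep(flops, param_grid):
    """For each model size, return (params, tokens, predicted loss) at the given budget."""
    rows = []
    for n in param_grid:
        d = tokens_for_budget(flops, n)
        rows.append((n, d, predicted_loss(n, d)))
    return rows


if __name__ == "__main__":
    budget = 6.0 * 70e9 * 1.4e12  # roughly the Chinchilla budget under the 6*N*D assumption
    for n, d, loss in sweep(budget, [17.5e9, 35e9, 70e9, 140e9, 280e9]):
        print(f"N={n:.3g}  D={d:.3g}  predicted loss={loss:.3f}")
```

Sweeping model size at a fixed budget shows how flat the curve is near its minimum, which is why training a smaller model on more tokens than the compute-optimal point costs relatively little in predicted loss. What it cannot show is the third caveat: the estimate is of pretraining loss only and says nothing direct about downstream capability.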
The quiet revolution
What makes Chinchilla a landmark is not a single number but a shift in posture. It replaced "scale parameters until it works" with "measure the frontier, then spend deliberately." That is the difference between alchemy and engineering, and much of the maturation of large-model training since 2022 has followed from taking the question seriously: how much data is enough?
Sources and further reading
- Kaplan et al. (2020), Scaling Laws for Neural Language Models, arXiv:2001.08361
- Hoffmann et al. (2022), Training Compute-Optimal Large Language Models, arXiv:2203.15556
- Background: Neural scaling law, Wikipedia