Scaling Laws and the Chinchilla Correction
How loss falls predictably with compute, parameters, and data, and why the Chinchilla result showed almost every large model of its era was badly undertrained.
For most of deep learning's history, you could not say in advance what a bigger model would do; you trained it and found out. Scaling laws changed that. They are empirical power-law relationships, measured across many orders of magnitude, that let you predict a model's loss from three inputs: how much compute you spend, how many parameters it has, and how many tokens it sees. Once loss is predictable, the central question of frontier training becomes an optimisation problem: given a fixed compute budget, how should you split it between a bigger model and more data? The field got that answer badly wrong, then corrected it, and the correction reshaped how every lab spends its GPUs.
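To make the trade-off concrete, here is a minimal sketch of how a fixed budget couples model size and data. It assumes the common rule of thumb that training a dense Transformer costs roughly 6 FLOPs per parameter per token (C ≈ 6·N·D). That rule, the function name `tokens_for_budget`, and the budget value are illustrative assumptions, not figures from the text above.

```python
# Assumption: training cost C ≈ 6 * N * D FLOPs for a dense Transformer,
# where N is the parameter count and D the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6


def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Return the number of tokens D affordable for n_params under a fixed budget."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)


budget = 1e21  # hypothetical compute budget in FLOPs
for n_params in (1e8, 1e9, 1e10):
    d = tokens_for_budget(budget, n_params)
    # Every 10x increase in parameters leaves 10x fewer tokens to train on.
    print(f"N = {n_params:.0e} params -> D = {d:.2e} tokens")
```

Under this approximation the budget fixes only the product of N and D, so every extra parameter is paid for in tokens. A scaling law is what tells you which point on that curve gives the lowest loss.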
The Kaplan power laws
Kaplan et al. (2020) showed that the test loss of Transformer language models falls as a power law in model size, dataset size, and compute, with some trends spanning more than seven orders of magnitude. Two findings drove practice. First, scale was smooth and predictable: no magic thresholds, just a straight line on a log-log plot. Second, and more provocatively, larger models were far more sample-efficient, so the compute-optimal move appeared to be training very large models on a relatively modest number of tokens and stopping well before convergence. The takeaway the field absorbed was blunt: spend your compute on parameters.
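The "straight line on a log-log plot" claim can be checked with a few lines of code. The sketch below generates losses from a pure power law in model size, fits a line to the logs, and extrapolates to a larger model. The functional form L(N) = (N_c / N)^α is the standard single-variable power law; the constants `n_c` and `alpha` are placeholders chosen for illustration, and the helper names are hypothetical.

```python
import math


def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss under a pure power law in model size (illustrative constants)."""
    return (n_c / n_params) ** alpha


def fit_log_log(xs, ys):
    """Ordinary least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    return slope, mean_y - slope * mean_x


# "Measure" loss on small models spanning four orders of magnitude.
sizes = [10**k for k in range(6, 11)]
losses = [power_law_loss(n) for n in sizes]

slope, intercept = fit_log_log(sizes, losses)

# Extrapolate the fitted line to a model 10x larger than any in the fit.
predicted = math.exp(intercept + slope * math.log(1e11))
print(f"fitted exponent: {-slope:.3f}")
print(f"predicted loss at 1e11 params: {predicted:.4f}")
```

Because the synthetic data is noise-free, the fit recovers the exponent exactly. With real training runs the points scatter around the line, and the practical value of the method lies in how well small, cheap runs predict the loss of a much larger one.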