Scaling Laws and the Chinchilla Result

The reason anyone spends tens of millions of dollars on a single training run with confidence is that the payoff is predictable. Language-model loss does not improve erratically as you add compute; it falls along a smooth power law that can be measured at small scale and extrapolated to large. Scaling laws are what turned "train a bigger model and hope" into an engineering discipline with a spreadsheet.

The power law

Kaplan et al. (2020) measured how test loss depends on three quantities: model size N (parameters), dataset size D (tokens), and compute C. Across orders of magnitude, loss falls as a power law in each:

L(N) ≈ (N_c / N)^alpha_N

where N_c is a fitted constant and alpha_N is a small positive exponent, and similarly for D and C. The practical content is that plotting loss against compute on log-log axes gives a straight line, so you can fit the line on cheap small runs and read off what a run 1000x larger will cost and achieve. This predictability is why frontier labs run scaling-law sweeps before committing to a large model; the big run is an extrapolation, not a gamble.
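The straight-line fit can be sketched in a few lines. This is a minimal illustration, not anyone's production sweep: the compute budgets and losses below are synthetic, generated from a made-up power law, and the names `compute`, `loss`, and `predict_loss` are hypothetical. Real sweeps are noisy and may need extra terms in the fit.

```python
import numpy as np

# Hypothetical small-scale sweep. The losses are synthetic, generated from an
# invented power law L = 10 * C^(-0.05) purely for illustration.
compute = np.array([1e17, 1e18, 1e19, 1e20])  # training FLOPs per run
loss = 10.0 * compute ** -0.05

# A power law L = a * C^(-b) is a straight line in log-log space:
#   log L = log a - b * log C
# so an ordinary degree-1 least-squares fit recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)

def predict_loss(c):
    """Extrapolate the fitted line to a compute budget c (in FLOPs)."""
    return float(np.exp(intercept + slope * np.log(c)))

# Read off the predicted loss for a run 1000x larger than the biggest one measured.
target = 1000 * compute[-1]
predicted = predict_loss(target)
print(f"fitted exponent: {-slope:.3f}, predicted loss at {target:.0e} FLOPs: {predicted:.3f}")
```

The fit happens in log space because that is where the relationship is linear; fitting the raw values would let the largest losses dominate the error.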

The Chinchilla correction

Kaplan's work suggested that when you get more compute, most of it should go into a bigger model and comparatively little into more data. The field followed that advice, and GPT-3-era models were enormous and, in hindsight, badly under-trained. Hoffmann et al. (2022), the Chinchilla paper, redid the analysis more carefully and reached a different optimum: for a fixed compute budget, model size and training tokens should scale in roughly equal proportion. Their rule of thumb, about 20 training tokens per parameter, implied that most large models of the day were too big for the data they saw.

The demonstration was pointed. Chinchilla had 70 billion parameters, a quarter of Gopher's 280 billion, but was trained on more than four times as much data (about 1.4 trillion tokens against Gopher's 300 billion) for roughly the same compute, and it outperformed Gopher across the board. Same budget, smaller model, more tokens, better result. The lesson reshaped model design overnight: the race stopped being purely about parameter count and started being about data scale and quality.
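The arithmetic behind that comparison can be checked with a rough sketch. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6ND), which is not stated in the text above, and uses token counts of about 300 billion for Gopher and 1.4 trillion for Chinchilla as reported in the published papers. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # the rule of thumb quoted in the text

def train_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(budget_flops):
    """Split a budget so that D = 20 * N, i.e. solve C = 6 * N * (20 * N) for N."""
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

gopher = train_flops(280e9, 300e9)      # 280B parameters, ~300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # 70B parameters, ~1.4T tokens
n_opt, d_opt = compute_optimal(chinchilla)

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"20 tokens/param split of Chinchilla's budget: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under these assumptions the two budgets land within about 15% of each other, and applying the 20 tokens per parameter rule to Chinchilla's budget gives back its own shape: roughly 70 billion parameters and 1.4 trillion tokens.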

Why compute-optimal is not deployment-optimal

Chinchilla answers "what is the best model for a fixed training budget." That is often the wrong question. If a model will serve billions of inference requests, you may deliberately overtrain a smaller model past its Chinchilla-optimal point: it costs more to train per unit of quality, but a smaller model is cheaper on every one of those billions of forward passes. This is exactly the reasoning behind the Llama models, which are trained on far more than 20 tokens per parameter because inference cost, not training cost, dominates their lifetime economics (see knowledge-distillation for the related "small model, more effort" logic).
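A back-of-the-envelope version of that trade-off, under loudly stated assumptions: training costs about 6 FLOPs per parameter per token, inference about 2 FLOPs per parameter per token served, and the two configurations below reach comparable quality. That last point is assumed purely for illustration and is not a measured result. Both configurations and all names are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training plus inference compute, using the assumed 6ND and 2N-per-token rules."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

def break_even_tokens(big, small):
    """Served-token count at which the smaller model's extra training cost is repaid."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * n_small * d_small - 6 * n_big * d_big
    saving_per_token = 2 * n_big - 2 * n_small
    return extra_training / saving_per_token

# Hypothetical configurations as (parameters, training tokens).
big = (70e9, 1.4e12)   # 20 tokens per parameter
small = (30e9, 6e12)   # 200 tokens per parameter, deliberately overtrained

threshold = break_even_tokens(big, small)
print(f"smaller model wins after about {threshold:.2e} served tokens")

for served in (0.0, 1e13, 1e14):
    b = lifetime_flops(*big, served)
    s = lifetime_flops(*small, served)
    print(f"served={served:.0e}  big={b:.2e}  small={s:.2e}")
```

With these made-up numbers the overtrained model costs nearly twice as much to train, but its lower per-token inference cost repays the difference after roughly six trillion served tokens, and every token beyond that is a saving.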
