Data Mixtures and Domain Weighting

Domain weighting determines how much of each data source a model sees during pretraining, and getting this wrong can cost tens of thousands of GPU-hours or silently cripple downstream task performance.

GPT-3's pretraining corpus was roughly 60% Common Crawl, 22% WebText2, 8% Books1, 8% Books2, and 3% Wikipedia (the rounded figures sum to 101%). Those percentages were set by hand, informed by intuition and ablation runs on smaller models. For years, this "best guess with post-hoc ablations" approach was the industry norm, even as training runs crossed tens of thousands of GPU-hours. The question of how much web text versus code versus curated books a model should see turns out to be non-trivial: the mixture shapes vocabulary breadth, factual accuracy, reasoning ability, and toxicity, and those effects interact with model scale in surprising ways.

What a data mixture actually is

Before training begins, a pretraining corpus is partitioned into labelled domains. A domain can be as coarse as "web text" or as fine-grained as "StackOverflow Python answers". Each domain \(d_i\) is assigned a weight \(w_i \geq 0\) with \(\sum_i w_i = 1\). During training, at each step the dataloader samples a domain according to those weights, then draws a random batch from that domain's shard.
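
The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular framework's dataloader: the domain names, the raw weights, the toy `shards` dictionary, and the helper names `normalize` and `next_batch` are all assumptions made up for the example, and batches are drawn with replacement for simplicity.

```python
import random
from collections import Counter


def normalize(raw_weights):
    """Rescale non-negative raw weights so they sum to 1 (w_i >= 0, sum_i w_i = 1)."""
    if any(w < 0 for w in raw_weights.values()):
        raise ValueError("domain weights must be non-negative")
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one domain weight must be positive")
    return {domain: w / total for domain, w in raw_weights.items()}


def next_batch(weights, shards, batch_size, rng):
    """Pick one domain according to its weight, then draw a random batch from its shard."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    domain = rng.choices(domains, weights=probs, k=1)[0]
    # Sampling with replacement keeps the sketch short; real loaders usually
    # iterate over shuffled shards instead.
    batch = rng.choices(shards[domain], k=batch_size)
    return domain, batch


# Hypothetical mixture and toy shards, purely for illustration.
mixture = normalize({"web": 6.0, "code": 3.0, "books": 1.0})
shards = {
    "web": [f"web_doc_{i}" for i in range(100)],
    "code": [f"code_doc_{i}" for i in range(100)],
    "books": [f"book_doc_{i}" for i in range(100)],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
num_steps = 10_000
counts = Counter()
for _ in range(num_steps):
    domain, batch = next_batch(mixture, shards, batch_size=4, rng=rng)
    counts[domain] += 1

# Empirical share of steps spent on each domain; it should track the weights.
observed = {d: counts[d] / num_steps for d in mixture}
print(observed)
```

Over many steps the fraction of batches drawn from each domain approaches its weight \(w_i\), which is what makes the weights a direct knob on how much of each source the model sees.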
