Data Mixtures and Domain Weighting
Domain weighting determines how much of each data source a model sees during pretraining, and getting this wrong can cost tens of thousands of GPU-hours or silently cripple downstream task performance.
GPT-3's pretraining mix was roughly 60% Common Crawl, 22% WebText2, 8% Books1, 8% Books2, and 3% Wikipedia by sampling weight (the rounded figures sum to 101%). Those percentages were set by hand, informed by intuition and ablation runs on smaller models. For years, this "best guess with post-hoc ablations" approach was the industry norm, even as training runs crossed tens of thousands of GPU-hours. The question of how much web text versus code versus curated books a model should see turns out to be non-trivial: it shapes vocabulary breadth, factual accuracy, reasoning ability, and toxicity, and these effects interact with model scale in surprising ways.
What a data mixture actually is
Before training begins, a pretraining corpus is partitioned into labelled domains. A domain can be as coarse as "web text" or as fine-grained as "StackOverflow Python answers". Each domain \(d_i\) is assigned a weight \(w_i \geq 0\) with \(\sum_i w_i = 1\). During training, at each step the dataloader samples a domain according to those weights, then draws a random batch from that domain's shard.
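The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular framework's dataloader: the names `normalize_weights`, `iter_batches`, and the toy `shards` are hypothetical, the raw weights reuse the GPT-3 percentages quoted earlier, and documents are drawn with replacement for simplicity.

```python
import random
from collections import Counter

# Raw percentages quoted in the text for GPT-3. They sum to 101 after
# rounding, so they are normalized before use.
raw_weights = {
    "common_crawl": 60,
    "webtext2": 22,
    "books1": 8,
    "books2": 8,
    "wikipedia": 3,
}


def normalize_weights(raw):
    """Scale non-negative raw weights so that they sum to 1."""
    if any(w < 0 for w in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {domain: w / total for domain, w in raw.items()}


def iter_batches(shards, weights, num_steps, batch_size, seed=0):
    """Yield (domain, batch) pairs: pick a domain by weight, then a batch from it."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    for _ in range(num_steps):
        # Step 1: sample one domain according to the mixture weights.
        domain = rng.choices(domains, weights=probs, k=1)[0]
        # Step 2: draw a random batch from that domain's shard
        # (with replacement here, which is a simplifying assumption).
        batch = rng.choices(shards[domain], k=batch_size)
        yield domain, batch


weights = normalize_weights(raw_weights)

# Hypothetical toy shards: 100 placeholder document IDs per domain.
shards = {d: [f"{d}/doc{i}" for i in range(100)] for d in weights}

batches = list(iter_batches(shards, weights, num_steps=10_000, batch_size=4))
counts = Counter(domain for domain, _ in batches)
empirical = {d: counts[d] / len(batches) for d in weights}
```

Over enough steps, the values in `empirical` should approach those in `weights`. Note that the weights control how often a domain is visited, not how large its shard is, so a small, heavily weighted domain will be repeated many times over a long run.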