Data Mixtures and Domain Weighting

GPT-3's pretraining corpus was roughly 60% Common Crawl, 22% WebText2, 8% Books1, 8% Books2, and 3% Wikipedia. Those percentages were set by hand, informed by intuition and ablation runs on smaller models. For years, this "best guess with post-hoc ablations" approach was the industry norm, even as training runs crossed tens of thousands of GPU-hours. The question of how much web text versus code versus curated books a model should see turns out to be non-trivial: the mixture shapes vocabulary breadth, factual accuracy, reasoning ability, and toxicity, and those effects interact with model scale in surprising ways.

What a data mixture actually is

Before training begins, a pretraining corpus is partitioned into labelled domains. A domain can be as coarse as "web text" or as fine-grained as "StackOverflow Python answers". Each domain \(d_i\) is assigned a weight \(w_i \geq 0\) with \(\sum_i w_i = 1\). During training, at each step the dataloader samples a domain according to those weights, then draws a random batch from that domain's shard.

The simplest weighting scheme is proportional: set \(w_i = |D_i| / \sum_j |D_j|\) where \(|D_i|\) is the token count of domain \(i\). This is how you would train if you simply concatenated everything and shuffled. Proportional weighting means a 10-trillion-token web crawl naturally overwhelms a 50-billion-token books corpus, which might be fine or catastrophic depending on what you want the model to do.
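As a minimal sketch of proportional weighting, the snippet below computes \(w_i\) from per-domain token counts. The helper name `proportional_weights` is hypothetical and not taken from any particular framework; the token counts are illustrative values that mirror the 10-trillion versus 50-billion example.

```python
def proportional_weights(token_counts):
    """Return w_i = |D_i| / sum_j |D_j| for each domain."""
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    return {name: count / total for name, count in token_counts.items()}


# Illustrative token counts: a 10T-token web crawl and a 50B-token books corpus.
token_counts = {"web": 10_000_000_000_000, "books": 50_000_000_000}
weights = proportional_weights(token_counts)

for name, w in weights.items():
    print(f"{name}: {w:.4f}")
```

With these numbers, books receive roughly 0.5% of the mixture, which is what "overwhelms" means in practice.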

Oversampling is common for high-quality but small sources. If books are 0.5% of raw tokens but you want the model to reason carefully, you might upsample them to 5-10% of training steps, so each book token is seen 10-20 times as often as it would be under proportional weighting (and 10-20 times in absolute terms if the training budget matches the raw corpus size). The Chinchilla scaling laws (Hoffmann et al., 2022) showed that for a fixed compute budget, the optimal model trains on far more tokens than earlier practice suggested, around 20 tokens per parameter. Once you adopt that budget, you often have to repeat high-quality data while using web text only once.
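The upsampling arithmetic can be checked with a short sketch. The expected number of passes over a domain is its weight times the total training budget, divided by the domain's token count. The helper `effective_epochs`, the corpus sizes, and the assumption that the training budget equals the raw corpus size are all illustrative choices, not figures from any real training run.

```python
def effective_epochs(weights, token_counts, total_training_tokens):
    """Expected passes over each domain: w_i * T / |D_i|."""
    return {
        name: weights[name] * total_training_tokens / token_counts[name]
        for name in weights
    }


# Illustrative corpus: books are 0.5% of 1T raw tokens.
token_counts = {"web": 995_000_000_000, "books": 5_000_000_000}
# Upsample books from 0.5% to 5% of the mixture.
weights = {"web": 0.95, "books": 0.05}
# Assumption: the training budget equals the raw corpus size (1T tokens).
total_training_tokens = 1_000_000_000_000

epochs = effective_epochs(weights, token_counts, total_training_tokens)
for name, e in epochs.items():
    print(f"{name}: {e:.2f} epochs")
```

Under these assumptions books are repeated about ten times while web text is seen slightly less than once. A larger training budget would raise both numbers proportionally.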

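# Simplified sampling logic, similar in spirit to what many training frameworks do
import random

domain_weights = {"web": 0.67, "code": 0.15, "books": 0.10, "wiki": 0.08}

def sample_batch(domain_weights, domain_shards):
    # Pick one domain for this step, with probability equal to its weight.
    domain = random.choices(
        list(domain_weights.keys()),
        weights=list(domain_weights.values()),
        k=1,
    )[0]
    # Each shard iterator is pre-shuffled and is assumed to yield whole batches,
    # so the batch size is fixed when the iterators are built.
    return next(domain_shards[domain])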

The weights are a hyperparameter. Like learning rate, they interact with scale in ways that make small-model experiments imperfect proxies for large-model behaviour.

Why uniform or proportional weighting is often wrong

Proportional weighting effectively minimises average loss across the corpus, and that average is dominated by the largest domain. If web text is 80% of tokens and Wikipedia is 1%, the model can largely neglect Wikipedia and still post excellent average loss numbers. The result can be a model that writes fluently but hallucinates basic facts.
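A small worked example makes the point concrete. The sketch below compares two hypothetical models that differ only in their Wikipedia loss, using the 80% and 1% shares from the paragraph above. The per-domain loss values, the "other" domain, and the helper `mixture_loss` are made up for illustration.

```python
def mixture_loss(weights, domain_losses):
    """Token-weighted average loss: sum_i w_i * loss_i."""
    return sum(weights[name] * domain_losses[name] for name in weights)


# Shares from the text: web 80%, Wikipedia 1%; the remaining 19% is lumped as "other".
weights = {"web": 0.80, "wiki": 0.01, "other": 0.19}

# Made-up per-domain losses for two hypothetical models.
good_on_wiki = {"web": 2.50, "wiki": 2.00, "other": 2.50}
bad_on_wiki = {"web": 2.50, "wiki": 4.00, "other": 2.50}

gap = mixture_loss(weights, bad_on_wiki) - mixture_loss(weights, good_on_wiki)
print(f"average-loss gap: {gap:.3f}")
```

Doubling the Wikipedia loss moves the average by only about 0.02, a difference that is easy to miss on a training curve.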

More formally, suppose you care about a downstream task whose data distribution is concentrated in domain \(d_k\). Proportional weighting allocates compute to \(d_k\) in proportion to \(|D_k| / \sum_j |D_j|\). If \(d_k\) is small, its share of training stays small however large the total token budget is, and generalisation to that task suffers.
