Curriculum and Data Ordering

How the sequence and mixture proportions of training batches affect what an LLM learns and when, and why naive i.i.d. sampling often leaves capability on the table.

Imagine training on a trillion tokens and spending the first hundred billion entirely on web-crawl noise before the model ever sees a line of code or a mathematical proof. The model will learn general English fluency, but it will also bake in a strong prior toward informal prose that takes far longer to override once specialised data arrives. This is the core intuition behind curriculum and data ordering: the order and rate at which a model sees different data distributions shapes both final capability and the trajectory of learning.

Why Ordering Matters at All

Neural language models are not tabula rasa at each gradient step. They carry an accumulated weight state, and every batch nudges that state toward lower loss on that batch's distribution. When the same total tokens are presented in a different order, the gradient path through parameter space differs, and the run can settle in a different region of parameter space with different capabilities.

Three mechanisms make ordering non-trivial:

  1. Gradient interference. Batches from very different domains (e.g., Python code versus medieval Latin manuscripts) can produce conflicting gradient directions. Interleaving them forces the optimiser to find a compromise that generalises; pure sequential training on domain A then domain B risks overwriting A-specific circuits.

  2. Learning rate coupling. Most large-scale runs use a cosine schedule or a trapezoidal schedule with a steep decay at the end. Updates made during the high-learning-rate phase are larger, so data seen then moves the weights further per token than data seen late in the decay. If code is introduced only during cooldown, the model has far fewer tokens and much smaller steps with which to internalise it than if code were present throughout.

  3. Chinchilla-era token budgets. Hoffmann et al. (2022) showed that compute-optimal runs scale model size and token count together. This means a fixed token budget is genuinely scarce, and the fraction of that budget allocated to each domain is a first-class hyperparameter, not an afterthought.
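The learning-rate coupling in point 2 can be made concrete with a little arithmetic. The sketch below assumes a plain cosine decay with no warmup and a hypothetical peak learning rate, and uses the summed learning rate over a window as a crude proxy for how far the weights can move there. That proxy is an illustration only; it is not a measure of how much the data in that window ends up mattering.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr (warmup omitted for brevity)."""
    progress = step / max(1, total_steps - 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def lr_share(start_frac: float, end_frac: float, total_steps: int = 10_000) -> float:
    """Fraction of the summed learning rate that falls in [start_frac, end_frac) of training."""
    lrs = [cosine_lr(s, total_steps) for s in range(total_steps)]
    lo, hi = int(start_frac * total_steps), int(end_frac * total_steps)
    return sum(lrs[lo:hi]) / sum(lrs)

first_10 = lr_share(0.0, 0.1)  # share of total step size spent in the first 10% of steps
last_10 = lr_share(0.9, 1.0)   # share spent in the final 10% of steps
print(f"first 10%: {first_10:.4f}, last 10%: {last_10:.4f}")
```

Under these assumptions the first tenth of training accounts for roughly a fifth of the summed learning rate, while the last tenth accounts for well under one percent, which is why the same tokens placed early or late are not interchangeable.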

The Four Levers of Curriculum Design

1. Domain Mixture Weights

The simplest form of curriculum is a static mixture: sample from each data source according to a fixed probability. The original LLaMA (Touvron et al., 2023), for example, drew from web text (CommonCrawl and C4), code (GitHub), books, Wikipedia, arXiv papers, and StackExchange, each with a published sampling proportion. Such weights are usually set by a combination of ablation experiments and intuition about downstream task importance.

A rough framing:

Source | Typical weight range | Primary skill it confers
Web crawl (filtered) | 40-70% | General language, breadth
Code repositories | 5-20% | Structured reasoning, syntax
Scientific/technical | 5-15% | Precise vocabulary, notation
Books / long-form | 5-15% | Long-range coherence
Curated high-quality | 5-10% | Writing quality signal

These are not universal; numbers vary widely across labs and depend on the model's intended use.
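A static mixture amounts to drawing the source of each document from a fixed categorical distribution. The sketch below uses only the Python standard library; the source names and the specific weights are hypothetical values picked from inside the ranges above, not any lab's recipe.

```python
import random
from collections import Counter

# Hypothetical weights chosen from within the ranges in the table; they sum to 1.
MIXTURE = {"web": 0.60, "code": 0.15, "science": 0.10, "books": 0.10, "curated": 0.05}

def sample_sources(mixture: dict, n: int, seed: int = 0) -> list:
    """Draw n source names i.i.d. according to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(MIXTURE, 100_000)
counts = Counter(draws)
# Empirical shares should sit close to the configured weights.
empirical = {name: counts[name] / len(draws) for name in MIXTURE}
print(empirical)
```

In a real pipeline each drawn name would select which dataset iterator supplies the next document; the sampling logic itself stays this simple.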

2. Dynamic Reweighting Over Time

Static weights are a strong baseline but are often sub-optimal. A common improvement is to shift mixture weights during training: start with broader web data, then increase the fraction of high-quality and specialised sources toward the end of training. This mirrors the intuition from human learning: build vocabulary and world knowledge first, then deepen into specialised domains.

The Llama 3 technical report (Dubey et al., 2024) describes a multi-stage approach where certain capability-specific data is upsampled during later training phases, though precise schedules are not always released publicly.
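One simple way to express a shifting mixture is to hold the starting weights for most of the run and then interpolate linearly toward a target mixture. This is a sketch under assumed numbers: the 80% shift point, the source names, and both weight sets are hypothetical and do not reproduce any published schedule.

```python
def mixture_at(step: int, total_steps: int, start: dict, end: dict, shift_begin: float = 0.8) -> dict:
    """Hold `start` until shift_begin * total_steps, then interpolate linearly to `end`."""
    progress = step / total_steps
    if progress <= shift_begin:
        t = 0.0
    else:
        t = min(1.0, (progress - shift_begin) / (1.0 - shift_begin))
    mixed = {k: (1 - t) * start[k] + t * end[k] for k in start}
    total = sum(mixed.values())  # renormalise to guard against rounding drift
    return {k: v / total for k, v in mixed.items()}

# Hypothetical start and end mixtures over the same set of sources.
START = {"web": 0.70, "code": 0.10, "science": 0.10, "books": 0.10}
END = {"web": 0.40, "code": 0.25, "science": 0.20, "books": 0.15}

early = mixture_at(1_000, 100_000, START, END)
midway = mixture_at(90_000, 100_000, START, END)
late = mixture_at(100_000, 100_000, START, END)
```

A smooth ramp avoids an abrupt distribution change at a phase boundary; whether smooth or stepwise transitions work better is an empirical question for a given run.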

A learned alternative to hand-set weights is DoReMi (Xie et al., 2023). It trains a small reference model, then trains a small "proxy" model with a distributionally robust objective that upweights the domains where the proxy's loss most exceeds the reference's (its excess loss). The weights change throughout proxy training, and their average is then used as the mixture for the full-size run. This converts domain weighting from a manual decision to an optimisation problem.
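The core of a DoReMi-style update is a multiplicative (exponentiated-gradient) step in which domains where the proxy's loss exceeds the reference's gain weight. The sketch below is a simplified illustration of that idea rather than the paper's implementation: the function name, step size, smoothing constant, and loss values are all assumptions, and the real method works on per-token losses during proxy training.

```python
import math

def update_domain_weights(weights: dict, proxy_loss: dict, reference_loss: dict,
                          step_size: float = 1.0, smoothing: float = 1e-3) -> dict:
    """One exponentiated-gradient step on domain weights, driven by clipped excess loss."""
    # Excess loss: how much worse the proxy is than the reference, floored at zero.
    excess = {d: max(proxy_loss[d] - reference_loss[d], 0.0) for d in weights}
    unnorm = {d: weights[d] * math.exp(step_size * excess[d]) for d in weights}
    total = sum(unnorm.values())
    uniform = 1.0 / len(weights)
    # Mix with the uniform distribution so no domain's weight collapses to zero.
    return {d: (1 - smoothing) * unnorm[d] / total + smoothing * uniform for d in weights}

# Hypothetical per-domain losses for illustration only.
weights = {"web": 1 / 3, "code": 1 / 3, "books": 1 / 3}
proxy = {"web": 2.9, "code": 1.6, "books": 3.1}
reference = {"web": 2.8, "code": 1.0, "books": 3.2}
new_weights = update_domain_weights(weights, proxy, reference)
```

Here "code" has the largest excess loss and so receives the largest weight after the step, while "books", where the proxy already beats the reference, is downweighted.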

3. Difficulty Ordering (Classical Curriculum Learning)

The classical curriculum learning hypothesis (Bengio et al., 2009) holds that starting on easier examples and progressing to harder ones accelerates convergence and improves generalisation. For language models, "difficulty" is ill-defined, but proxies include:

  • perplexity under a smaller reference model (high perplexity = harder)
  • token count or syntactic complexity
  • deduplication score (near-duplicates are "easy" in the sense of adding no new information)

In practice, large LLM runs rarely use strict difficulty schedules; the benefit is modest at scale compared to mixture design. But removing the easiest redundant data (deduplication) reliably helps, partly because it is a mild form of anti-curriculum: forcing the model to see diverse examples rather than repeating the trivial.
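To illustrate the perplexity proxy, the sketch below scores documents under a tiny add-one-smoothed unigram model standing in for the "smaller reference model", then sorts them from easy to hard. The unigram model, the whitespace tokenisation, and the toy corpus are assumptions chosen to keep the example dependency-free; a real pipeline would use a small trained language model.

```python
import math
from collections import Counter

def fit_unigram(corpus: list):
    """Return a token -> probability function with add-one smoothing."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def perplexity(doc: str, prob) -> float:
    """exp of the mean negative log-probability per token."""
    toks = doc.split()
    nll = -sum(math.log(prob(t)) for t in toks) / len(toks)
    return math.exp(nll)

reference_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
prob = fit_unigram(reference_corpus)

docs = [
    "quantum chromodynamics confines quarks",
    "the cat sat on the rug",
    "the dog chased the cat",
]
# Low perplexity first: a classical easy-to-hard curriculum ordering.
easy_to_hard = sorted(docs, key=lambda d: perplexity(d, prob))
```

The same scores can be used the other way round, for example to drop the lowest-perplexity tail as likely redundant rather than to order the data.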

4. The Cooldown Phase

Nearly all modern LLM training runs end with a cooldown: the learning rate is annealed to near zero over a final slice of tokens, often 5-10% of the total budget. This phase is disproportionately powerful for its token count because the learning rate is small and each update is conservative, so the model stops making large moves and consolidates what it has already learned.
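A trapezoidal schedule with a linear cooldown can be written in a few lines. This is a minimal sketch: the peak learning rate, warmup fraction, and 10% cooldown fraction are hypothetical defaults, and real runs may use other decay shapes (for example cosine or square-root) in the final segment.

```python
def trapezoidal_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
                   warmup_frac: float = 0.01, cooldown_frac: float = 0.10) -> float:
    """Linear warmup, constant plateau, then linear decay to zero at total_steps."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    cooldown_start = total_steps - int(cooldown_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        return peak_lr
    # Cooldown: scale by the fraction of cooldown steps still remaining.
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - cooldown_start)

plateau = trapezoidal_lr(50_000, 100_000)       # inside the constant phase
half_cooled = trapezoidal_lr(95_000, 100_000)   # halfway through the cooldown
final = trapezoidal_lr(100_000, 100_000)        # end of training
```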

A common engineering trick is to upweight the highest-quality data during cooldown: curated books, top-rated scientific text, carefully filtered instruction-following examples. The FineWeb-Edu dataset (Penedo et al., 2024), for instance, is a 1.3-trillion-token corpus of educational web text selected from FineWeb with a quality classifier; it is the kind of knowledge-dense data one might want a model to consolidate during this phase.

The pseudo-code for a simple staged training loop looks like:

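# Assumes model, optimizer, scheduler and dataloader already exist, and that
# the dataloader switches mixtures at the phase boundaries noted below.
for step, batch in enumerate(dataloader):
    # Phase 1: steps 0..0.8*T  - broad web corpus, standard mix
    # Phase 2: steps 0.8*T..0.95*T - upweight code, science, books
    # Cooldown: steps 0.95*T..T  - high-quality only, LR -> 0

    optimizer.zero_grad()      # clear gradients left over from the previous step
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()           # cosine or trapezoidal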

The implementation is straightforward; the hard part is deciding the weights and boundaries.
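The phase boundaries in the loop's comments can be encoded as a small lookup that the data loader consults at each step. The boundaries (0.8 and 0.95 of the run) come from the comments above; the phase names and the mixtures attached to them are hypothetical placeholders.

```python
# (upper bound as a fraction of training, phase name, hypothetical mixture)
PHASES = [
    (0.80, "broad", {"web": 0.70, "code": 0.10, "science": 0.10, "books": 0.10}),
    (0.95, "specialised", {"web": 0.40, "code": 0.25, "science": 0.20, "books": 0.15}),
    (1.00, "cooldown", {"curated": 1.0}),
]

def phase_for_step(step: int, total_steps: int):
    """Return (phase name, mixture) for the given step."""
    progress = step / total_steps
    for upper, name, mixture in PHASES:
        if progress < upper:
            return name, mixture
    # progress >= 1.0: stay in the final phase.
    return PHASES[-1][1], PHASES[-1][2]

name, mixture = phase_for_step(900, 1_000)
```

Keeping the schedule as data rather than as branches in the loop makes it easy to log, ablate, and reproduce.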

Sequence-Level Packing and Its Hidden Ordering Effect

When batches are built, individual documents are concatenated into fixed-length sequences with a special separator token between them. The order in which documents are packed matters subtly: if semantically related documents land in the same packed sequence and attention is not masked at document boundaries, the model attends across related content, which can help or hurt depending on whether those relationships are genuine.

Most pipelines shuffle documents globally before packing, which approximates i.i.d. sampling. However, some runs deliberately co-pack related documents (e.g., the same GitHub repository's files together) to reinforce structural coherence. Whether this helps at scale is an active area of study.
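The shuffle-then-pack default can be sketched as follows. This assumes documents are already tokenised into lists of integer ids, uses a hypothetical separator id of 0, and drops the trailing partial sequence; real pipelines differ in how they handle remainders and cross-document attention masks.

```python
import random

def pack_documents(docs: list, seq_len: int, sep_id: int, seed: int = 0) -> list:
    """Shuffle tokenised docs, join them with a separator, and cut into fixed-length sequences."""
    rng = random.Random(seed)
    order = list(docs)
    rng.shuffle(order)  # global shuffle approximates i.i.d. document order
    stream = []
    for doc in order:
        stream.extend(doc)
        stream.append(sep_id)
    n_full = len(stream) // seq_len  # trailing partial sequence is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_documents(docs, seq_len=4, sep_id=0)
```

Co-packing related documents would replace the global shuffle with a grouped one, for example shuffling repositories and keeping each repository's files adjacent.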

When It Falls Down

1. The distribution shift trap. If the cooldown high-quality corpus is too different from the pre-cooldown distribution, the model can show degradation on the originally learned tasks. This is essentially catastrophic forgetting in miniature. Replaying a small fraction of the original distribution during cooldown is a mitigation, but it adds complexity.

2. Over-engineering small experiments. Curriculum tricks found to help on 7B models sometimes do not transfer to 70B+ models because the larger model is more data-efficient and can learn from noisier signals without curriculum scaffolding. Lab ablations at small scale can be misleading guides for production runs.

3. Evaluation leakage and Goodhart's law. If you tune mixture weights against benchmark scores (MMLU, ARC, etc.) rather than held-out log-perplexity, you risk optimising for the benchmark rather than general capability. The model learns the shape of the evaluation distribution, not a generalisable skill. Decontamination of benchmarks from training data is a separate but related concern.

4. Conflicting objectives in code models. Models trained primarily on code with a hard curriculum (code early and often) sometimes show weaker general language calibration, producing technically precise but oddly terse or brittle prose. The mixture needs to be genuinely mixed, not code-first then language, unless the intended use case is exclusively code.

5. Unknown interactions with tokeniser distribution. The tokeniser was trained on a particular corpus, and its fertility (tokens per word) varies across languages and domains. A data mixture that looks balanced in bytes may be very unbalanced in tokens. When composing multilingual curricula in particular, balance the mixture in tokens rather than in bytes.
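The tokeniser point in item 5 is easy to check with arithmetic. The sketch below converts byte shares into token shares and back, given per-source tokens-per-byte rates; the source names and the rates are made-up illustrative numbers, and in practice they would be measured by tokenising a sample of each source.

```python
def token_shares(byte_shares: dict, tokens_per_byte: dict) -> dict:
    """Token-level mixture implied by a byte-level mixture."""
    raw = {k: byte_shares[k] * tokens_per_byte[k] for k in byte_shares}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

def byte_shares_for_token_targets(token_targets: dict, tokens_per_byte: dict) -> dict:
    """Byte-level mixture needed to hit a desired token-level mixture."""
    raw = {k: token_targets[k] / tokens_per_byte[k] for k in token_targets}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# Hypothetical rates: the second source is tokenised twice as densely per byte.
tokens_per_byte = {"english": 0.25, "other_language": 0.50}

implied = token_shares({"english": 0.5, "other_language": 0.5}, tokens_per_byte)
needed = byte_shares_for_token_targets({"english": 0.5, "other_language": 0.5}, tokens_per_byte)
```

With these assumed rates, an even split in bytes gives the second source about two thirds of the tokens, and an even split in tokens requires giving it only about one third of the bytes.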
