Mixing Synthetic and Human Data

When Meta trained Llama 2-Chat, 27,540 of the supervised fine-tuning examples were human-annotated; later fine-tuning rounds drew on model-generated completions selected by rejection sampling against a reward model. The model that shipped to the public was trained on neither human data alone nor synthetic data alone. That choice (how much of each, in what ratio, at what stage) reflects a genuine engineering tension that every serious alignment or fine-tuning effort must confront.

Why neither source alone is enough

Human-annotated data carries irreplaceable signal: annotators catch subtle tone failures, cultural nuance, and edge-case reasoning that a language model cannot reliably generate for itself. The problem is cost and coverage. Hiring skilled annotators to cover every subdomain, every language, every safety edge case is economically infeasible at the scale post-training now requires. A 70B model's instruction fine-tuning corpus may need hundreds of thousands of diverse examples.

Synthetic data, generated by a capable teacher model (or by the student itself via self-improvement), solves the coverage problem cheaply. Self-Instruct (Wang et al., 2022) showed that a model can bootstrap tens of thousands of instruction-following examples from 175 human-written seed tasks. Stanford's Alpaca applied the same recipe with text-davinci-003 as the generator, producing 52,000 examples for under $500 in API costs, and the resulting 7B model performed comparably to text-davinci-003 in the authors' own blind pairwise evaluation. The catch is threefold: synthetic data inherits the teacher's blind spots, its output distribution skews toward patterns the generator finds easy, and it cannot introduce knowledge the teacher did not already possess.

The practical resolution is a portfolio: use human data for critical, difficult-to-automate signal, and synthetic data to fill breadth and volume. The question is how to allocate between them.

Three modes of mixing

Parallel mixing is the simplest approach. You assemble a dataset with a fixed human-to-synthetic ratio and train on the mixture. At one extreme, Stanford's Alpaca trained on 52k synthetic demonstrations whose only human input was the 175 seed tasks used to prompt the generator. Constitutional AI (Bai et al., 2022) likewise has a supervised phase in which the harmlessness fine-tuning targets are AI-generated revisions; humans write the principles that guide the revisions rather than filtering the outputs. In practice, even a small proportion of carefully selected human examples can anchor the distribution. LIMA (Zhou et al., 2023, arXiv:2305.11206) reported that a 65B base model fine-tuned on only 1,000 curated examples compared favourably in human evaluation with models trained on far larger synthetic instruction sets, because the examples were quality-selected rather than volume-maximised.
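A minimal sketch of parallel mixing follows, assuming both sources are already in memory as lists of examples. The function name `mix_parallel`, the `human_fraction` parameter, and the toy record format are illustrative assumptions, not taken from any of the cited papers.

```python
import random

def mix_parallel(human, synthetic, human_fraction, total, seed=0):
    """Sample a training set of size `total` with a fixed human share.

    Sampling is without replacement, so the requested ratio must be
    achievable with the examples available in each pool.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be in [0, 1]")
    n_human = round(total * human_fraction)
    n_synth = total - n_human
    if n_human > len(human) or n_synth > len(synthetic):
        raise ValueError("not enough examples to hit the requested ratio")
    rng = random.Random(seed)
    mixed = rng.sample(human, n_human) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)  # interleave so every batch sees both sources
    return mixed

# Toy pools standing in for real datasets (hypothetical record format).
human = [{"source": "human", "id": i} for i in range(1_000)]
synthetic = [{"source": "synthetic", "id": i} for i in range(52_000)]

train = mix_parallel(human, synthetic, human_fraction=0.1, total=5_000)
n_human_in_train = sum(ex["source"] == "human" for ex in train)
```

Fixing the ratio at sampling time, rather than concatenating whatever data exists, makes the human share an explicit hyperparameter that can be swept.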

Stage-wise mixing separates human and synthetic data across training phases. A common pattern:

1. Pretrain or continue-pretrain on a large corpus (mostly web text, some human-curated books).
2. Supervised fine-tuning (SFT) on human demonstrations for core behaviour.
3. Reward model training on human preferences.
4. Reinforcement learning from AI feedback (RLAIF), where a language model acts as a preference judge, generating synthetic preference labels at scale.
5. Optional rejection-sampling fine-tuning (RFT) on high-reward model completions.

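Llama 2 followed the human-seeded part of this pattern: its reward models were trained on human preference comparisons, and the synthetic component consisted of model-generated completions scored by those reward models (rejection sampling, then PPO) rather than AI-written preference labels. Either way, this is not naive sequential replacement: the human-trained reward model provides the evaluation standard that constrains what the later synthetic stages optimise toward.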

Iterative self-improvement loops the model through generate-filter-train cycles. The Self-Rewarding Language Models paper (Yuan et al., 2024, arXiv:2401.10020) showed that a single model can serve as its own judge: it generates candidate responses, scores them with LLM-as-a-Judge prompting, and then trains on preference pairs built from the high- and low-scoring candidates using iterative DPO. The requirement for human data does not disappear; it shrinks to the seed instruction and evaluation examples used to initialise the model and its judging criteria.
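The generate-and-filter half of such a loop can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: `generate` and `judge` are hypothetical callables standing in for model sampling and LLM-as-a-Judge scoring, and keeping the best and worst candidate per prompt is one simple way to turn judge scores into training pairs.

```python
import random

def build_preference_pairs(prompts, generate, judge, n_candidates=4, min_gap=1.0):
    """One generate-and-filter pass.

    Returns (prompt, chosen, rejected) triples, skipping prompts where the
    judge cannot separate the best and worst candidate by at least `min_gap`.
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = sorted(((judge(prompt, c), c) for c in candidates),
                        key=lambda sc: sc[0])
        low_score, rejected = scored[0]
        high_score, chosen = scored[-1]
        if high_score - low_score >= min_gap:
            pairs.append((prompt, chosen, rejected))
    return pairs

# Toy stand-ins so the sketch runs without a model (purely hypothetical).
rng = random.Random(0)

def toy_generate(prompt):
    return f"{prompt} -> answer v{rng.randint(1, 5)}"

def toy_judge(prompt, response):
    # Pretend the trailing digit is a 1-5 quality score from the judge.
    return float(response[-1])

pairs = build_preference_pairs(["q1", "q2", "q3"], toy_generate, toy_judge)
```

In a real loop the returned pairs would feed a preference-optimisation step, and the updated model would replace both `generate` and `judge` in the next iteration.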

Mixing mode	Human data role	Synthetic data role	Key risk
Parallel	Quality anchor	Coverage, diversity, volume	Ratio mis-calibration
Stage-wise	Seed reward / SFT	Scale out later stages	Reward hacking
Self-improvement	Judge initialisation	Training targets	Compounding teacher errors

Calibrating the ratio

There is no universal correct ratio. Several variables drive the choice:

Teacher quality. If your synthetic generator is a frontier model (GPT-4, Claude 3), its outputs on routine instruction-following tasks may be indistinguishable from strong human annotations in that domain. If it is a smaller model, quality falls off sharply on tasks requiring multi-step reasoning or domain expertise.

Task coverage. Safety-critical behaviours (refusing harmful requests, handling ambiguous consent) should be human-anchored. A model trained to refuse only on AI-generated harm scenarios will cover the scenarios the AI thought to imagine; adversarial red-teamers will find the gaps. Broad factual recall, format following, and code generation can be more aggressively synthetic.

Diversity targets. The main advantage of human-generated data is distributional breadth that reflects real user needs. Synthetic data generators, left unconstrained, oversample the modes they find natural. Persona prompting and Evol-Instruct-style mutation attempt to counteract this, but they are not substitutes for observing actual users.

A rough empirical heuristic from post-training practice: start with all available high-quality human demonstrations for the core behaviour, then supplement with synthetic data up to 5-20x that volume, with a quality filter (rejection sampling) that keeps only completions scoring above a human-calibrated threshold.
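The heuristic above can be sketched as a threshold filter followed by a volume cap. The names `select_synthetic`, `threshold`, and `max_multiple` are illustrative assumptions, and the integer `reward` field stands in for whatever score a human-calibrated reward model would produce.

```python
def select_synthetic(candidates, score, threshold, n_human, max_multiple=10):
    """Keep synthetic examples scoring at or above `threshold`,
    capped at `max_multiple` times the human example count."""
    kept = [c for c in candidates if score(c) >= threshold]
    kept.sort(key=score, reverse=True)  # best first, so the cap drops the weakest
    return kept[: max_multiple * n_human]

# Toy candidates with made-up integer rewards from 0 to 99.
candidates = [{"text": f"completion {i}", "reward": i} for i in range(100)]

selected = select_synthetic(
    candidates,
    score=lambda c: c["reward"],
    threshold=70,      # assumed to be calibrated against human judgements
    n_human=2,
    max_multiple=10,   # inside the 5-20x range mentioned above
)
```

Applying the threshold before the cap means a weak generator yields less data rather than worse data, which is the behaviour the heuristic is after.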

The constitutional loop as a structured mixing strategy

Constitutional AI (CAI) is worth treating separately because it defines a principled schedule rather than an ad hoc ratio. In the supervised phase the model critiques and revises its own responses, guided by a set of human-authored principles (the "constitution"), and is fine-tuned on the revisions. No human labels outputs for harmfulness; on that axis humans only author the normative principles (helpfulness in the original paper still relied on human preference labels). A second RL phase trains a preference model on AI judgements derived from those principles and uses it as the reward signal (RLAIF). The mixing here is qualitative rather than quantitative: the human contribution is the standard, not the examples.

This matters because it shows that reducing the volume of human data does not necessarily mean reducing its authority over the final model's behaviour. Well-chosen human seed material, placed at the right point in the training pipeline, can shape a large synthetic corpus.

When it falls down

Ratio over-optimisation. If you increase the synthetic fraction past the point where the quality filter becomes the binding constraint, you start training on noise. Rejection sampling holds quality only if your reward model was itself well-calibrated on human preferences. Reward hacking is not hypothetical: models learn to produce outputs that score well on the learned reward function, which diverges from actual human judgement as the reward model extrapolates beyond its training support.

Mode collapse in the teacher. A teacher model has blind spots and stylistic habits. Training on its outputs at high volume concentrates the student's output distribution. The phi-1 "Textbooks Are All You Need" result (arXiv:2306.11644) showed that filtered web code combined with GPT-3.5-generated textbooks and exercises could train a strong 1.3B code model, but the authors also report that the model is specialised to Python and is less robust to stylistic variation in prompts than larger models. Expect tasks outside the envelope the teacher covered to degrade.

Self-consuming loops without fresh human data. This is the deepest failure mode. The "Self-Consuming Generative Models Go MAD" paper (Alemohammad et al., 2023, arXiv:2307.01850) showed analytically and empirically, largely on image generators, that without enough fresh real data in each generation, iterative training on synthetic data progressively degrades model quality or diversity, a phenomenon the authors label "Model Autophagy Disorder (MAD)." Each generation loses some probability mass from the tails of the real distribution, and the next generator never corrects for what the previous one already lost. The fix is straightforward in principle: inject enough fresh real data at every generation. In practice, this constrains how aggressively you can reduce human annotation costs over time.
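The tail-loss mechanism can be seen in a toy simulation. This is not the paper's experimental setup: it assumes the "model" is a one-dimensional Gaussian refit by maximum likelihood each generation, the "real" distribution is N(0, 1), and `fresh_fraction` is a hypothetical knob for the share of real samples injected per generation.

```python
import random
import statistics

def run_loop(generations, n_samples, fresh_fraction, seed=0):
    """Refit a Gaussian on its own samples each generation.

    Returns the fitted standard deviation after the final generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_fresh = round(n_samples * fresh_fraction)
    n_synth = n_samples - n_fresh
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
    return sigma

collapsed_std = run_loop(generations=500, n_samples=20, fresh_fraction=0.0)
anchored_std = run_loop(generations=500, n_samples=20, fresh_fraction=0.5)
```

In this toy setting the fully synthetic loop's spread shrinks toward zero, because each refit slightly underestimates the variance and nothing restores it, while the loop that mixes in real samples stays close to the real spread.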

Human data as a false safety net. Conversely, small volumes of human data added to a large synthetic corpus may not meaningfully steer the distribution if the mixing is naive. A 1% human fraction in a 99% synthetic corpus is likely drowned out during training unless the human examples are upweighted or placed at a privileged point in the curriculum.
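One way to upweight is to choose a per-example sampling weight for human data so that it reaches a target share of each batch in expectation. Setting w * n_h / (w * n_h + n_s) equal to the target share and solving gives w = target * n_s / ((1 - target) * n_h). The sketch below assumes synthetic examples keep weight 1; the function name and the 20% target are illustrative.

```python
def human_upweight(n_human, n_synthetic, target_share):
    """Per-example weight for human data (synthetic weight = 1) so that
    human examples make up `target_share` of samples in expectation."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    if n_human <= 0:
        raise ValueError("need at least one human example")
    return target_share * n_synthetic / ((1.0 - target_share) * n_human)

# A 1% human / 99% synthetic corpus, lifted to a 20% effective human share.
n_human, n_synthetic = 100, 9_900
w = human_upweight(n_human, n_synthetic, target_share=0.2)

weights = [w] * n_human + [1.0] * n_synthetic
effective_share = sum(weights[:n_human]) / sum(weights)
```

The resulting `weights` list can be passed to a weighted sampler such as `random.choices(population, weights=weights, k=batch_size)`. Heavy upweighting repeats the same human examples many times per epoch, so it trades dilution for overfitting risk.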

Label shift between synthetic and human preferences. Human annotators and AI judges do not share the same preference function. Disagreements are systematic: AI judges tend to favour longer, more verbose responses; human annotators have biases of their own and disagree with one another; neither generalises perfectly. Training on a mix of human and AI preference labels without accounting for this shift introduces noise that may not average out.
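A cheap first check for this shift is to have both humans and the AI judge label the same set of comparisons and compare the two label sets. The sketch below is an assumption-laden illustration: the record layout (`a`, `b`, `human`, `ai`) and the four toy comparisons are invented, and character length is used as a crude proxy for verbosity.

```python
def audit_labels(pairs):
    """Compare human and AI choices on the same response pairs.

    Each item has responses under "a" and "b", and the chosen side
    ("a" or "b") under "human" and "ai".
    """
    agreement = sum(p["human"] == p["ai"] for p in pairs) / len(pairs)
    # Length preference is only defined when the responses differ in length.
    unequal = [p for p in pairs if len(p["a"]) != len(p["b"])]

    def longer_rate(who):
        if not unequal:
            return float("nan")
        hits = sum(
            p[who] == ("a" if len(p["a"]) > len(p["b"]) else "b")
            for p in unequal
        )
        return hits / len(unequal)

    return {
        "agreement": agreement,
        "human_picks_longer": longer_rate("human"),
        "ai_picks_longer": longer_rate("ai"),
    }

# Invented toy comparisons, for illustration only.
pairs = [
    {"a": "short", "b": "a much longer answer", "human": "a", "ai": "b"},
    {"a": "detailed and long reply", "b": "ok", "human": "a", "ai": "a"},
    {"a": "yes", "b": "yes, with a caveat", "human": "b", "ai": "b"},
    {"a": "a long rambling response", "b": "concise", "human": "b", "ai": "a"},
]
report = audit_labels(pairs)
```

A large gap between `ai_picks_longer` and `human_picks_longer` on a real overlap set would suggest correcting for length before pooling the two label sources.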
