Synthetic Data for Code
Generating synthetic code training data via instruction synthesis, distillation, and execution-based filtering lets small models punch well above their weight, but only when a reliable verifier anchors the loop.
A 1.3-billion-parameter model trained on synthetically generated "textbook-quality" Python code achieved 50.6% pass@1 on HumanEval in 2023, matching models ten times its size trained on raw internet scrapes. That result, from Microsoft Research's phi-1, forced a reassessment of a long-held assumption: that more code data always beats better code data. The key was not a larger model or a longer training run, but a deliberate curation of what the model saw.
Why code is different from general text
Natural language has no ground-truth verifier. Whether a paraphrase is "correct" is a matter of degree. Code is different: you can execute it.
That property transforms the synthetic data pipeline. For text generation, rejection sampling filters candidates based on a reward model's score, which is itself learned and imperfect. For code generation, you can run unit tests, and a solution either passes or fails. This binary signal is cheap to obtain and, as long as the tests actually cover the intended behaviour, far more reliable than a learned score, which is why execution-based filtering has become the central quality-control mechanism for synthetic code data.
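A minimal sketch of such an execution check, assuming candidates and tests are plain Python source strings and that tests are written as bare `assert` statements. The function name `passes_tests` and the timeout value are illustrative, and a subprocess is not a security sandbox: real pipelines usually isolate untrusted code in containers or similar.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate followed by its tests exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # The tests are assert statements appended after the candidate code.
        script.write_text(solution_src + "\n\n" + test_src)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # non-terminating candidates count as failures
        return result.returncode == 0


tests = "assert reverse([1, 2, 3]) == [3, 2, 1]\nassert reverse([]) == []\n"
good = "def reverse(xs):\n    return xs[::-1]\n"
bad = "def reverse(xs):\n    return xs\n"
```

Here `passes_tests(good, tests)` accepts the first candidate and `passes_tests(bad, tests)` rejects the second, because the failed assertion makes the interpreter exit with a non-zero status.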
The second distinguishing feature is structure. Code has a syntax tree, type annotations, docstrings, function signatures, and test files, all co-located in a repository. That metadata is free supervision. A model trained to predict masked function bodies given docstrings is learning from structure the human developer already wrote; no synthetic generation is needed. Synthetic data pipelines for code typically combine this structural supervision with generated diversity, using a teacher model to invent new tasks the corpus does not naturally cover.
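To make the "free supervision" idea concrete, here is a rough sketch that mines (signature plus docstring, body) pairs from existing Python source with the standard `ast` module. The helper name `docstring_body_pairs` and the output format are assumptions for illustration, not a format used by any particular paper.

```python
import ast


def docstring_body_pairs(source: str) -> list[tuple[str, str]]:
    """Extract (signature + docstring, body) pairs from documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if doc is None or len(node.body) < 2:
            continue  # skip undocumented functions and docstring-only stubs
        # node.body[0] is the docstring expression; the rest is the target.
        body_src = "\n".join(
            ast.get_source_segment(source, stmt) for stmt in node.body[1:]
        )
        args = ast.unparse(node.args)
        prompt = f'def {node.name}({args}):\n    """{doc}"""'
        pairs.append((prompt, body_src))
    return pairs


sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''
pairs = docstring_body_pairs(sample)
```

Only `add` yields a training pair here; `undocumented` is skipped because it has no docstring to condition on.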
Instruction synthesis: turning code into problems
One of the earliest large-scale methods for synthetic instruction data was Self-Instruct (Wang et al., 2023), later adapted to code by projects such as Code Alpaca. It seeded a model with a small pool of human-written tasks (175 in the original paper), then prompted it to generate new task descriptions and their solutions. The key insight was in-context bootstrapping: a model primed with a few good examples will mostly produce coherent variants, and the bad or near-duplicate ones can be filtered.
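The bootstrapping loop can be sketched as follows. This is an illustration under assumptions: `generate` stands for any callable that maps a prompt string to model text, the prompt layout is invented, and `difflib` is used as a stand-in similarity measure (the Self-Instruct paper filters with ROUGE-L overlap instead).

```python
import random
from difflib import SequenceMatcher


def build_prompt(pool: list[str], k: int, rng: random.Random) -> str:
    """Sample k in-context tasks from the pool and ask for one more."""
    shots = rng.sample(pool, k)
    lines = [f"Task {i + 1}: {task}" for i, task in enumerate(shots)]
    lines.append(f"Task {k + 1}:")
    return "\n".join(lines)


def is_novel(candidate: str, pool: list[str], max_similarity: float = 0.7) -> bool:
    """Reject candidates that are near-duplicates of tasks already in the pool."""
    return all(
        SequenceMatcher(None, candidate.lower(), task.lower()).ratio() < max_similarity
        for task in pool
    )


def bootstrap(seed_tasks, generate, rounds: int, k: int = 3, seed: int = 0) -> list[str]:
    """Grow the task pool by repeatedly prompting with sampled examples."""
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        candidate = generate(build_prompt(pool, k, rng)).strip()
        if candidate and is_novel(candidate, pool):
            pool.append(candidate)  # accepted tasks can seed later prompts
    return pool
```

Because accepted tasks re-enter the pool, later prompts mix human-written and model-written examples, which is what lets a small seed set grow into a large corpus.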
WizardCoder applied the Evol-Instruct principle directly to code. Starting from a base instruction (e.g., "write a function that reverses a list"), the method applied a sequence of rewriting operations:
- depth: add constraints, edge cases, or error handling
- breadth: create a conceptually related but distinct task
Each operation produced harder or more varied instructions without human authoring. The result was a distribution of problems spanning easy warmup questions to multi-step algorithmic challenges, all derived from a comparatively small seed set.
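A sketch of how the two operations could be chained. The template wording is hypothetical and is not WizardCoder's actual prompt text; `rewrite` stands for any callable that sends a prompt to a teacher model and returns its reply.

```python
# Hypothetical prompt templates for the two rewriting operations.
EVOLVE_TEMPLATES = {
    "depth": (
        "Rewrite the programming task below so that it is harder. "
        "Add one new constraint, edge case, or error-handling requirement.\n\n"
        "Task: {task}\n\nRewritten task:"
    ),
    "breadth": (
        "Write a new programming task that is conceptually related to the "
        "task below but requires a different solution.\n\n"
        "Task: {task}\n\nNew task:"
    ),
}


def evolve_prompt(task: str, operation: str) -> str:
    """Fill the template for one rewriting operation."""
    return EVOLVE_TEMPLATES[operation].format(task=task)


def evolve_chain(seed_task: str, operations: list[str], rewrite) -> list[str]:
    """Apply the operations in order and return every intermediate instruction."""
    chain = [seed_task]
    for operation in operations:
        chain.append(rewrite(evolve_prompt(chain[-1], operation)).strip())
    return chain
```

Keeping every intermediate instruction, rather than only the final one, is what yields the spread from easy to hard problems.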
OSS-Instruct (Magicoder, Wei et al., ICML 2024) addressed a subtlety that Evol-Instruct missed: purely model-generated instructions tend to cluster around patterns already well represented in the model's pretraining data. By seeding generation with random snippets of real open-source code, OSS-Instruct produced broader coverage of library APIs, idioms, and domain-specific patterns. MagicoderS-CL-7B, which is trained on the 75,000 OSS-Instruct examples and then further fine-tuned on Evol-Instruct data, surpassed ChatGPT on HumanEval+ (66.5% vs 65.9% pass@1).
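The seeding step might look like the sketch below. The snippet-length bounds and the prompt wording are assumptions for illustration, not the values or text used in the Magicoder paper.

```python
import random


def sample_snippet(source: str, rng: random.Random,
                   min_lines: int = 5, max_lines: int = 15) -> str:
    """Pick a random run of consecutive lines from a real source file."""
    lines = source.splitlines()
    n = min(len(lines), rng.randint(min_lines, max_lines))
    start = rng.randint(0, len(lines) - n)
    return "\n".join(lines[start:start + n])


def oss_instruct_prompt(snippet: str) -> str:
    """Ask the teacher for a problem and solution inspired by the snippet."""
    return (
        "Use the code snippet below as inspiration to write a self-contained "
        "programming problem and a correct solution.\n\n"
        f"Snippet:\n{snippet}\n\n"
        "Problem:"
    )
```

Because the snippet is sampled from real repositories rather than from the teacher's own output, the topics it suggests are not limited to what the teacher would propose unprompted.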
Distillation and the "textbook quality" filter
Phi-1 (Gunasekar et al., 2023) used a two-stage data strategy. The first stage filtered The Stack and StackOverflow for "textbook quality": a classifier trained to identify pedagogically clear, well-commented code. The second stage generated synthetic exercises and solutions using GPT-3.5, explicitly prompted to produce material that teaches a concept step by step, with worked examples and edge-case coverage.
The result was a 7-billion-token training set, far smaller than typical code pretraining corpora (StarCoder was trained on roughly one trillion tokens). Yet phi-1 matched much larger models on benchmarks, demonstrating that textbook-style density of signal per token can compensate for volume.
The distillation mechanism here is subtle: GPT-3.5 is not just generating code; it is explaining reasoning, annotating intent, and constructing progressions of difficulty. The student model absorbs not only correct solutions but the pedagogical scaffolding around them. This transfers more cleanly than raw output distillation, where the student only sees the final answer.
A rough framing of the signal density contrast:
| Data source | Tokens | Pass@1 HumanEval |
|---|---|---|
| Raw web scrape (typical) | ~1T+ | baseline |
| phi-1 textbook + synthetic | ~7B | ~50.6% at 1.3B params |
| OSS-Instruct (MagicoderS-CL-7B) | ~75K OSS-Instruct examples, plus Evol-Instruct data | ~66.5% (HumanEval+) |
The numbers are not directly comparable (different model sizes, benchmark variants, and training regimes), but the direction is consistent: curated and synthetic data can stand in for orders of magnitude more raw tokens.
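For reference, pass@k figures like those in the table are commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw n samples per problem, count the c that pass, and estimate the chance that at least one of k samples passes. A small sketch follows; whether each cited paper used exactly this procedure is not stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 minus the probability that a random size-k subset has no passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass.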
Rejection sampling with execution as the verifier
Given a pool of candidate solutions for a programming problem, rejection sampling simply keeps those that pass a test suite and discards the rest. The kept solutions become fine-tuning data.
The loop can run iteratively. Start with a base model, sample K solutions per problem, execute tests, keep passes, fine-tune, repeat. Because the model improves each round, later rounds typically produce higher pass rates and solve problems that earlier rounds could not.
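The loop can be sketched with the model-specific parts left as callables. Everything here is an assumption for illustration: `sample_with(model, prompt)` returns one candidate solution, `verify(candidate, tests)` returns a pass/fail boolean, and `fine_tune(model, data)` returns an updated model.

```python
def rejection_sample(problems, sample, verify, k: int) -> list[dict]:
    """Sample up to k candidates per problem and keep distinct ones that pass."""
    kept = []
    for problem in problems:
        seen = set()
        for _ in range(k):
            candidate = sample(problem["prompt"])
            if candidate in seen:
                continue  # exact duplicates add no new training signal
            seen.add(candidate)
            if verify(candidate, problem["tests"]):
                kept.append({"prompt": problem["prompt"], "completion": candidate})
    return kept


def run_rounds(model, problems, sample_with, verify, fine_tune, k: int, rounds: int):
    """Alternate between sampling with the current model and fine-tuning on passes."""
    for _ in range(rounds):
        data = rejection_sample(
            problems, lambda prompt: sample_with(model, prompt), verify, k
        )
        model = fine_tune(model, data)
    return model
```

Deduplicating before verification is a design choice in this sketch: it avoids re-running tests on identical candidates and keeps one popular solution from dominating the fine-tuning set.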
The practical bottleneck is test coverage. If the test suite only checks the happy path, a model can learn to special-case inputs rather than implement the algorithm correctly. The quality of the verifier is the ceiling of the loop. Research workarounds include:
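- Test-input mutation: vary the inputs of existing tests programmatically and label the new inputs with a trusted reference solution. This differs from classical mutation testing, which mutates the program itself to measure how strong a test suite is.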
- Specification mining: prompt a separate model to generate additional test cases from the docstring, then cross-check them against known-good reference implementations.
- Property-based testing: write tests as invariants (output length equals input length, output is sorted) rather than fixed input-output pairs.
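As an illustration of the last two ideas, the sketch below checks a candidate sort function against invariants and against a reference implementation on randomly generated inputs. The toy functions and input ranges are assumptions chosen for the example; libraries such as Hypothesis automate this kind of input generation.

```python
import random


def reference_sort(xs):
    """Trusted reference implementation."""
    return sorted(xs)


def special_cased_sort(xs):
    """Passes the single happy-path test [3, 1, 2] without really sorting."""
    if xs == [3, 1, 2]:
        return [1, 2, 3]
    return xs


def random_inputs(n_cases: int, rng: random.Random, max_len: int = 8):
    """Generate random integer lists, including empty ones."""
    return [
        [rng.randint(-50, 50) for _ in range(rng.randint(0, max_len))]
        for _ in range(n_cases)
    ]


def satisfies_properties(fn, inputs) -> bool:
    """Check invariants: same length, ascending order, same multiset of items."""
    for xs in inputs:
        out = fn(list(xs))
        if len(out) != len(xs):
            return False
        if any(a > b for a, b in zip(out, out[1:])):
            return False
        if sorted(out) != sorted(xs):
            return False
    return True


def agrees_with_reference(fn, reference, inputs) -> bool:
    """Cross-check the candidate against the reference on every input."""
    return all(fn(list(xs)) == reference(list(xs)) for xs in inputs)


inputs = random_inputs(200, random.Random(0))
```

The special-cased function survives the original fixed test but fails both checks on `inputs`, which is exactly the failure mode that happy-path suites miss.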
DeepSeek-Coder (Guo et al., 2024), a family from 1.3B to 33B parameters, used repository-level fill-in-the-blank training alongside execution-based filtering for instruction-tuned variants, achieving competitive benchmark results against much larger proprietary models.
When it falls down
Verifier leakage. If the problems or tests used for rejection sampling overlap with the evaluation benchmark, pass rates are inflated. This is common: HumanEval problems are well known, and a teacher model prompted to generate "similar problems" will often regenerate near-duplicates. Evaluation on EvalPlus (which adds many more tests to each HumanEval problem) or LiveCodeBench (which keeps collecting newly published problems) often reveals significantly lower performance than HumanEval alone suggests.
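One common mitigation is to drop synthetic problems that share long word sequences with benchmark items. The sketch below uses word-level 5-gram overlap; the n-gram size and the 0.5 threshold are illustrative assumptions, and paraphrased duplicates will slip past a purely lexical check like this one.

```python
import re


def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, benchmark_item: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur in the benchmark item."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(benchmark_item, n)) / len(cand)


def decontaminate(candidates: list[str], benchmark: list[str],
                  threshold: float = 0.5, n: int = 5) -> list[str]:
    """Keep candidates whose overlap with every benchmark item is below threshold."""
    return [
        c for c in candidates
        if all(overlap_ratio(c, b, n) < threshold for b in benchmark)
    ]
```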
Distribution collapse. Evol-Instruct and similar methods start from a seed set. After many evolutionary steps, the synthetic corpus can become homogeneous: the same algorithmic patterns, the same library calls, the same problem framing. A model trained on this distribution learns well within it but fails on tasks that require genuinely unfamiliar API usage or domain-specific knowledge outside the seed. OSS-Instruct partially addresses this by anchoring each generation in real code, but the anchor corpus itself has coverage gaps.
Model collapse from recursive training. When a model is fine-tuned on synthetic data it generated itself, and the updated model then generates the next round of data, the distribution of outputs tends to narrow with each iteration. Shumailov et al. (2023) showed that training on model-generated data causes "irreversible defects" where the tails of the original distribution disappear. For code, this means rare but valid idioms, less popular languages, and unusual but correct algorithms get progressively under-represented. Grounding each round in real code snippets or human-authored problems is the primary mitigation.
Correctness versus quality. Passing unit tests is necessary but not sufficient for good code. A solution may pass all tests while being unreadable or algorithmically inefficient, relying on deprecated APIs, or introducing subtle security issues. Execution-based filtering with unit tests cannot penalise these properties. Adding a secondary quality filter (a reward model trained on human code review labels, or a static analyser) adds complexity and, in the reward-model case, introduces the same reward-hacking risks as reward models in general NLP.
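A static check of this kind can be sketched with the standard `ast` module. The three rules below are hypothetical examples of what a pipeline might flag, not a recommended or complete rule set.

```python
import ast


def quality_issues(source: str) -> list[str]:
    """Return static findings for a candidate that already passed its tests."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["syntax error"]
    issues = []
    for node in ast.walk(tree):
        # Dynamic evaluation of strings is a frequent source of security bugs.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            issues.append(f"call to {node.func.id}()")
        # A bare except hides unrelated failures.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            issues.append("bare except")
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node) is None:
            issues.append(f"function {node.name} has no docstring")
    return issues


risky = "def run(expr):\n    try:\n        return eval(expr)\n    except:\n        return None\n"
clean = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
```

`quality_issues(clean)` returns an empty list, while `risky` triggers all three rules even though it could pass a functional test suite.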
Language and domain bias. Most synthetic code pipelines start from Python-heavy corpora, and teachers prompted in Python tend to generate Python examples even when instructed otherwise. Models fine-tuned on such data perform substantially better in Python than in less-represented languages, and the gap widens with each training iteration that defaults to the same seed.
Further reading
- Gunasekar et al. (2023). "Textbooks Are All You Need." https://arxiv.org/abs/2306.11644
- Wei et al. (2024). "Magicoder: Empowering Code Generation with OSS-Instruct." ICML 2024. https://arxiv.org/abs/2312.02120
- Guo et al. (2024). "DeepSeek-Coder: When the Large Language Model Meets Programming." https://arxiv.org/abs/2401.14196
- Shumailov et al. (2023). "The Curse of Recursion: Training on Generated Data Makes Models Forget." https://arxiv.org/abs/2305.17493