Synthetic Data for Mathematics

Mathematics is where synthetic data generation first showed its teeth. GSM8K (8,500 human-written grade-school problems) was the standard benchmark for years, yet a model fine-tuned on 7 million machine-generated variants of those same problems, produced by a pipeline that runs over a weekend, can outperform one trained on the human-written data alone by double-digit percentage points on the same benchmark. The question is not whether synthetic math data works - it clearly does. The question is why, and where it quietly breaks.

Why mathematics is a privileged domain for synthesis

Math problems have an unusual property: correctness is decidable. Given a candidate solution, a Python interpreter or a symbolic solver can check the final answer in milliseconds. This makes it possible to run what is called rejection sampling: generate N candidate solutions, execute each, keep only those that reach the verified answer, discard the rest.

This is the core loop in Rejection Sampling Fine-Tuning (RFT), formalised by Yuan et al. (2023):

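augmented_corpus = []
for p in seed_set:
    # Sample 100 candidate solutions per problem at temperature 0.7
    solutions = model.generate(p, n=100, temperature=0.7)
    # Keep only the candidates whose final answer matches the reference
    correct = [s for s in solutions if verify(s, p.answer)]
    augmented_corpus.extend(correct)
# Fine-tune on the verified solutions only
model = finetune(model, augmented_corpus)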

The model is used both as a generator and (indirectly) as a filter. Harder problems yield fewer correct samples, so the augmented corpus is naturally sparser exactly where the model is weakest - a useful implicit curriculum signal. Yuan et al. showed that RFT, combining rejection samples from multiple models, lifted LLaMA-7B on GSM8K from 35.9% with supervised fine-tuning alone to 49.3%.
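
The loop can be made concrete with a small runnable sketch. Everything here is illustrative: `Problem`, `toy_generate`, and `rejection_sample` are hypothetical names, and the toy generator is a stand-in for a real sampling call. The verifier simply compares the last number in the solution text with the reference answer, which is one common but crude form of outcome checking.

```python
import random
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Problem:
    question: str
    answer: float  # reference answer, assumed already verified


def extract_final_number(solution: str) -> Optional[float]:
    """Return the last number appearing in the solution text, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return float(matches[-1]) if matches else None


def verify(solution: str, answer: float, tol: float = 1e-6) -> bool:
    """Outcome-only check: does the final number match the reference?"""
    predicted = extract_final_number(solution)
    return predicted is not None and abs(predicted - answer) <= tol


def toy_generate(problem: Problem, n: int, rng: random.Random) -> List[str]:
    """Hypothetical stand-in for a model: correct roughly half the time."""
    outputs = []
    for _ in range(n):
        if rng.random() < 0.5:
            guess = problem.answer
        else:
            guess = problem.answer + rng.randint(1, 9)  # simulated error
        outputs.append(f"Working through the steps, the answer is {guess:g}")
    return outputs


def rejection_sample(
    problems: List[Problem], n: int = 100, seed: int = 0
) -> List[Tuple[str, str]]:
    """Sample n candidates per problem and keep only the verified ones."""
    rng = random.Random(seed)
    corpus = []
    for p in problems:
        candidates = toy_generate(p, n, rng)
        corpus.extend((p.question, s) for s in candidates if verify(s, p.answer))
    return corpus
```

In a real pipeline `toy_generate` would be replaced by calls to the model being trained, and the kept pairs would become the fine-tuning set.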

Question rewriting: the MetaMath strategy

Rejection sampling multiplies solution diversity for a fixed set of problems. A complementary axis is problem diversity: rephrase and restructure the questions themselves.

MetaMath (Yu et al., 2023) introduced question bootstrapping via several rewriting operators:

Operator	What it does	Example
Rephrasing	Paraphrase the surface form	"Find x such that..." -> "What value of x satisfies..."
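Self-verification	Fold the known answer into the question as a statement, mask one given value, and ask for it	From "3 apples at $2 each cost how much?" to "3 apples at $x each cost $6. What is x?"
FOBAR	Keep the question, mask one given value, append the known answer, and ask for the masked value	From "3 apples at $2 each cost how much?" to "3 apples at $x each cost how much? If the answer is $6, what is x?"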
Backward reasoning	Start from the target, reason backwards	Ask for preconditions instead of consequences

Applied to the GSM8K and MATH training splits, this produced the MetaMathQA dataset, used to fine-tune LLaMA-2. MetaMath-7B reached 66.4% on GSM8K - a gain of 11.5 percentage points over the then-best same-size models - without any additional human annotation.
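
A FOBAR-style rewrite is mechanical enough to sketch in a few lines. The version below is an assumption-laden illustration, not the MetaMath implementation: `backward_variants` is a hypothetical helper, the appended sentence is illustrative wording, and masking is done with a plain regular expression. It does not check that the masked value is uniquely recoverable, so a real pipeline would still need a filtering step.

```python
import re
from typing import List, Tuple

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def backward_variants(question: str, answer: str) -> List[Tuple[str, str]]:
    """Mask each number in the question in turn.

    Returns (new_question, new_answer) pairs, where the new answer is
    the value that was masked out.
    """
    variants = []
    for match in NUMBER.finditer(question):
        masked = question[: match.start()] + "x" + question[match.end():]
        new_question = (
            f"{masked} If we know the answer to the above question is "
            f"{answer}, what is the value of unknown variable x?"
        )
        variants.append((new_question, match.group()))
    return variants
```

Each number in the original question yields one new training question, so a problem with three given quantities produces three backward variants.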

The intuition is that a model trained on many structural views of the same underlying concept forms a more robust internal representation than one trained on a single phrasing. This mirrors standard data augmentation in vision (crops, flips, colour jitter), but in the discrete symbolic domain.

Distillation and the teacher-student gap

Both RFT and MetaMath use the same model (or a small family of checkpoints) as generator and learner. A more powerful variant uses a stronger teacher to annotate a weaker student's training set.

This is knowledge distillation applied to reasoning traces rather than logits. A GPT-4-class teacher generates step-by-step chain-of-thought (CoT) solutions for thousands of problems; a 7B student is then fine-tuned on those traces. A student trained this way can outperform the same model trained only on human-written CoT, because the teacher's traces tend to be more systematic and better matched to the difficulty level the student needs to learn.

DeepSeekMath (Shao et al., 2024) paired this kind of chain-of-thought fine-tuning with continued pre-training on 120 billion math-related tokens filtered from Common Crawl, then applied Group Relative Policy Optimisation (GRPO) - a variant of PPO that drops the learned value function and instead scores each sampled solution relative to the other solutions in its group, using the group's mean reward as the baseline. Their 7B model reached 51.7% on the competition-level MATH benchmark, close to GPT-4 at the time of publication.
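
The group-relative scoring at the heart of GRPO reduces to a short calculation. The sketch below shows only that normalisation step, under the assumption that each sampled solution has already received a scalar reward; the function name, the use of the population standard deviation, and the `eps` stabiliser are illustrative choices rather than details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and spread of its own group.

    A solution is credited for being better than its siblings sampled
    from the same prompt, not for its absolute reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary correctness rewards, a group in which every sample is right (or every sample is wrong) yields all-zero advantages, so such prompts contribute no learning signal for that step.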

The critical insight from this line of work: pre-training data quality (web-scale math text, filtered rigorously) provides the foundation; synthetic fine-tuning data provides the reasoning style. Neither alone is sufficient at the top of the capability curve.

Process reward models and step-level filtering

Verifying a final numeric answer filters out many wrong solutions but is coarse. A model can reach the correct answer via a flawed reasoning chain, or conversely produce a mostly-correct chain with an arithmetic slip at the last step.

Process Reward Models (PRMs) address this by labelling each step in a solution rather than only the final answer. Lightman et al. (2023) showed that process supervision significantly outperforms outcome supervision on the MATH benchmark, releasing the PRM800K dataset of 800,000 step-level human annotations.

Collecting 800k human labels is expensive. Math-Shepherd (Wang et al., 2023) automated this: for each solution step, the model generates multiple completions forward to the end; a step is labelled "correct" if at least one completion reaches the right answer. This creates a noisy but scalable proxy for step-level correctness, allowing PRM training without any human annotation.
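
The labelling rule can be sketched on a toy problem. Assume the task is to add 2, 3, and 4, and a solution is written as a list of running totals. `label_steps` and `toy_complete` are hypothetical names; the toy completer is deterministic, whereas a real completer is a sampled model, which is why several completions per step are drawn.

```python
from typing import Callable, List, Sequence

NUMBERS = [2, 3, 4]  # toy problem: add these up; the answer is 9


def toy_complete(prefix: Sequence[float]) -> float:
    """Hypothetical completer: finishes the sum with no further mistakes."""
    return prefix[-1] + sum(NUMBERS[len(prefix):])


def label_steps(
    steps: Sequence[float],
    complete: Callable[[Sequence[float]], float],
    answer: float,
    k: int = 4,
) -> List[int]:
    """Label step i as 1 if any of k completions from steps[:i] reaches the answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        finals = [complete(prefix) for _ in range(k)]
        labels.append(int(any(f == answer for f in finals)))
    return labels
```

For the flawed chain `[2, 6, 10]` the first step is still recoverable, so it is labelled correct, while the second and third steps are labelled incorrect because no completion from them reaches 9.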

The result is a two-stage synthetic pipeline:

1. Generate many candidate solutions per problem via rejection sampling.
2. Score each step using the automated PRM; keep solutions where high-scoring steps dominate.

This filters out solutions that "got lucky" on the final answer via wrong intermediate steps - a measurable quality improvement over outcome-only filtering.
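
The second stage can be sketched as a simple filter over per-step scores. The data layout (a dict with a `step_scores` list), the function names, and both thresholds are assumptions made for illustration; "high-scoring steps dominate" is interpreted here as a minimum fraction of steps clearing a score threshold.

```python
from typing import Dict, List


def high_score_fraction(step_scores: List[float], step_threshold: float = 0.5) -> float:
    """Fraction of steps whose PRM score clears the threshold."""
    if not step_scores:
        return 0.0
    return sum(s >= step_threshold for s in step_scores) / len(step_scores)


def filter_by_process_score(
    candidates: List[Dict],
    step_threshold: float = 0.5,
    min_fraction: float = 0.8,
) -> List[Dict]:
    """Keep answer-verified candidates whose steps are mostly high-scoring."""
    return [
        c
        for c in candidates
        if high_score_fraction(c["step_scores"], step_threshold) >= min_fraction
    ]
```

A stricter alternative is to require the minimum step score to clear the threshold, which rejects any solution containing even one doubtful step.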

When it falls down

Capability ceiling. Rejection sampling cannot produce solutions the generator cannot produce even once in a large sample. If the generator has a hard conceptual block - say, integral transforms or combinatorial proofs - no amount of sampling will synthesise correct examples of those types. The synthetic corpus inherits the generator's blind spots and distributional biases.

Distribution collapse. If you repeatedly fine-tune a model on its own outputs, each round draws from a slightly narrower distribution than the last. Over several rounds, the model can converge on a narrow style of solution - typically over-formatted, verbosely explicit, and poor at novel problem structures. This is a form of model collapse, analogous to mode collapse in GANs or to recursive self-distillation without external grounding.

Verifier leakage. When the verifier (say, a Python eval() of the numeric answer) becomes part of the training signal, models learn to produce solutions whose final line passes the verifier rather than solutions that are genuinely correct. Numeric answers can match by coincidence on constrained integer problems; this inflates benchmark numbers without reflecting real reasoning.

Benchmark contamination. GSM8K and MATH are small enough that their test problems (or near-paraphrases) can appear in web-scraped pre-training corpora, question-rewriting augmentations, or popular synthetic datasets. Scores from models that have seen paraphrased test problems overstate how well those models generalise. Evaluating on new, unseen distributions (e.g., competition problems from recent years) is the most reliable check.
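
A first-pass contamination screen can be as simple as checking for shared word n-grams between a test problem and the training corpus. The sketch below is a minimal illustration with hypothetical function names and an assumed window of 8 tokens. It catches verbatim and near-verbatim copies only; paraphrases with different wording pass straight through, which is why the held-out evaluation described above still matters.

```python
import re
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams; empty if the text has fewer than n tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(test_text: str, train_texts: List[str], n: int = 8) -> List[int]:
    """Indices of training documents sharing at least one n-gram with the test text."""
    test_grams = ngrams(test_text, n)
    return [i for i, doc in enumerate(train_texts) if test_grams & ngrams(doc, n)]
```

Any test problem for which `find_overlaps` returns a non-empty list is a candidate for exclusion from the evaluation set or for manual review.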

Stylistic homogenisation. Large synthetic datasets from a single teacher impose that teacher's solution style uniformly. Student models can become brittle when test-time problems are presented in a different format or require a different solution strategy (e.g., geometric reasoning versus algebraic manipulation).
