Distilling Reasoning Traces

A T5 model fine-tuned purely on (question, answer) pairs tops out around 8% on grade-school maths (GSM8K). Fine-tune the same model on chain-of-thought traces generated by a 540B teacher and it reaches 22% with the same number of training examples. The difference is not extra parameters or more data; it is the presence of intermediate reasoning steps as a training target. That observation motivates much of the work on reasoning-trace distillation.

What a reasoning trace actually is

A reasoning trace is the scratchpad text a model produces between receiving a question and emitting an answer. It might be natural language ("First, convert miles to kilometres. 5 miles = 8 km. Then..."), pseudocode, symbolic algebra, or a mixture. The trace is not evaluated by the downstream task metric - only the final answer is - but it conditions each next token during autoregressive generation, effectively routing the computation through an explicit intermediate representation.
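One way to write down the conditioning described above is as a factorisation of the model's output distribution. The notation is an assumption introduced here for illustration: q is the question, r the trace tokens, a the answer tokens, and θ the model parameters.

```latex
p_\theta(r, a \mid q)
  \;=\; \prod_{t=1}^{|r|} p_\theta\!\left(r_t \mid q,\, r_{<t}\right)
  \;\cdot\; \prod_{j=1}^{|a|} p_\theta\!\left(a_j \mid q,\, r,\, a_{<j}\right)
```

Every answer token is predicted with the whole trace in its context, which is why a trace that is never scored directly still shapes the answer.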

Large models trained at scale tend to produce traces that are factually grounded and procedurally coherent. Small models trained only on (input, output) pairs produce traces that look plausible but lead to wrong answers at much higher rates. The conjecture - supported by several ablations - is that small models lack the capacity to discover good reasoning strategies from final-answer supervision alone. Providing the trace as a training target side-steps this discovery problem.

The distillation pipeline

The core pipeline has three stages:

1. SAMPLE  - query teacher T with prompt P; collect N traces per question Q
             output: {(Q, trace_i, answer_i) for i in 1..N}

2. FILTER  - keep only traces where answer_i == gold_label
             (rejection sampling: discard traces that led to wrong answers)

3. FINE-TUNE - train student S on {(Q, trace_i + answer_i)} pairs with cross-entropy loss
               the trace followed by the answer is the full target sequence

The filter step - rejection sampling - is the mechanism that turns a stochastic teacher into a cleaner training signal. Without it, erroneous traces propagate into the student and can leave it performing worse than answer-only fine-tuning.
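The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: `toy_teacher` is a hypothetical stand-in for a real teacher call (it adds two integers and is deliberately wrong some of the time), and the `Answer:` suffix used to build the target sequence is an assumed formatting convention.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    trace: str
    answer: str


def toy_teacher(question: str, rng: random.Random) -> tuple[str, str]:
    """Hypothetical teacher: adds two integers, off by one about 30% of the time."""
    a, b = (int(x) for x in question.split("+"))
    total = a + b + (1 if rng.random() < 0.3 else 0)
    return f"Add {a} and {b} to get {total}.", str(total)


def sample_traces(questions, n, rng):
    """Stage 1: collect n (trace, answer) samples per question."""
    return [Sample(q, *toy_teacher(q, rng)) for q in questions for _ in range(n)]


def rejection_filter(samples, gold):
    """Stage 2: keep only samples whose final answer matches the gold label."""
    return [s for s in samples if s.answer == gold[s.question]]


def to_training_pairs(samples):
    """Stage 3 input: (prompt, target) pairs where the target is trace + answer."""
    return [(s.question, f"{s.trace}\nAnswer: {s.answer}") for s in samples]


gold = {"2+3": "5", "10+7": "17"}
rng = random.Random(0)
samples = sample_traces(list(gold), n=8, rng=rng)
kept = rejection_filter(samples, gold)
pairs = to_training_pairs(kept)
print(f"kept {len(kept)} of {len(samples)} sampled traces")
```

The resulting `pairs` would then be handed to whatever supervised fine-tuning loop the student uses; that part is omitted because it depends on the training framework.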

Trace diversity matters. Yuan et al. (2023) showed that collecting traces from multiple sampling temperatures, or from multiple teacher checkpoints, and mixing them into the fine-tuning set, yields larger gains than an equal number of traces sampled at a single temperature. The intuition: a student exposed to several valid reasoning paths learns a more robust mapping rather than memorising one surface form.

Fine-tuning regime	LLaMA-7B GSM8K accuracy
SFT on gold answers only	35.9%
RFT (rejection-sampled traces, same teacher)	41.7%
RFT (traces from multiple teachers/temps)	49.3%

Source: Yuan et al., arXiv 2308.01825.
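Mixing traces from several sources only adds diversity if near-identical traces are not counted twice. The sketch below pools already-filtered traces and drops duplicates per question. The duplicate test (exact match after lowercasing and collapsing whitespace) is an assumption made for illustration and is not necessarily the criterion used in the cited paper; the source labels are hypothetical.

```python
import re
from collections import defaultdict


def normalise(trace: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", trace.strip().lower())


def mix_and_dedupe(pools):
    """pools maps a source label to a list of (question, trace) pairs.

    Returns (question, trace, source) triples, keeping the first occurrence
    of each distinct trace per question.
    """
    seen = defaultdict(set)
    mixed = []
    for source, pairs in pools.items():
        for question, trace in pairs:
            key = normalise(trace)
            if key not in seen[question]:
                seen[question].add(key)
                mixed.append((question, trace, source))
    return mixed


pools = {
    "temp=0.3": [("q1", "Half of 10 is 5."), ("q1", "half of 10  is 5.")],
    "temp=1.0": [("q1", "10 / 2 = 5."), ("q1", "Half of 10 is 5.")],
}
mixed = mix_and_dedupe(pools)
print(len(mixed), "distinct traces kept")
```

Of the four traces in the toy pools, two survive: one from each source.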

Beyond answer filtering: explanation traces

Orca (Mukherjee et al., 2023, arXiv 2306.02707) extended the basic pipeline by augmenting system prompts with an explicit instruction for GPT-4 to produce explanation traces - not just a final answer. The student (13B parameters) learned from three signal types simultaneously: the explanation trace, the step-by-step thought process, and the final answer. Compared with Vicuna-13B, a model of the same size fine-tuned on (prompt, response) pairs without traces, Orca scored more than 100% higher in relative terms on Big-Bench Hard, that is, more than double the score. This illustrates a key design choice: the richer the teacher signal you capture in the fine-tuning data, the more you tend to transfer.

The constitutional AI loop (Bai et al., 2022, arXiv 2212.08073) uses a related idea in a different direction. The model critiques and revises its own outputs through chain-of-thought style reasoning, and the revised outputs become supervised fine-tuning data. This is self-distillation rather than teacher-student distillation, but the mechanism is the same: reasoning steps in the intermediate computation become training targets, not just implicit gradient paths.

When it falls down

Trace hallucination propagates. Rejection sampling catches wrong final answers, but a trace can contain factual errors or invalid logical steps while still arriving at the correct answer by coincidence or by error cancellation. The student learns the erroneous intermediate steps. This is particularly sharp in multi-hop factual reasoning, where a trace might state an incorrect intermediate fact but happen to reach the right entity anyway.
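A cheap partial defence is to check the steps as well as the final answer. The sketch below re-computes simple `a op b = c` statements found in a trace and reports the ones that do not hold. It is an illustration under narrow assumptions: it only sees arithmetic written in that exact form, and it does nothing for factual or logical errors.

```python
import operator
import re

# Matches statements such as "4 * 6 = 24" or "7.5 - 2 = 5.5".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def bad_steps(trace: str, tol: float = 1e-6) -> list[str]:
    """Return every 'a op b = c' statement in the trace that is numerically false."""
    errors = []
    for match in STEP.finditer(trace):
        a, op, b, claimed = match.groups()
        if op == "/" and float(b) == 0:
            errors.append(match.group(0))  # division by zero is never a valid step
            continue
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(claimed)) > tol:
            errors.append(match.group(0))
    return errors


# Two wrong steps cancel out: the correct final answer is 22.
lucky = "4 * 6 = 26. Then 26 - 2 = 22."
clean = "4 * 6 = 24. Then 24 - 2 = 22."
print(bad_steps(lucky))  # ['4 * 6 = 26', '26 - 2 = 22']
print(bad_steps(clean))  # []
```

The first trace would pass an answer-only filter, yet both of its steps are wrong, which is exactly the error-cancellation case described above.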

Distribution mismatch at inference. A student fine-tuned on teacher traces generates its own traces at inference time. If the student's trace quality degrades mid-sequence - a common failure on problems that need longer traces than those seen in training - the final answer quality drops sharply. This is a form of exposure bias: the student was never trained on its own imperfect traces as prefixes. Adding a verification or self-consistency step at inference (Wang et al., 2023, arXiv 2203.11171) partially mitigates this.
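Self-consistency amounts to sampling several traces for the same question and taking a majority vote over their final answers. A minimal sketch of the voting step follows; it assumes the answers have already been extracted from the sampled traces as strings, and the function name is illustrative.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Majority vote over final answers; returns (answer, vote share).

    Empty or whitespace-only answers are ignored. Returns (None, 0.0)
    when no usable answer is present.
    """
    votes = Counter(a.strip() for a in answers if a and a.strip())
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values())


sampled = ["18", "18", "20", "18", "17"]
print(self_consistent_answer(sampled))  # ('18', 0.6)
```

The vote share doubles as a rough confidence signal: a low share suggests the student's traces are diverging and the answer deserves less trust.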

Specialisation vs. generalisation. Fu et al. (2023, arXiv 2301.12726) showed that smaller models trained on domain-specific reasoning traces become better at that domain at the cost of performance on other tasks. A model distilled on maths reasoning traces degrades measurably on language understanding benchmarks. If the deployment target is narrow, this is an acceptable trade; if the student needs to be a general assistant, over-distilling on one reasoning domain hurts.

Collapse under self-distillation loops. When a model generates its own fine-tuning data and retrains, the data distribution narrows each iteration. Rare but valid reasoning strategies disappear from the sample pool; the model collapses toward a smaller set of high-probability traces. This is the synthetic data collapse problem applied specifically to reasoning: the model becomes more confident but less calibrated, and it fails catastrophically on novel problem structures that do not fit its dominant trace pattern.
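One simple way to watch for this narrowing is to track how many distinct traces each round of self-generated data contains. The sketch below uses the fraction of distinct traces after whitespace and case normalisation; that metric and the toy rounds are assumptions chosen for illustration, and a real monitor would probably want a looser notion of "same strategy".

```python
def distinct_ratio(traces):
    """Fraction of traces that are distinct after lowercasing and collapsing whitespace."""
    if not traces:
        return 0.0
    unique = {" ".join(t.lower().split()) for t in traces}
    return len(unique) / len(traces)


# Toy data: three rounds of traces sampled for the same question.
rounds = [
    ["halve then add", "add then halve", "use algebra", "guess and check"],
    ["halve then add", "halve then add", "use algebra", "use algebra"],
    ["halve then add", "halve then add", "Halve then add", "halve  then add"],
]
ratios = [distinct_ratio(r) for r in rounds]
print(ratios)  # [1.0, 0.5, 0.25]
```

A ratio that falls steadily from round to round is a warning sign that rare strategies are dropping out of the pool.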

Teacher ceiling. Because every trace the student learns from is the teacher's own, the student is unlikely to exceed the teacher's accuracy, even on problems the teacher reliably solves. On tasks where even GPT-4 reasons unreliably - multi-step spatial reasoning, certain combinatorial problems - the rejection sampling filter becomes very aggressive (most traces are discarded), and the resulting fine-tuning set is too small or too narrow to be useful.
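How aggressive the filter has become can be measured before any fine-tuning. The sketch below computes two numbers from per-question filter outcomes: the share of sampled traces that survive, and the share of questions with at least one surviving trace. The function name, metric names and toy data are assumptions for illustration.

```python
def filter_yield(results):
    """results maps each question to a list of booleans, one per sampled trace,
    saying whether that trace reached the gold answer."""
    total = sum(len(flags) for flags in results.values())
    kept = sum(sum(flags) for flags in results.values())
    covered = sum(1 for flags in results.values() if any(flags))
    return {
        "keep_rate": kept / total if total else 0.0,
        "coverage": covered / len(results) if results else 0.0,
    }


results = {
    "easy": [True, True, True, False],
    "medium": [True, False, False, False],
    "hard": [False, False, False, False],
}
stats = filter_yield(results)
print(stats)
```

In this toy case a third of the traces survive and the hard question contributes nothing, so the fine-tuning set is skewed toward problems the teacher already finds easy.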

Further reading