Rejection Sampling Fine-Tuning and STaR

Meta's Llama 2 Chat models were trained in part on sampled model outputs: several completions were drawn per prompt, ranked by a reward model, and the best ones fed back in as fine-tuning data, with the smaller models learning from samples produced by the largest one. That pipeline is called rejection sampling fine-tuning (RFT). About a year earlier, a Stanford-led paper called STaR (Zelikman et al., 2022) had shown a simpler version of the same idea that needs no annotated reasoning: let a model attempt a reasoning problem, check whether it got the answer right, and if so, treat its own chain-of-thought trace as labelled training data. Both methods rest on the same statistical intuition: if you sample enough completions, correct solutions often already exist in the model's distribution; the problem is extracting them cheaply.

What rejection sampling fine-tuning actually does

Given a dataset D of prompt-completion pairs (x, y) and a policy model π, the standard supervised fine-tuning (SFT) objective is:

L_SFT = -E_{(x,y) ~ D}[log π(y | x)]

where y is a human-written completion. In RFT the human-written y is replaced by model-generated completions filtered through a verifier:

D_RFT = { (x, y_i) : y_i ~ π(· | x),  V(x, y_i) = 1 }

V is a verifier - either a learned reward model or (much cheaper) a ground-truth answer checker for tasks with deterministic correctness like maths or code. You sample k completions per prompt, discard those where V=0, and fine-tune on what remains.
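
The sample-then-filter loop can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `build_rft_dataset`, `toy_policy`, and `answer_checker` are hypothetical names, and the "policy" is a random stand-in that returns the right answer about 30% of the time.

```python
import random

def build_rft_dataset(prompts, sample_fn, verify_fn, k=8, seed=0):
    """Sample k completions per prompt and keep those the verifier accepts."""
    rng = random.Random(seed)
    dataset = []
    for x in prompts:
        for _ in range(k):
            y = sample_fn(x, rng)      # y_i ~ pi(. | x)
            if verify_fn(x, y):        # keep only V(x, y_i) = 1
                dataset.append((x, y))
    return dataset

# Toy stand-ins (assumptions for illustration only).
GOLD = {"2+3": "5", "4*6": "24"}

def toy_policy(x, rng):
    # Stand-in for a model: correct about 30% of the time, otherwise answers 0.
    if rng.random() < 0.3:
        return f"... so the answer is {GOLD[x]}"
    return "... so the answer is 0"

def answer_checker(x, y):
    # Ground-truth checker: compare the last whitespace-separated token to the gold answer.
    return y.rsplit(" ", 1)[-1] == GOLD[x]

d_rft = build_rft_dataset(list(GOLD), toy_policy, answer_checker, k=20)
```

In a real setup `sample_fn` would call the model at a non-zero temperature and `verify_fn` would be either an answer checker like this one or a thresholded reward-model score.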

Yuan et al. (2023) applied exactly this to GSM8K maths problems with LLaMA models including the 7B and 13B variants. Sampling k=100 chains per question, keeping only those that reach the correct final answer, and pooling the accepted samples from several models, they moved LLaMA-7B from 35.9% (plain SFT) to 49.3% accuracy on GSM8K without any additional human annotation. Two observations hold consistently across later work:

Diversity matters more than volume. Collecting 100 distinct reasoning paths to the same answer outperforms collecting 100 near-identical paths. The fine-tuning signal comes from learning different routes to correctness.
Weaker models gain more. A model whose pass@1 is 20% has lots of room to move; a model at 80% will see diminishing returns because most of its filtered samples are already close to its current distribution.
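
One way to favour distinct reasoning paths over near-duplicates is to deduplicate accepted traces on the equations they use rather than on their exact wording. The sketch below assumes traces contain simple arithmetic equations; the regex and the function names are illustrative choices, not taken from any paper's released code.

```python
import re

NUM = r"\d+(?:\.\d+)?"
EQUATION = re.compile(rf"{NUM}(?:\s*[-+*/]\s*{NUM})+\s*=\s*{NUM}")

def equation_signature(trace):
    """Tuple of the equations in a trace, with whitespace removed."""
    return tuple(re.sub(r"\s+", "", eq) for eq in EQUATION.findall(trace))

def keep_distinct_paths(traces):
    """Keep the first trace seen for each distinct equation signature."""
    seen = set()
    kept = []
    for trace in traces:
        signature = equation_signature(trace)
        if signature not in seen:
            seen.add(signature)
            kept.append(trace)
    return kept

traces = [
    "She has 3 + 4 = 7 apples, then 7 * 2 = 14.",
    "First 3+4=7. Doubling gives 7*2=14.",  # same equations, different wording
    "Doubling each: 3 * 2 = 6 and 4 * 2 = 8, so 6 + 8 = 14.",
]
distinct = keep_distinct_paths(traces)  # keeps the first and third trace
```

The second trace is dropped because it follows the same route as the first; the third survives because it reaches 14 by a different sequence of steps.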

STaR: self-taught reasoner

STaR (Zelikman et al., 2022) applies the idea to settings where you have no annotated reasoning traces, only a labelled final answer for each question. The algorithm:

1. Sample a chain-of-thought rationale r and final answer a from the current model for each training question q.
2. If a matches the gold label, add (q, r, a) to the fine-tuning set.
3. For questions where the model failed, provide the gold answer as a hint and ask the model to explain why that answer is correct; this step is called rationalisation. If the hinted attempt reaches the gold answer, add its rationale too, stored without the hint.
4. Fine-tune on the accumulated set, producing a new policy.
5. Repeat from step 1.

The rationalisation step is subtle. Without it, early iterations only accumulate data from problems the model can already solve, which leaves hard problems permanently outside the training distribution. By prompting "given that the answer is X, explain why", the model can often reconstruct a coherent reasoning chain even when it could not have generated X unprompted. This fills gaps in the curriculum.
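
The data-collection half of one STaR round, including the rationalisation fallback, can be sketched as follows. `generate` and `rationalise` are hypothetical callables standing in for prompted model calls, and the toy versions here are hard-coded so the example runs without a model.

```python
def star_iteration(questions, generate, rationalise):
    """Collect (question, rationale, answer) triples for one STaR round."""
    train_set = []
    for q, gold in questions:
        rationale, answer = generate(q)
        if answer == gold:
            train_set.append((q, rationale, answer))
            continue
        # Rationalisation: retry with the gold answer supplied as a hint.
        rationale, answer = rationalise(q, hint=gold)
        if answer == gold:
            # The triple is stored without the hint.
            train_set.append((q, rationale, answer))
    return train_set

# Toy stand-ins (assumptions for illustration only).
def toy_generate(q):
    known = {"2+2": "4"}  # the only question the "model" solves unaided
    return ("by addition", known.get(q, "?"))

def toy_rationalise(q, hint):
    if q == "3*3":  # the hint is enough for this one
        return ("three groups of three", hint)
    return ("unclear", "?")

questions = [("2+2", "4"), ("3*3", "9"), ("17*23", "391")]
round_data = star_iteration(questions, toy_generate, toy_rationalise)
```

Here the first question is kept from the unhinted attempt, the second is recovered through rationalisation, and the third contributes nothing this round.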

Zelikman et al. showed that iterating STaR on CommonsenseQA performed comparably to a fine-tuned model 30 times larger. The key leverage is that the model's own sampling budget replaces human annotators for the majority of examples.

The connection to RL: why this is easier, and what it sacrifices

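Rejection sampling fine-tuning is sometimes described as a simplified form of REINFORCE. Writing J for the expected reward (which is maximised, unlike the loss L_SFT, which is minimised), the REINFORCE gradient is:

∇J = E_{y ~ π(· | x)}[r(x, y) · ∇ log π(y | x)]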

RFT approximates this by using a binary reward (correct/incorrect) and throwing away the negative samples rather than penalising them. This has three practical consequences:

No explicit penalty on failures. Because failed samples are simply discarded, the model receives no signal pushing it away from near-misses. This is gentler than a policy-gradient update with negative rewards but also less informative.
No policy drift correction. Standard RL for LLMs (PPO, GRPO) typically includes a KL penalty to prevent the policy from wandering too far from a reference model. RFT has no such guard; after several rounds of iteration the distribution can shift significantly, which is both a feature (capability gain) and a risk (see below).
Simpler engineering. You need a verifier and a fine-tuning loop, not a separate critic/reward-model training pipeline, a PPO trainer, and careful hyperparameter tuning across two models simultaneously.
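
The link between the two updates can be checked numerically. The sketch below uses a toy softmax policy over four possible completions (an assumption made purely for illustration) and shows that the REINFORCE estimate with a 0/1 reward equals the log-likelihood gradient summed over the accepted samples, up to a 1/k factor. The match holds for the first gradient step only: RFT then keeps training on samples drawn from the old policy, whereas REINFORCE would resample.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, y):
    """Gradient of log softmax(logits)[y] w.r.t. the logits: onehot(y) - probs."""
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]

logits = [0.2, -0.5, 1.0, 0.0]  # toy policy over 4 possible completions
correct = {1, 3}                # completions the verifier accepts
probs = softmax(logits)

rng = random.Random(0)
k = 1000
samples = rng.choices(range(len(logits)), weights=probs, k=k)

# REINFORCE estimate with binary reward r(y) in {0, 1}.
reinforce = [0.0] * len(logits)
for y in samples:
    r = 1.0 if y in correct else 0.0
    for i, g in enumerate(grad_log_prob(probs, y)):
        reinforce[i] += r * g / k

# Log-likelihood gradient summed over accepted samples only (the RFT direction).
accepted = [y for y in samples if y in correct]
rft = [0.0] * len(logits)
for y in accepted:
    for i, g in enumerate(grad_log_prob(probs, y)):
        rft[i] += g

# The two agree up to the 1/k scaling.
assert all(abs(a - b / k) < 1e-9 for a, b in zip(reinforce, rft))
```

Rejected samples contribute exactly zero to both quantities, which is the sense in which RFT "throws away" the negatives instead of penalising them.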

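DeepSeek-R1's published training pipeline (2025) placed a rejection-sampling stage between RL phases: after a first round of reasoning-focused RL, completions sampled from that checkpoint were filtered and used as supervised fine-tuning data before a further RL phase. Alternating the two is now a common pattern: rejection sampling distils what the current policy can already do into clean training data, and RL then pushes the policy further against the reward signal. The two methods are complementary rather than competitive.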

Iterating without collapse

Running several rounds of RFT or STaR raises a fundamental question: does the model improve monotonically, or does it degrade?

In practice the balance between gain and risk shifts as rounds accumulate, roughly as follows:

Round	What you gain	What you risk
1-3	High-quality diverse traces for problems near the capability boundary	Almost nothing; early rounds are safe
4-8	Harder problems start to be solved; distribution shifts	Repetitive reasoning patterns if diversity is not enforced
9+	Marginal gains only; the model has often saturated the verifier's task coverage	Mode collapse onto a single "house style" of reasoning; over-optimisation on verifiable proxies

The key mitigation is diversity pressure. Yuan et al. found that pooling rejection samples from multiple models produced larger accuracy gains than drawing more samples from a single model. STaR's rationalisation step serves a similar function: it injects paths the current policy would never sample spontaneously.

A subtler failure mode is verifier saturation: once the model learns to game the specific format the verifier checks (e.g., always outputting the exact answer token pattern the checker expects), rejection sampling becomes selection on surface form rather than reasoning quality. This is most acute when the verifier is a learned reward model rather than a ground-truth checker, and it mirrors Goodhart's Law at the data-generation level.

When it falls down

No verifier, no signal. RFT requires some way to judge a completion correct or incorrect. For open-ended tasks - summarisation, instruction following, creative writing - there is no cheap ground-truth checker. You need a reward model, which introduces its own failure modes and training cost. STaR's rationalisation sidestep only helps when at least a gold final answer is available.

Capability ceiling. If pass@k for large k is near zero, the model cannot generate any correct completions to collect. RFT produces nothing useful from tasks the model is fundamentally unable to solve. You need to either start from a stronger base model or use a teacher model to seed the initial correct solutions.
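
A quick way to check whether a task sits below the capability ceiling is to estimate pass@k from a batch of samples before committing to a full RFT run. The sketch below uses the standard unbiased estimator introduced with Codex (Chen et al., 2021): given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). It assumes k <= n.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct (k <= n)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=0, k=10))  # no correct samples at all -> 0.0
print(pass_at_k(n=10, c=5, k=1))    # half the samples correct  -> 0.5
```

If the estimate stays at zero for the largest k you can afford, rejection sampling has nothing to select from on that task.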

Distribution narrowing over iterations. Each round selects for a narrower slice of completions - the correct ones - and trains the model to produce more of that slice. After many rounds the model may become very good at the task the verifier checks and brittle at slight variants, because the training distribution has been pruned to a corridor around one solution style.
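
One inexpensive way to watch for this narrowing is to track how many of the accepted traces in each round are actually distinct. The metric below is a rough heuristic chosen for illustration (it only normalises case and whitespace), and the example traces are made up.

```python
def distinct_ratio(traces):
    """Fraction of traces that are unique after case/whitespace normalisation."""
    if not traces:
        return 0.0
    normalised = {" ".join(t.lower().split()) for t in traces}
    return len(normalised) / len(traces)

# Hypothetical accepted traces from two successive rounds.
rounds = {
    1: ["add then double", "double then add", "Add then double", "count up by twos"],
    2: ["add then double", "add then double", "add  then double", "double then add"],
}
ratios = {r: distinct_ratio(traces) for r, traces in rounds.items()}
# ratios == {1: 0.75, 2: 0.5}
```

A ratio that falls round after round suggests the filtered data is collapsing onto one solution style, even while accuracy on the verifier's task keeps rising.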

Compute asymmetry. To get a high-quality filtered dataset you need to sample many completions per prompt. At k=100 samples per problem, building the dataset costs roughly 100x the inference compute of generating one completion per problem, and each completion is itself a full chain-of-thought generation rather than a single forward pass. For large models this is a significant infrastructure requirement, not a laptop experiment.
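
To put numbers on the sampling cost, the back-of-envelope calculation below uses the 7,473 problems in the GSM8K training split and k=100. The average trace length of 150 tokens is an assumption for illustration, not a measured figure.

```python
def sampling_cost(num_prompts, k, avg_tokens_per_completion):
    """Return (number of completions, total generated tokens) for one sampling pass."""
    completions = num_prompts * k
    return completions, completions * avg_tokens_per_completion

# 7,473 = GSM8K training problems; 150 tokens per trace is an assumed average.
completions, tokens = sampling_cost(num_prompts=7473, k=100, avg_tokens_per_completion=150)
# completions == 747_300, tokens == 112_095_000

baseline_completions, baseline_tokens = sampling_cost(7473, 1, 150)
# baseline_tokens == 1_120_950, i.e. 100x fewer generated tokens
```

Every additional round of iteration repeats this cost, since the samples have to be redrawn from the updated policy.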

Rationalisation leaks. In STaR, rationalised traces are conditioned on the correct answer. There is a real risk that the model learns to reverse-engineer plausible-looking reasoning from the known answer, rather than learning genuine forward reasoning. The resulting traces may pass a surface-level quality check but not improve the model's ability to solve novel problems. Empirically this seems to be manageable in the first few rounds, but is hard to audit at scale.
