Self-Play and Self-Improvement

AlphaGo Zero never saw a single human game. Starting from random play and competing only against itself, it surpassed every earlier version of AlphaGo, including the ones that had beaten top human professionals, within 40 days of training. The same core intuition - a model improving by playing against itself - now runs through some of the most productive data-generation pipelines in language modelling. The question is whether the intuition transfers cleanly, or whether language models carry failure modes that board games do not.

What the loop actually looks like

The basic self-improvement pipeline has three moving parts: a generator, a verifier, and a data filter.

The generator is usually the model itself. The verifier can be the same model (self-critique), a reward model trained on human preferences, a symbolic checker (unit tests, a maths solver), or a constitution of written principles. The filter discards weak candidates, leaving a training set biased toward behaviour the verifier rewards.

One training pass over that set nudges the generator toward the verified distribution, which ideally raises the floor of future generations. Repeat.
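The generate-verify-filter part of the loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not any particular system's pipeline: `self_improvement_round`, `toy_generate`, `toy_verify`, and the acceptance threshold are all hypothetical names and values, and the fine-tuning step that would consume the result is left out.

```python
import random
from typing import Callable, List, Tuple

def self_improvement_round(
    prompts: List[str],
    generate: Callable[[str], str],        # hypothetical: samples one candidate
    verify: Callable[[str, str], float],   # hypothetical: scores (prompt, candidate)
    samples_per_prompt: int = 4,
    threshold: float = 0.5,                # assumed acceptance cut-off
) -> List[Tuple[str, str]]:
    """Generate candidates, score them, and keep those the verifier accepts."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            candidate = generate(prompt)
            if verify(prompt, candidate) >= threshold:  # the data filter
                kept.append((prompt, candidate))
    return kept

# Toy stand-ins: a noisy adder as "generator", exact checking as "verifier".
rng = random.Random(0)

def toy_generate(prompt: str) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))  # sometimes off by one

def toy_verify(prompt: str, candidate: str) -> float:
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(candidate) == a + b else 0.0

training_set = self_improvement_round(["2+3", "10+7"], toy_generate, toy_verify)
```

In a real pipeline `training_set` would feed one fine-tuning pass, and the updated model would replace `generate` for the next round.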

In practice the components are mixed and matched:

Variant	Generator	Verifier	Example system
Rejection sampling fine-tuning (RFT)	Base model	Symbolic solver	GSM8K / MATH pipelines
Self-Rewarding LM	Instruct model	Same model via LLM-as-judge prompt	Yuan et al., ICML 2024
Constitutional AI	Supervised model	Same model + written principles	Anthropic Claude training
Distillation	Teacher LM	Teacher logits / outputs	Alpaca, Orca-style models

Rejection sampling fine-tuning

Rejection sampling fine-tuning (RFT) is the cleanest variant to reason about. For a problem set with ground-truth answers, you:

1. Sample k solutions per problem from the current model.
2. Execute or verify each solution; label it correct or incorrect.
3. Discard incorrect solutions.
4. Fine-tune on the correct ones.
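The first three steps might look like the sketch below. It is a minimal illustration, not the procedure of any specific paper: `sample` is a hypothetical callable standing in for the model, the "last number in the text" answer extraction is a simplifying assumption, and the exact-match deduplication is an extra choice made here to avoid training on identical solutions.

```python
import re
from typing import Callable, Dict, List, Optional

def extract_final_answer(solution: str) -> Optional[str]:
    """Return the last number in the solution text, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

def rejection_sample(
    problems: List[Dict[str, str]],           # each: {"question": ..., "answer": ...}
    sample: Callable[[str, int], List[str]],  # hypothetical: k solutions per question
    k: int = 8,
) -> List[Dict[str, str]]:
    """Sample k solutions per problem, keep the correct ones, drop duplicates."""
    kept = []
    for problem in problems:
        seen = set()
        for solution in sample(problem["question"], k):
            # String comparison: "7" and "7.0" would count as different answers.
            if extract_final_answer(solution) != problem["answer"]:
                continue                      # discard incorrect solutions
            if solution in seen:
                continue                      # assumption: exact-match dedup
            seen.add(solution)
            kept.append({"question": problem["question"], "solution": solution})
    return kept                               # fine-tuning would consume `kept`

# Toy sampler returning canned solutions instead of calling a model.
def toy_sample(question: str, k: int) -> List[str]:
    pool = ["3 + 4 = 7", "4 + 3 = 7", "3 + 4 = 8", "3 + 4 = 7"]
    return pool[:k]

data = rejection_sample([{"question": "What is 3 + 4?", "answer": "7"}], toy_sample, k=4)
```

Here `data` ends up with the two distinct correct solutions; the wrong answer and the repeated solution are dropped.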

Yuan et al. (2023) applied this to LLaMA-7B on GSM8K, a benchmark of grade-school maths word problems. By combining rejection-sampled solutions from multiple models, they pushed accuracy from 35.9% (supervised fine-tuning alone) to 49.3%. The gain comes from coverage: more samples surface correct reasoning paths the model can only occasionally find, and fine-tuning compresses those paths into the weights.
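The coverage argument has simple arithmetic behind it. As a rough sketch, assume each sample is independently correct with probability p (a simplification; real samples from one model are correlated). Then the chance that at least one of k samples is correct is 1 - (1 - p)^k. The function name `coverage` is introduced here for illustration.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

# A problem the model solves only 5% of the time per sample:
for k in (1, 4, 16, 64):
    print(f"k={k:>2}  coverage={coverage(0.05, k):.3f}")
```

Even a rarely found solution becomes likely to appear once k is large, which is what gives the filter something to keep.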

The critical dependency is a reliable verifier. Grade-school maths problems have one: compare the final numeric answer with the reference answer. Without that oracle, the model has to judge its own output, which is where things get more complicated.

The constitutional loop

Constitutional AI (CAI), introduced by Anthropic in 2022, bootstraps a verifier from a list of natural-language principles - the "constitution." The supervised learning (SL) phase works as follows:

1. Prompt a helpful-only model (trained to be helpful but not yet harmless) to respond to a red-team prompt.
2. Prompt the same model to critique that response against a constitutional principle (e.g., "Does this response assist with something harmful?").
3. Prompt it to revise the response based on its own critique.
4. Repeat critique-and-revise for several rounds.
5. Use the final revision as supervised training data.
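The critique-and-revise steps above can be sketched as a short loop. Everything here is illustrative: `model` is a hypothetical prompt-in, text-out callable, the prompt wording is invented, and cycling through the principles in order is an assumption made to keep the example deterministic.

```python
from typing import Callable, List

def critique_and_revise(
    model: Callable[[str], str],   # hypothetical: prompt in, completion out
    red_team_prompt: str,
    principles: List[str],
    rounds: int = 2,
) -> str:
    """Return the final revision after `rounds` critique-and-revise passes."""
    response = model(red_team_prompt)
    for i in range(rounds):
        principle = principles[i % len(principles)]  # assumption: cycle in order
        critique = model(
            f"Prompt: {red_team_prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = model(
            f"Prompt: {red_team_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response

# Stub model that records its prompts, so the control flow can be inspected.
calls: List[str] = []

def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"output-{len(calls)}"

final = critique_and_revise(
    stub_model, "example prompt", ["Avoid assisting with harm."], rounds=2
)
```

With two rounds the stub is called five times: once for the initial response, then one critique and one revision per round. The pair (`red_team_prompt`, `final`) is what would go into the supervised training set.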

The RL phase (RLAIF) replaces human preference labels with AI preference labels: the model is shown pairs of responses and picks the one that better satisfies the constitution, and those preferences train a reward model whose score is the reward signal for PPO.
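A rough sketch of the data flow in this phase follows. The text above does not specify how the reward model is trained; the pairwise logistic (Bradley-Terry-style) loss below is a common choice for preference models and is an assumption here. `label_pairs`, `pairwise_loss`, and `toy_judge` are hypothetical names, and the toy judge's "prefer the shorter response" rule is a meaningless placeholder for an actual model call.

```python
import math
from typing import Callable, List, Tuple

def label_pairs(
    pairs: List[Tuple[str, str, str]],       # (prompt, response_a, response_b)
    judge: Callable[[str, str, str], str],   # hypothetical: returns "A" or "B"
) -> List[Tuple[str, str, str]]:
    """Turn AI judgements into (prompt, chosen, rejected) triples."""
    labelled = []
    for prompt, a, b in pairs:
        if judge(prompt, a, b) == "A":
            labelled.append((prompt, a, b))
        else:
            labelled.append((prompt, b, a))
    return labelled

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Placeholder judge; a real one would prompt the model with the constitution.
def toy_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"

triples = label_pairs([("p", "long answer", "short")], toy_judge)
```

The loss is small when the reward model already scores the chosen response above the rejected one, and grows as the ordering is violated.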

This removes a labour bottleneck. Scaling human annotation for harmlessness is expensive; scaling a principled self-critique loop costs only compute. The price is that quality is bounded by the model's ability to apply the constitution consistently, which itself depends on the model's existing capabilities.

Distillation as a self-play-adjacent technique

Knowledge distillation is slightly different: a weaker student model is trained on the outputs of a stronger teacher, rather than the model training on its own outputs. But it sits inside the same paradigm of synthetic data because the teacher's outputs are generated rather than human-labelled.

Stanford Alpaca (2023) fine-tuned the 7B-parameter LLaMA model on 52,000 instruction-following examples generated by text-davinci-003. Generating the data cost under $500 in API calls. The resulting model behaved surprisingly like a much larger instruction-tuned model on casual benchmarks, though the effect did not hold on rigorous evals of reasoning or factual accuracy.

The lesson from distillation is that instruction following transfers cheaply while deep reasoning does not. A student trained on teacher outputs can mimic the surface form of expert responses without inheriting the underlying capability - a phenomenon sometimes called "shallow alignment."

When it falls down

Reward hacking and overoptimisation. When the verifier is imperfect, the generator can find outputs that score well without being genuinely better. A reward model trained on human preferences is itself an approximation; optimise against it long enough and the generator will produce outputs that exploit its blind spots rather than improve in the intended sense. The relationship between proxy reward and true quality degrades as optimisation pressure increases (Goodhart's Law in a statistical setting).
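A toy simulation makes the divergence concrete. It assumes a deliberately crude setup that is not taken from any cited work: each candidate has a true quality drawn from a standard normal, the verifier sees that quality plus independent standard-normal noise, and optimisation pressure is modelled as best-of-n selection on the noisy score. The function name `best_of_n` is hypothetical.

```python
import random
from typing import Tuple

def best_of_n(n: int, trials: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Average (proxy score, true quality) of the candidate the verifier picks."""
    rng = random.Random(seed)
    proxy_total = true_total = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            true_quality = rng.gauss(0.0, 1.0)
            proxy = true_quality + rng.gauss(0.0, 1.0)  # imperfect verifier
            candidates.append((proxy, true_quality))
        proxy, true_quality = max(candidates)           # select on proxy score
        proxy_total += proxy
        true_total += true_quality
    return proxy_total / trials, true_total / trials

for n in (1, 4, 16, 64):
    proxy, true_quality = best_of_n(n)
    print(f"n={n:>2}  proxy={proxy:.2f}  true={true_quality:.2f}")
```

In this toy, true quality still rises with n, but the proxy score rises faster, so the verifier increasingly overstates how good the selected outputs are. Real reward models can fail worse than this, since their errors are not independent noise.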

Mode collapse in the data distribution. Each filtering step throws away low-scoring outputs. Over many iterations this narrows the distribution of the training data toward whatever the verifier rewards most. The model loses stylistic range, fails on unusual but valid requests, and can drift toward verbose or sycophantic outputs if the reward signal correlates with length or agreement.
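The narrowing can be seen in a toy model. The sketch assumes outputs are summarised by a single number drawn from a Gaussian, the verifier's score is that number, and each round keeps the top half and refits the generator to the survivors. None of this comes from a real training run; `iterate_filtering` is a hypothetical name.

```python
import random
import statistics
from typing import List

def iterate_filtering(
    rounds: int = 5, population: int = 2000, keep: float = 0.5, seed: int = 0
) -> List[float]:
    """Track the spread of the generator's distribution across filtering rounds."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    spreads = [std]
    for _ in range(rounds):
        samples = sorted(rng.gauss(mean, std) for _ in range(population))
        survivors = samples[int(population * (1 - keep)):]  # keep top-scoring part
        mean = statistics.fmean(survivors)                  # "fine-tune" on survivors
        std = statistics.pstdev(survivors)
        spreads.append(std)
    return spreads

spreads = iterate_filtering()
```

Each round shrinks the spread by a roughly constant factor, so diversity decays geometrically unless fresh data or an explicit diversity term is added.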

Self-consistency as a ceiling. A model critiquing its own outputs is bounded by its own blind spots. If the model consistently misunderstands a class of queries, its critiques of responses to those queries will also be wrong, and fine-tuning on the revised responses entrenches the misunderstanding. Constitutional AI mitigates this somewhat by offloading the normative judgement to written principles, but the model must still interpret and apply those principles.

Evaluation contamination. Benchmark scores inflated by self-play can be deceptive. If the generator has seen the test distribution at any point in its training or in the prompts used to generate synthetic data, the loop is sampling near the evaluation manifold rather than genuinely generalising. The GSM8K gains from RFT, for instance, look smaller when evaluated on held-out problem variations.
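One cheap first check is word-level n-gram overlap between synthetic training examples and the test set. The sketch below is a minimal version under obvious assumptions: whitespace tokenisation, lowercasing, and an arbitrary default of n = 8. It catches verbatim leakage only; paraphrased or templated variants of test problems pass straight through. `ngrams` and `contaminated` are names introduced here.

```python
from typing import List, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word-level n-grams of a lowercased, whitespace-tokenised string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(synthetic: List[str], test_set: List[str], n: int = 8) -> List[int]:
    """Indices of synthetic examples sharing at least one n-gram with a test item."""
    test_ngrams: Set[Tuple[str, ...]] = set()
    for item in test_set:
        test_ngrams |= ngrams(item, n)
    return [i for i, ex in enumerate(synthetic) if ngrams(ex, n) & test_ngrams]

flagged = contaminated(
    ["Tom has 3 apples and buys 4 more", "A train leaves at noon"],
    ["tom has 3 apples and eats one"],
    n=3,
)
```

Here only the first synthetic example is flagged, because it shares the opening words of the test item.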

Collapse under no ground truth. RFT works because arithmetic has a verifier. For open-ended tasks - summarisation, creative writing, nuanced advice - there is no oracle. The verifier has to be a learned reward model or the model's own critique, which reintroduces the reward-hacking and self-consistency limitations above. Process reward models (Lightman et al., 2023) attempt to tighten the signal for multi-step reasoning by scoring reasoning steps rather than final answers, but collecting step-level annotations is expensive, and learned step verifiers can themselves be fooled.
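To show how step-level scores can be used to rank candidates, here is a small sketch. It assumes a process reward model has already assigned each reasoning step a probability of being correct, and it aggregates them by taking the product, which is one common choice; the numbers and the names `solution_score` and `pick_best` are made up for illustration.

```python
import math
from typing import List

def solution_score(step_probs: List[float]) -> float:
    """Score a solution as the product of its per-step correctness probabilities."""
    return math.prod(step_probs)

def pick_best(candidates: List[List[float]]) -> int:
    """Index of the candidate solution with the highest aggregated score."""
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))

# Candidate 0 has one weak step; candidate 1 is uniformly decent.
best = pick_best([[0.9, 0.9, 0.2], [0.8, 0.8, 0.8]])
```

The product penalises a single doubtful step heavily, so the uniformly decent solution wins here, but the ranking is only as trustworthy as the step scores themselves.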
