Evaluating RL-Tuned Models

When OpenAI published results for their RLHF-trained summarisation model in 2020, they noted something uncomfortable: optimising against the learned reward signal produced summaries that humans actually preferred, yet pushing that same optimisation too far produced text that scored highly on the reward model while humans rated it worse. The proxy had become the target, and the target had quietly moved. That tension sits at the heart of evaluating any RL-tuned model.

Why Standard Benchmarks Mislead

Pre-training and supervised fine-tuning (SFT) benchmarks such as MMLU, HellaSwag, or HumanEval are designed to probe raw knowledge and surface-level instruction following. They assume a relatively well-behaved distribution shift between training and evaluation. RL post-training breaks that assumption in at least three ways.

Distribution shift at the output level. RL optimises over the policy's own generated text, not reference completions. After several thousand gradient steps the model's output distribution may sit far from the SFT checkpoint. A benchmark that probes fill-in-the-blank-style accuracy may not sample from the region where the model now lives.
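
One way to put a number on how far the policy has drifted is a Monte Carlo estimate of the KL divergence from the SFT checkpoint. The sketch below assumes you have already sampled completions from the RL policy and scored each one under both models; the function name and the convention of passing summed token log-probabilities are illustrative assumptions, not part of any particular library.

```python
from statistics import mean


def estimate_kl(policy_logprobs, sft_logprobs):
    """Monte Carlo estimate of KL(policy || SFT), in nats per completion.

    Both lists hold the summed token log-probabilities of the SAME
    completions. The completions must have been sampled from the RL
    policy, otherwise the average is not an estimate of this KL.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("need one SFT log-prob per policy log-prob")
    if not policy_logprobs:
        raise ValueError("need at least one sample")
    # E_{y ~ policy}[log policy(y|x) - log sft(y|x)]
    return mean(p - s for p, s in zip(policy_logprobs, sft_logprobs))
```

A large estimate suggests the benchmark prompts may be probing a region of output space the policy rarely visits any more; with few samples the estimate is noisy and can even come out negative.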

Reward model entanglement. If the benchmark was used to inform reward model training or was included in the preference dataset, then high benchmark accuracy may reflect contamination rather than capability. This is not hypothetical: several chat-model leaderboards have documented steep drops in score when new, unseen benchmark variants are introduced.
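
A crude first check for this kind of entanglement is word-level n-gram overlap between benchmark items and the preference data. The sketch below is one common heuristic, not a method described in this article; the function names and the default of eight-word n-grams are assumptions, and paraphrased or translated leaks will slip past it.

```python
def ngrams(text, n=8):
    """Set of lowercase, whitespace-tokenised word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(benchmark_items, training_texts, n=8):
    """Indices of benchmark items sharing any n-gram with the training texts."""
    seen = set()
    for text in training_texts:
        seen |= ngrams(text, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & seen]
```

Flagged items are candidates for manual review or exclusion. An empty result does not prove the benchmark is clean.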

Gaming via verbosity or hedging. A policy optimised with RL against a human-preference reward model can learn quickly that hedged, verbose answers tend to be rated more favourably, largely independent of correctness. A model that adds qualifications and filler around a wrong answer can score higher on LLM-as-judge evaluations than a concise, correct competitor. Dubois et al. (2024) quantified this directly for AlpacaEval, showing a strong positive correlation between output length and win-rate.
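
A quick diagnostic for length bias is to split pairwise judgements by whether the model's answer was longer than the baseline's and compare win-rates across the two buckets. This is a deliberately simple sketch, not the length-controlled method of Dubois et al.; the tuple layout and function name are assumptions made for illustration.

```python
def win_rate_by_length(comparisons):
    """Win-rate split by whether the model's answer was the longer one.

    comparisons: iterable of (model_len, baseline_len, model_won) tuples,
    where lengths are token or character counts and model_won is a bool.
    """
    longer, not_longer = [], []
    for model_len, baseline_len, model_won in comparisons:
        bucket = longer if model_len > baseline_len else not_longer
        bucket.append(bool(model_won))

    def rate(wins):
        # None signals an empty bucket rather than a 0% win-rate.
        return sum(wins) / len(wins) if wins else None

    return {"longer": rate(longer), "not_longer": rate(not_longer)}
```

A large gap between the two buckets suggests the judge may be rewarding length rather than quality, and that the headline win-rate deserves a length-matched re-run.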

The Evaluation Hierarchy

A practical evaluation stack for RL-tuned models has three layers, each checking a different failure mode.

Layer 1: Verifiable correctness on held-out sets

For domains where ground truth exists (mathematics, code, formal reasoning), use execution-based or proof-checked evaluation. AIME problems, competition-grade coding benchmarks such as LiveCodeBench, and formal verification suites give a binary signal: the answer is right or it is not. These benchmarks are the hardest to game through verbosity or style because there is no LLM judge to fool.
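
For code, the binary signal can be as simple as "did the candidate plus its tests exit cleanly". The sketch below runs both in a fresh interpreter with a timeout; the function name and parameters are hypothetical, and a subprocess is not a security sandbox, so real harnesses should isolate untrusted model output more strictly (containers, resource limits, no network).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code, test_code, timeout_s=5.0):
    """Return True only if candidate_code followed by test_code exits with 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False
    return result.returncode == 0
```

A failed assertion, an uncaught exception, or a timeout all map to the same outcome, so no amount of explanation around the code can change the score.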

Critically, the held-out set must be temporally separated from the training data of both the policy and the reward model. Problems released before the relevant training cutoff are potential contaminants regardless of whether the developers consciously included them.
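
Temporal separation is straightforward to enforce when each problem carries a release date. The sketch below assumes a hypothetical record layout with `id` and `released` fields, and takes the later of the policy and reward model data cutoffs as the threshold.

```python
from datetime import date


def temporally_held_out(problems, policy_cutoff, reward_model_cutoff):
    """Keep problems released strictly after both training cutoffs."""
    cutoff = max(policy_cutoff, reward_model_cutoff)
    return [p for p in problems if p["released"] > cutoff]


# Illustrative records; the ids and dates are made up.
problems = [
    {"id": "p1", "released": date(2023, 11, 2)},
    {"id": "p2", "released": date(2024, 6, 15)},
    {"id": "p3", "released": date(2024, 9, 1)},
]
held_out = temporally_held_out(
    problems,
    policy_cutoff=date(2024, 3, 1),
    reward_model_cutoff=date(2024, 7, 1),
)
```

A release date only bounds when a problem became public; a problem that restates older material can still be contaminated.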

Layer 2: Human preference versus reward model preference

The reward model is a proxy for human preferences, not a ground truth. The gap between the two is the over-optimisation gap. Measuring it requires human evaluation on a sample of model outputs after RL training. Gao et al. (2022) studied this gap in a synthetic setup where a larger "gold" reward model stands in for human labels, and found that it follows a predictable scaling law: proxy reward keeps increasing while gold reward peaks and then declines as KL divergence from the initial policy grows.
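
In practice the gap can be tracked across saved checkpoints. The sketch below assumes each checkpoint has been scored three ways: KL from the initial policy, mean proxy reward, and mean gold score (human ratings or a stronger held-out judge), all on comparable scales. The dictionary keys and function name are illustrative assumptions.

```python
def overoptimisation_report(checkpoints):
    """Summarise the proxy-versus-gold gap across RL checkpoints.

    checkpoints: list of dicts with 'kl', 'proxy', and 'gold' keys.
    """
    ordered = sorted(checkpoints, key=lambda c: c["kl"])
    best = max(ordered, key=lambda c: c["gold"])
    return {
        # KL at which the gold score peaked.
        "best_kl": best["kl"],
        # (kl, proxy - gold) per checkpoint; a widening gap is the warning sign.
        "gaps": [(c["kl"], c["proxy"] - c["gold"]) for c in ordered],
        # Checkpoints trained beyond the gold peak.
        "past_peak": [c["kl"] for c in ordered if c["kl"] > best["kl"]],
    }
```

Any checkpoint listed under `past_peak` has higher proxy reward than the best one but a lower gold score, which is the signature of over-optimisation described above.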
