
Reasoning Models

Reasoning Evals and the Contamination Problem

A guided tour of the reasoning benchmark canon, why each saturated faster than the field expected, and the move to live held-out evals as the contamination crisis bites.


A benchmark is useful only as long as it is unseen. The dominant story of LLM evaluation in 2023-2025 is that benchmarks designed to last for years were saturated in months, partly because models genuinely got better and partly because the benchmarks leaked into training data. This section maps the current reasoning canon, explains the contamination problem honestly, and points at the live-eval strategies that are replacing the old static suites.

The reasoning canon

A short tour of the benchmarks that currently anchor reasoning claims:

| Benchmark | What it measures | Saturation status |
|---|---|---|
| AIME (2024, 2025) | US high-school maths-olympiad qualifier problems | o1/R1/o3 at 80-90%+ pass@1 |
| MATH (Hendrycks et al, 2021) | 12.5k competition maths problems across 5 levels | MATH-500 effectively saturated (97%+) |
| GPQA (Rein et al, 2023) | PhD-level multiple-choice science questions, google-proof by construction | GPQA Diamond ~75% (human PhD baseline ~65%) |
| FrontierMath (Epoch, 2024) | New, unpublished, expert-authored frontier maths | Still hard (~25-35% best as of late 2025), regarded as the live maths frontier |
| ARC-AGI (Chollet, 2019; v2/v3 ongoing) | Few-shot grid reasoning puzzles designed to resist training-set memorisation | ARC-AGI-1 broken by o3-high (87.5%); ARC-AGI-2 still unbeaten at parity |
| SWE-bench Verified (2024) | Real GitHub issues with verifiable patch tests | Top reasoning models 50-65% |
| LiveCodeBench (2024) | Coding problems annotated with release dates for contamination control | Used as a contamination-resistant coding eval |

The pattern: a benchmark drops, the field treats it as a frontier marker, a year later one or two models post numbers within noise of the human baseline, and the field moves on to the next one. AIME 2024 was supposed to last; o1 cleared it within the year. MATH was supposed to last; saturated. GPQA was supposed to last; cracked in a year.

Why they saturate faster than expected

Three causes, in roughly increasing seriousness:

  1. Real capability gains. Reasoning models really are better at competition maths than the previous generation. Some of the jump is genuine.
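  2. Test-time compute makes accuracy a sliding scale. A model that solves 40% of AIME at pass@1 might reach 85% with 64 samples and majority voting (self-consistency, often written cons@64 or maj@64), and pass@64, which counts a problem as solved if any one of the 64 samples is correct, is higher still. The "saturated" headline number is often a maximum-compute number; the cheap-call number is much lower.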
  3. Contamination. The benchmark questions appear, verbatim or paraphrased, in training data scraped from the open web. The model is not solving the problem; it is recognising it.

The third one is the load-bearing methodological issue.
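The gap between the cheap-call and maximum-compute numbers in the second cause can be made concrete. The sketch below is illustrative only: it uses the unbiased pass@k estimator from Chen et al. (2021) and a plain majority vote as a stand-in for self-consistency, and the sampled answers are made up for the example.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote_correct(answers: list[str], reference: str) -> bool:
    """Self-consistency scoring: only the most common sampled answer counts."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference


# Hypothetical problem: 64 samples, 26 agree on the right answer "204",
# the other 38 are scattered across distinct wrong answers.
samples = ["204"] * 26 + [str(i) for i in range(38)]

print(pass_at_k(64, 26, 1))                   # 0.40625
print(pass_at_k(64, 26, 64))                  # 1.0
print(majority_vote_correct(samples, "204"))  # True
```

The same 64 samples yield roughly 41% when scored as a single call and a full pass under either many-sample metric, which is why a headline number needs its sampling budget stated alongside it.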

What contamination actually looks like

Direct leakage:

  • A MATH problem in a Stack Exchange answer.
  • A GSM8K question republished on a tutoring blog.
  • An AIME problem with worked solution on a maths-forum thread.
  • Benchmark training/test split published as a Hugging Face dataset, scraped wholesale.
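Direct leakage of this kind is the easiest to screen for, typically with some form of n-gram overlap between benchmark items and training documents. The following is a minimal sketch under simple assumptions (whitespace-and-punctuation tokenisation, 8-grams, a made-up problem and made-up documents); real decontamination pipelines differ in tokeniser, window size and thresholds.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Made-up benchmark item and documents, for illustration only.
item = ("Find the number of ordered pairs of positive integers "
        "whose product is 2024 and whose sum is even.")
forum_post = ("A forum user asked: find the number of ordered pairs of positive "
              "integers whose product is 2024 and whose sum is even. Here is my solution")
unrelated = "This recipe needs two cups of flour, one egg and a pinch of salt to work well."

print(overlap_fraction(item, forum_post))  # 1.0
print(overlap_fraction(item, unrelated))   # 0.0
```

A check like this only catches verbatim or near-verbatim copies; it says nothing about the paraphrased or translated cases.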

Indirect leakage:

  • A paraphrase or translation of the benchmark question.
  • A textbook chapter that uses the same problem as a worked example.
  • A YouTube transcript walking through the answer.

Diagnostic patterns in benchmark numbers:

  • Sharp drop in performance on problems released after the model's training cutoff (LiveCodeBench's central observation).
  • High pass@1 with low diversity in incorrect answers (the model memorised the canonical answer, not the reasoning).
  • Performance on the public split far exceeding performance on a held-out replication.
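The first diagnostic is straightforward to compute once each problem carries a release date. The sketch below assumes a hypothetical `Result` record and a known training cutoff; both are illustrative, not any benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    released: date   # when the problem was first published
    correct: bool    # whether the model solved it


def cutoff_gap(results: list[Result], cutoff: date) -> tuple[float, float, float]:
    """Return (pre-cutoff accuracy, post-cutoff accuracy, pre minus post)."""
    pre = [r.correct for r in results if r.released <= cutoff]
    post = [r.correct for r in results if r.released > cutoff]
    if not pre or not post:
        raise ValueError("need problems on both sides of the cutoff")
    pre_acc = sum(pre) / len(pre)
    post_acc = sum(post) / len(post)
    return pre_acc, post_acc, pre_acc - post_acc


# Hypothetical results for a model with a mid-2024 training cutoff.
results = [
    Result(date(2024, 1, 10), True), Result(date(2024, 2, 3), True),
    Result(date(2024, 3, 21), True), Result(date(2024, 5, 30), False),
    Result(date(2024, 8, 2), True), Result(date(2024, 9, 14), False),
    Result(date(2024, 10, 8), False), Result(date(2024, 11, 19), False),
]

print(cutoff_gap(results, date(2024, 6, 30)))  # (0.75, 0.25, 0.5)
```

A large positive gap is a warning sign rather than proof: later problems can also simply be harder, so the comparison is most informative when difficulty is roughly matched across the two periods.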

The field has moved from treating contamination as a fringe concern to treating it as the default assumption for any public benchmark older than a year.

ARC-AGI as the standing public challenge

ARC-AGI (Abstraction and Reasoning Corpus) was designed in 2019 by François Chollet specifically to be contamination-resistant: novel, never-seen grid puzzles where the model must induce a transformation rule from a few examples. The puzzles cannot be memorised because each one is unique.

ARC-AGI-1 held up for years before o3-high beat the human baseline at high compute. The ARC Prize Foundation now runs ARC-AGI-2 (still unbeaten at parity with human solve rate) and ARC-AGI-3 (agentic). The benchmark is the closest thing the field has to a standing public challenge for genuinely novel reasoning, and the prize pool plus public leaderboard make it a useful focal point.

The structural reason ARC works: tasks are purpose-built to be novel rather than crowd-sourced from textbooks, and the evaluation sets are held back from the public. Contamination requires the exact test puzzle leaking, which is controllable.
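To make the task format concrete, here is a toy sketch of rule induction over grids: a handful of candidate transformations are tested against the demonstration pairs, and whichever survives is applied to the test input. The candidate set and the grids are invented for illustration; real ARC-AGI tasks need a far richer hypothesis space than three hard-coded functions.

```python
Grid = list[list[int]]


def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    return g[::-1]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical, deliberately tiny hypothesis space.
CANDIDATES = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def induce_rules(train_pairs: list[tuple[Grid, Grid]]) -> list[str]:
    """Names of candidate rules consistent with every demonstration pair."""
    return [
        name for name, fn in CANDIDATES.items()
        if all(fn(inp) == out for inp, out in train_pairs)
    ]


train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]
rules = induce_rules(train_pairs)
print(rules)                                       # ['flip_horizontal']
print(CANDIDATES[rules[0]]([[7, 8], [9, 0]]))      # [[8, 7], [0, 9]]
```

The point of the format is that the answer depends on a rule inferred at test time from the demonstrations, not on having seen the grid before.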

Live and held-out evals

The methodological response to contamination is to make the eval move:

  • LiveBench (White et al, ICLR 2025 Spotlight) - frequently-updated questions from recent maths competitions, arXiv papers and news; automatic objective scoring; monthly refresh. Top models reportedly below 70% as of release.
  • LiveCodeBench - coding problems with explicit release dates, enabling evaluation only on problems released after a model's training cutoff. Built-in contamination guardrail.
  • FrontierMath - expert-authored, unpublished, problem set held privately by Epoch AI; only aggregate scores are reported, never individual problems.
  • Private held-out splits - the UK AISI, the US AISI (now CAISI), and lab-internal evals keep test items secret. Same principle, different governance.

The unifying idea: if the eval can leak, eventually it will. The defence is temporal (only score on post-cutoff items), secrecy-based (never publish the test items), or generative (new items every period).
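The generative defence is the least familiar of the three, so a small sketch may help. It assumes a hypothetical templated item generator seeded by the evaluation period; modular arithmetic is far too easy to serve as a reasoning eval and is used here only to show the refresh mechanism.

```python
import random


def generate_items(period: str, n: int = 5) -> list[tuple[str, int]]:
    """Build n (question, answer) pairs that depend only on the period label."""
    rng = random.Random(period)  # a string seed gives a reproducible stream
    items = []
    for _ in range(n):
        a = rng.randint(100, 999)
        b = rng.randint(100, 999)
        m = rng.randint(7, 97)
        question = f"Compute ({a} * {b}) mod {m}."
        items.append((question, (a * b) % m))
    return items


# Each period gets its own item set; rerunning a period reproduces it exactly,
# so scores stay auditable after the fact.
for question, answer in generate_items("2026-01", n=3):
    print(question, "->", answer)
```

Because items for a new period do not exist until they are generated, they cannot already be in a training corpus; the trade-off is that templated items are only as hard as the template.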

What we are even measuring

A nagging meta-question. Saturation on AIME and MATH was supposed to mean the model can do competition maths. In practice it sometimes means the model has seen the answer. Saturation on GPQA was supposed to mean PhD-level science reasoning. In practice it sometimes means the model has the multiple-choice answer key memorised.

The careful framing:

  • A benchmark score is an upper bound on capability under the conditions tested.
  • A delta over a strong baseline on a contamination-controlled benchmark is meaningful.
  • A headline number on a public, static benchmark older than 18 months is mostly a marketing artefact.
  • The only benchmarks that genuinely measure reasoning capability are the ones where the model has never seen the questions, and that property erodes over time even for well-designed evals.
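Whether a delta over a baseline is meaningful also depends on how many items sit behind it. One common way to check is a paired bootstrap over per-item outcomes, sketched here with hypothetical 0/1 results for two models on the same 100 items; the resample count and the 95% interval are arbitrary choices for the example.

```python
import random


def paired_bootstrap_delta(
    baseline: list[int],
    candidate: list[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Observed accuracy delta plus an approximate 95% percentile interval."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two equal-length, non-empty result lists")
    rng = random.Random(seed)
    n = len(baseline)
    # Per-item differences keep the pairing between the two models.
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = sum(diffs) / n
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-item results (1 = solved) on the same 100 held-out items.
baseline = [1] * 50 + [0] * 50
candidate = [1] * 65 + [0] * 35

observed, lower, upper = paired_bootstrap_delta(baseline, candidate)
print(f"delta={observed:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

If the interval comfortably excludes zero the improvement is unlikely to be sampling noise; it still says nothing about contamination, which is why the benchmark itself has to be controlled first.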

Reasoning evaluation is now a moving practice rather than a fixed scoreboard. Engineers picking models for reasoning workloads should weight LiveBench / LiveCodeBench / FrontierMath / ARC-AGI rankings over the older static suites, and weight their own private internal evals over all of those.

What changed in 2024-2026

The field internalised that public benchmarks decay. New evals ship with explicit contamination defences (release dates, private holdouts, monthly refreshes). Vendor model cards started including post-cutoff eval scores alongside the saturated old ones. ARC-AGI graduated from "interesting puzzle set" to "standing public challenge with a $2M+ prize pool". The honest takeaway: benchmark numbers in 2026 carry less information than they did in 2022, and the engineering response is to invest in your own evals on data the model has never seen.
