Evaluation & MLOps

Public Benchmarks - MMLU, GPQA, HumanEval, MATH

A tour of the academic benchmarks that anchor frontier model launches, and why most of them are saturating, contaminated, or both.

intermediate · 8 min read

Every model card opens with a table of benchmark scores. They are how labs signal progress and how the press writes headlines. They are also, almost without exception, broken in ways that matter for anyone deciding whether to ship a model.

What each one actually measures

| Benchmark | Format | What it probes | Frontier state (2025-2026) |
|---|---|---|---|
| MMLU | 4-way multiple choice, 57 subjects | Broad undergraduate-level knowledge | Saturated - frontier models score 88-92%, human expert ceiling ~90% |
| GPQA-Diamond | 4-way multiple choice, ~200 graduate science questions | PhD-level biology, physics, chemistry | Still useful - PhD experts hit ~65%, frontier models cross 70-80% |
| HumanEval | 164 Python function-completion problems | Basic code synthesis from docstrings | Saturated - frontier models pass 90%+ |
| HumanEval+ (EvalPlus) | Same prompts, 80x more tests | Same skill with stronger correctness checks | Lowers pass rates by up to 19-29% versus original HumanEval |
| MATH | 12,500 high-school competition problems | Multi-step symbolic reasoning | Approaching ceiling with reasoning models |
| SWE-bench | 2,294 real GitHub issues across 12 Python repos | End-to-end repository-scale software engineering | Active frontier - best public scores in the 60-70% range |

Saturation

MMLU was hard in 2020 and routine by 2024. Once a benchmark hits the human-expert ceiling, score deltas between models stop tracking real capability and start tracking prompt-engineering effort, evaluation harness choice, and overfitting to the public test split. The Hugging Face team showed that the same model scores 49% or 64% on MMLU depending on which harness you run - prompt formatting, log-likelihood normalisation, and answer-extraction logic together moved the score by 15 points. A leaderboard table without a harness footnote is roughly meaningless.
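To make the answer-extraction effect concrete, here is a minimal sketch that scores the same four completions with two extraction rules. The completions, the two rules, and the function names are all invented for illustration; they are not taken from any real harness.

```python
import re
from typing import Callable, Optional

# Hypothetical raw completions for four 4-way multiple-choice questions,
# each paired with its gold answer letter. Invented for illustration.
SAMPLES = [
    ("B", "B"),
    ("The answer is C.", "C"),
    ("(A) because the enzyme is denatured", "A"),
    ("Answer: D", "D"),
]


def extract_strict(completion: str) -> Optional[str]:
    """Accept only a completion that is exactly one letter A-D."""
    text = completion.strip()
    return text if text in {"A", "B", "C", "D"} else None


def extract_lenient(completion: str) -> Optional[str]:
    """Take the first standalone capital letter A-D anywhere in the completion."""
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None


def accuracy(extract: Callable[[str], Optional[str]]) -> float:
    """Fraction of samples whose extracted letter equals the gold letter."""
    hits = sum(extract(completion) == gold for completion, gold in SAMPLES)
    return hits / len(SAMPLES)


strict_score = accuracy(extract_strict)
lenient_score = accuracy(extract_lenient)
print(f"strict: {strict_score:.2f}, lenient: {lenient_score:.2f}")
```

The model outputs are identical in both runs; only the parsing rule differs, and on this toy set the strict rule throws away three correct answers. Real harnesses differ in subtler ways, but the mechanism is the same.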

Contamination

Internet-scraped pretraining data routinely overlaps with the public test sets of popular benchmarks. Models can recognise the questions, sometimes verbatim, sometimes in paraphrased or translated form. The Sainz et al. EMNLP 2023 position paper argues current NLP evaluation is "in trouble" and that contamination is silently inflating scores and corrupting the literature. Decontamination at training time helps but cannot be verified by external readers, which is one reason the field keeps minting new benchmarks.
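One common heuristic for spotting verbatim leakage is word n-gram overlap between a test item and a training document. The sketch below is a simplified version under stated assumptions: whitespace tokenisation, an 8-word window, and example strings that are invented here. Published checks use their own tokenisers and window sizes.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than the window: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical test question and two hypothetical scraped documents.
question = (
    "Which of the following organelles is responsible "
    "for ATP synthesis in eukaryotic cells"
)
verbatim_doc = "quiz dump " + question + " answer b"
paraphrased_doc = "Name the organelle that produces most ATP in a eukaryotic cell"

verbatim_overlap = overlap_fraction(question, verbatim_doc)
paraphrase_overlap = overlap_fraction(question, paraphrased_doc)
print(verbatim_overlap, paraphrase_overlap)
```

The verbatim copy is flagged and the paraphrase is not, which is the weakness described above: exact-match checks catch the easy cases and miss reworded or translated leaks.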

The "frontier benchmark of the month" cycle

GPQA-Diamond, ARC-AGI, FrontierMath, Humanity's Last Exam - each becomes the headline metric for one or two release cycles, gets gamed or saturated, and is replaced. Treat any single benchmark as a leading indicator, not a verdict. The defensible play for an applied team is to track a basket of three or four current frontier benchmarks plus your own internal evaluation set.
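Part of the case for a basket is plain sampling noise: small test sets cannot resolve small score gaps. The sketch below uses the normal approximation to a binomial proportion and treats questions as independent, both simplifying assumptions. The item counts come from the table above; the accuracy values are hypothetical.

```python
import math
from typing import Tuple


def accuracy_interval(accuracy: float, n_items: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n_items questions."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    low = max(0.0, accuracy - z * standard_error)
    high = min(1.0, accuracy + z * standard_error)
    return low, high


# (benchmark, number of items, hypothetical measured accuracy)
RUNS = [
    ("GPQA-Diamond", 200, 0.75),
    ("HumanEval", 164, 0.90),
    ("SWE-bench", 2294, 0.65),
]

for name, n_items, acc in RUNS:
    low, high = accuracy_interval(acc, n_items)
    print(f"{name}: {acc:.0%} measured, interval {low:.1%} to {high:.1%}")
```

At 75% on roughly 200 questions the interval is about plus or minus 6 points, so a 2-point lead on a benchmark of that size is within noise before harness or contamination effects are even considered.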

Coding: HumanEval, HumanEval+, SWE-bench

HumanEval was the de-facto code benchmark for years; the original test suite averages fewer than ten tests per problem, which lets buggy code pass. EvalPlus blows the test count up 80x and shows that pass rates fall by up to 19-29% - many earlier claims of "human-level coding" were artefacts of weak tests. SWE-bench raised the bar further by demanding the model fix real GitHub issues end to end. It is the closest open benchmark we have to "can this model be a junior engineer."
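A toy illustration of how a thin test suite lets a wrong solution through. The problem, the candidate solution, and both test lists are invented here; none of them come from HumanEval or EvalPlus.

```python
from typing import Callable, List, Tuple

# Hypothetical task: "return True if text is a palindrome, ignoring case".


def is_palindrome_candidate(text: str) -> bool:
    """Hypothetical model output: compares characters but forgets to ignore case."""
    return text == text[::-1]


# A small original-style suite: every case happens to be lowercase.
BASE_TESTS: List[Tuple[str, bool]] = [("", True), ("aba", True), ("abc", False)]

# Extra cases of the kind an augmented suite might add: mixed case.
EXTRA_TESTS: List[Tuple[str, bool]] = [("Aba", True), ("abBA", True), ("ab", False)]


def passes(candidate: Callable[[str], bool], tests: List[Tuple[str, bool]]) -> bool:
    """True only if the candidate returns the expected value on every test."""
    return all(candidate(arg) == expected for arg, expected in tests)


passes_base = passes(is_palindrome_candidate, BASE_TESTS)
passes_augmented = passes(is_palindrome_candidate, BASE_TESTS + EXTRA_TESTS)
print(f"base suite: {passes_base}, augmented suite: {passes_augmented}")
```

The candidate counts as solved under the base suite and as failed under the augmented one. Scale that across a benchmark and the headline pass rate drops without the model changing at all.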

When it falls down

  • Multiple choice rewards pattern matching. A model can pick the right letter by elimination heuristics without ever doing the underlying reasoning.
  • Public benchmarks leak. If the test set is on the internet, assume it is in pretraining.
  • Scores do not compose. 92% on MMLU and 90% on HumanEval do not predict 80% on your specific extraction or routing task.
  • Reasoning models change the shape. OpenAI o-series and similar test-time-compute models can spend more tokens to buy a higher score, so a comparison is only fair when it holds the token budget or wall-clock time equal, or at least reports it next to the score.
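For the last point, one way to keep comparisons honest is to score every model under the same output-token cap. The sketch below is one possible convention, not a standard: it counts an over-budget answer as a failure. The `Attempt` record, the cap values, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One model answer to one benchmark question (hypothetical record)."""
    correct: bool
    output_tokens: int


def accuracy_under_budget(attempts: List[Attempt], max_tokens: int) -> float:
    """Accuracy where an answer counts only if it is right and within max_tokens."""
    if not attempts:
        return 0.0
    hits = sum(a.correct and a.output_tokens <= max_tokens for a in attempts)
    return hits / len(attempts)


# Invented results for two models on the same four questions.
reasoning_model = [
    Attempt(True, 4000), Attempt(True, 6000), Attempt(True, 900), Attempt(False, 8000),
]
standard_model = [
    Attempt(True, 300), Attempt(False, 250), Attempt(True, 400), Attempt(False, 350),
]

for cap in (10_000, 1_000):
    r = accuracy_under_budget(reasoning_model, cap)
    s = accuracy_under_budget(standard_model, cap)
    print(f"cap {cap} tokens: reasoning {r:.2f}, standard {s:.2f}")
```

With a generous cap the reasoning model leads; with a tight cap the ranking flips. Neither number is wrong, but a score reported without its budget hides which of the two situations you are in.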

Further reading