Public Benchmarks - MMLU, GPQA, HumanEval, MATH

Every model card opens with a table of benchmark scores. They are how labs signal progress and how the press writes headlines. They are also, almost without exception, broken in ways that matter for anyone deciding whether to ship a model.

What each one actually measures

Benchmark	Format	What it probes	Frontier state (2025-2026)
MMLU	4-way multiple choice, 57 subjects	Broad undergraduate-level knowledge	Saturated - frontier models score 88-92%, human expert ceiling ~90%
GPQA-Diamond	4-way multiple choice, ~200 graduate science questions	PhD-level biology, physics, chemistry	Still useful - PhD experts hit ~65%, frontier models cross 70-80%
HumanEval	164 Python function-completion problems	Basic code synthesis from docstrings	Saturated - frontier models pass 90%+
HumanEval+ (EvalPlus)	Same prompts, 80x more tests	Same skill with stronger correctness checks	Cuts pass rates by up to 19-29% relative to original HumanEval
MATH	12,500 high-school competition problems	Multi-step symbolic reasoning	Approaching ceiling with reasoning models
SWE-bench	2,294 real GitHub issues across 12 Python repos	End-to-end repository-scale software engineering	Active frontier - best public scores in the 60-70% range, typically reported on the SWE-bench Verified subset

Saturation

MMLU was hard in 2020 and routine by 2024. Once a benchmark hits the human-expert ceiling, score deltas between models stop tracking real capability and start tracking prompt-engineering effort, evaluation harness choice, and overfitting to the public test split. The Hugging Face team showed that the same model scores 49% or 64% on MMLU depending on which harness you run: prompt formatting, log-likelihood normalisation, and answer-extraction logic together shift the score by around 15 points. A leaderboard table without a harness footnote is close to meaningless.
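To make the log-likelihood normalisation point concrete, here is a minimal sketch of how a harness might pick a multiple-choice answer from per-token log-probabilities. The function name pick_answer and the numbers are made up for illustration; they are not taken from any particular harness or model.

```python
from typing import Dict, List


def pick_answer(logprobs: Dict[str, List[float]], normalise: bool) -> str:
    """Return the option whose continuation has the highest score.

    With normalise=False the score is the summed log-likelihood;
    with normalise=True it is the mean log-likelihood per token.
    """
    def score(token_logprobs: List[float]) -> float:
        total = sum(token_logprobs)
        return total / len(token_logprobs) if normalise else total

    return max(logprobs, key=lambda option: score(logprobs[option]))


# Hypothetical per-token log-probabilities for two answer continuations.
option_logprobs = {
    "A": [-1.2],               # short answer, one token
    "B": [-0.5, -0.4, -0.5],   # longer answer, three tokens
}

raw_choice = pick_answer(option_logprobs, normalise=False)
norm_choice = pick_answer(option_logprobs, normalise=True)
print(raw_choice, norm_choice)
```

The summed score favours the short option and the per-token score favours the long one, so the same model output is graded differently depending on a harness setting that rarely appears in a leaderboard table.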

Contamination

Internet-scraped pretraining data overlaps with the public test sets of every popular benchmark. Models can recognise the questions, sometimes verbatim, sometimes in paraphrased or translated form. The Sainz et al. EMNLP 2023 position paper argues current NLP evaluation is "in trouble" and that contamination is silently inflating scores and corrupting the literature. Decontamination at training time helps but cannot be verified by external readers, which is why the field keeps minting new benchmarks faster than models can saturate them.
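One common style of decontamination check is word n-gram overlap between a test item and a training document. The sketch below is an illustration under assumptions, not any lab's actual pipeline: the function names, the 8-gram window, and the example strings are all hypothetical.

```python
import re
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase, keep alphanumeric tokens, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical scraped documents.
item = "What is the capital of France and when was it founded"
verbatim_doc = "Quiz dump: what is the capital of France and when was it founded? Answer: Paris"
paraphrased_doc = "Name the French capital city and its founding date"

print(overlap_fraction(item, verbatim_doc))
print(overlap_fraction(item, paraphrased_doc))
```

The verbatim copy is flagged and the paraphrase is not, which is the weakness described above: exact-match filters catch only the easiest form of leakage.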

The "frontier benchmark of the month" cycle

GPQA-Diamond, ARC-AGI, FrontierMath, Humanity's Last Exam - each becomes the headline metric for one or two release cycles, gets gamed or saturated, and is replaced. Treat any single benchmark as a leading indicator, not a verdict. The defensible play for an applied team is to track a basket of three or four current frontier benchmarks plus your own internal evaluation set.

Coding: HumanEval, HumanEval+, SWE-bench

HumanEval was the de facto code benchmark for years; the original test suite averages fewer than ten tests per problem, which lets buggy code pass. EvalPlus expands the test count 80x and shows that pass rates fall by up to 19-29% across the models it evaluated, so many earlier claims of "human-level coding" were partly artefacts of weak tests. SWE-bench raised the bar further by demanding the model fix real GitHub issues end to end. It is the closest open benchmark we have to "can this model be a junior engineer."
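A toy illustration of how a sparse test suite lets a bug through. The function and both test lists are invented for this example; they are not drawn from HumanEval or EvalPlus.

```python
def buggy_median(values):
    """Intended to return the median; wrong for even-length inputs."""
    ordered = sorted(values)
    # Bug: for an even number of items this returns the upper middle value
    # instead of the mean of the two middle values.
    return ordered[len(ordered) // 2]


def passes(func, tests):
    """True if func returns the expected value for every (input, expected) pair."""
    return all(func(args) == expected for args, expected in tests)


# A sparse suite that happens to contain only odd-length inputs.
sparse_tests = [([1], 1), ([3, 1, 2], 2), ([5, 5, 5], 5)]

# A stronger suite that adds the even-length cases.
extended_tests = sparse_tests + [([1, 2], 1.5), ([4, 1, 3, 2], 2.5)]

print(passes(buggy_median, sparse_tests))
print(passes(buggy_median, extended_tests))
```

The same completion counts as a pass under the sparse suite and a failure under the extended one, which is the mechanism behind the score drop when tests are strengthened.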

When it falls down

Multiple choice rewards pattern matching. A model can pick the right letter from elimination heuristics without ever forming the underlying reasoning.
Public benchmarks leak. If the test set is on the internet, assume it is in pretraining.
Scores do not compose. 92% on MMLU and 90% on HumanEval do not predict 80% on your specific extraction or routing task.
Reasoning models change the shape. OpenAI o-series and similar test-time-compute models can spend more tokens to buy a higher score, so comparing scores without matching token or wall-clock budgets is misleading.
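For the last point, one way to keep comparisons honest is to record the token budget next to every score and compare only runs that fit the budget you can actually afford. The following is a rough sketch; the Run record, the model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    model: str
    accuracy: float          # fraction of benchmark items answered correctly
    mean_output_tokens: int  # average tokens generated per item


def best_within_budget(runs: List[Run], max_tokens: int) -> Optional[Run]:
    """Return the most accurate run whose mean token use fits the budget, if any."""
    eligible = [r for r in runs if r.mean_output_tokens <= max_tokens]
    return max(eligible, key=lambda r: r.accuracy, default=None)


# Made-up results: the same reasoning model at two effort settings,
# plus a non-reasoning model.
runs = [
    Run("model-a", accuracy=0.70, mean_output_tokens=400),
    Run("model-b-high-effort", accuracy=0.78, mean_output_tokens=6000),
    Run("model-b-low-effort", accuracy=0.66, mean_output_tokens=500),
]

print(best_within_budget(runs, max_tokens=1000))
print(best_within_budget(runs, max_tokens=10000))
```

In this made-up data the ranking flips with the budget: the reasoning model wins only when it is allowed to spend roughly fifteen times as many tokens per item.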