Public Benchmarks - MMLU, GPQA, HumanEval, MATH
A tour of the academic benchmarks that anchor frontier model launches, and why most of them are saturating, contaminated, or both.
Every model card opens with a table of benchmark scores. They are how labs signal progress and how the press writes headlines. They are also, almost without exception, broken in ways that matter for anyone deciding whether to ship a model.
What each one actually measures
| Benchmark | Format | What it probes | Frontier state (2025-2026) |
|---|---|---|---|
| MMLU | 4-way multiple choice, 57 subjects | Broad undergraduate-level knowledge | Saturated - frontier models score 88-92%, human expert ceiling ~90% |
| GPQA-Diamond | 4-way multiple choice, ~200 graduate science questions | PhD-level biology, physics, chemistry | Still useful - PhD experts hit ~65%, frontier models cross 70-80% |
| HumanEval | 164 Python function-completion problems | Basic code synthesis from docstrings | Saturated - frontier models pass 90%+ |
| HumanEval+ (EvalPlus) | Same prompts, 80x more tests | Same skill with stronger correctness checks | Cuts pass rates on original HumanEval by up to 19-29% |
| MATH | 12,500 high-school competition problems | Multi-step symbolic reasoning | Approaching ceiling with reasoning models |
| SWE-bench | 2,294 real GitHub issues across 12 repos | End-to-end repository-scale software engineering | Active frontier - best public scores in the 60-70% range |
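The question counts in the table also bound how much a score difference can mean. As a rough sketch, assuming each question is an independent pass/fail trial (a simplification), the binomial standard error for a 75% score on a set of about 200 questions, the GPQA-Diamond size quoted above, works out as follows. The function name is illustrative, not taken from any evaluation library.

```python
import math


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate, treating each question
    as an independent Bernoulli trial (an assumption, not a given)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# 75% accuracy on a set of roughly 200 questions.
se = binomial_standard_error(0.75, 200)

# Half-width of an approximate 95% interval (normal approximation).
ci_half_width = 1.96 * se

print(f"standard error: {se:.3f}, 95% half-width: {ci_half_width:.3f}")
```

Under these assumptions the 95% interval is about six points either side, so a gap of a few points between two models on a 200-question set is hard to distinguish from sampling noise.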
Saturation
MMLU was hard in 2020 and routine by 2024. Once a benchmark hits the human-expert ceiling, score deltas between models stop tracking real capability and start tracking prompt-engineering effort, evaluation harness choice, and overfitting to the public test split. The Hugging Face team showed that the same model scores 49% or 64% on MMLU depending on which harness you run - prompt formatting, log-likelihood normalisation, and answer-extraction logic together account for a gap of about 15 points. A leaderboard table without a harness footnote is close to meaningless.
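Two of those harness choices are easy to see in miniature. The sketch below is not any real harness's code. The log-probabilities, model outputs, and function names are all hypothetical, chosen only to show that length normalisation can change which option is selected and that a stricter answer extractor can mark a correct response as wrong.

```python
import re
from typing import Dict, List, Optional


def pick_raw(choice_logprobs: Dict[str, List[float]]) -> str:
    """Select the option whose tokens have the highest summed log-probability."""
    return max(choice_logprobs, key=lambda c: sum(choice_logprobs[c]))


def pick_length_normalised(choice_logprobs: Dict[str, List[float]]) -> str:
    """Select the option with the highest mean per-token log-probability."""
    return max(
        choice_logprobs,
        key=lambda c: sum(choice_logprobs[c]) / len(choice_logprobs[c]),
    )


# Hypothetical per-token log-probs for four answer strings of different lengths.
example = {
    "A": [-1.2],              # sum -1.2, mean -1.2
    "B": [-0.5, -0.4, -0.6],  # sum -1.5, mean -0.5
    "C": [-2.0, -1.9],        # sum -3.9, mean -1.95
    "D": [-0.9, -1.5],        # sum -2.4, mean -1.2
}
raw_choice = pick_raw(example)                 # shortest answer wins on raw sum
norm_choice = pick_length_normalised(example)  # longer answer wins per token


def extract_strict(text: str) -> Optional[str]:
    """Accept only outputs that begin with a bare option letter."""
    match = re.match(r"\s*([ABCD])\b", text)
    return match.group(1) if match else None


def extract_lenient(text: str) -> Optional[str]:
    """Also accept phrasing such as 'The answer is B.'"""
    match = re.search(r"answer is\s*\(?([ABCD])\)?", text, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return extract_strict(text)


# Hypothetical generations for one question whose gold answer is "B".
outputs = ["B", "The answer is B.", "(B)"]
strict_correct = sum(extract_strict(o) == "B" for o in outputs)
lenient_correct = sum(extract_lenient(o) == "B" for o in outputs)

print(raw_choice, norm_choice, strict_correct, lenient_correct)
```

Neither selection rule nor extractor is the correct one. The point is that a reported score only means something alongside a statement of which was used.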
Contamination
Internet-scraped pretraining data overlaps with the public test sets of every popular benchmark. Models can recognise the questions, sometimes verbatim, sometimes in paraphrased or translated form. The Sainz et al. EMNLP 2023 position paper argues current NLP evaluation is "in trouble" and that contamination is silently inflating scores and corrupting the literature. Decontamination at training time helps but cannot be verified by external readers, which is why the field keeps minting new benchmarks faster than models can saturate them.
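A common first-pass contamination check is word n-gram overlap between a benchmark item and a candidate training document. The sketch below is a minimal version under stated assumptions: whitespace tokenisation, an arbitrary window of 8 words, and made-up example strings. It can only catch verbatim or near-verbatim leakage, and it will miss the paraphrased and translated cases described above.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased, whitespace-tokenised word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical corpus documents.
item = "what is the capital of france and when was it founded"
leaked_doc = "quiz dump: " + item + " answer: paris"
clean_doc = "a travel blog about visiting museums and cafes in several european cities"

leaked_score = overlap_fraction(item, leaked_doc)
clean_score = overlap_fraction(item, clean_doc)

print(leaked_score, clean_score)
```

A high overlap is strong evidence of leakage. A low one proves little, which is part of why external readers cannot verify decontamination claims.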
The "frontier benchmark of the month" cycle
GPQA-Diamond, ARC-AGI, FrontierMath, Humanity's Last Exam - each becomes the headline metric for one or two release cycles, gets gamed or saturated, and is replaced. Treat any single benchmark as a leading indicator, not a verdict. The defensible play for an applied team is to track a basket of three or four current frontier benchmarks plus your own internal evaluation set.
Coding: HumanEval, HumanEval+, SWE-bench
HumanEval was the de-facto code benchmark for years; the original test suite averages around three tests per problem, which lets buggy code pass. EvalPlus expands the test count 80x and shows that pass rates fall by up to 19-29% - many earlier claims of "human-level coding" were artefacts of weak tests. SWE-bench raised the bar further by demanding the model fix real GitHub issues end to end. It is the closest open benchmark we have to "can this model be a junior engineer."
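The weak-test problem is easy to reproduce in miniature. In this sketch the buggy function and both test lists are invented for illustration and are not drawn from HumanEval or EvalPlus. A three-case suite accepts a median function that mishandles even-length input, and a slightly larger suite rejects it.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[List[float], float]


def buggy_median(xs: Sequence[float]) -> float:
    """Hypothetical model-written solution: correct for odd-length input,
    wrong for even-length input (it never averages the two middle values)."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 2]


def passes(fn: Callable[[Sequence[float]], float], tests: List[TestCase]) -> bool:
    """True if fn returns the expected value on every test case."""
    return all(fn(list(args)) == expected for args, expected in tests)


# A small suite in the spirit of "around three tests per problem".
weak_tests: List[TestCase] = [
    ([1, 2, 3], 2),
    ([5], 5),
    ([3, 1, 2], 2),
]

# The same suite plus even-length cases that exercise the missing branch.
strong_tests: List[TestCase] = weak_tests + [
    ([1, 2, 3, 4], 2.5),
    ([4, 1], 2.5),
]

passes_weak = passes(buggy_median, weak_tests)
passes_strong = passes(buggy_median, strong_tests)

print(passes_weak, passes_strong)
```

The same solution counts as a pass or a fail depending only on how thorough the tests are, which is the effect EvalPlus measures at scale.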
When it falls down
- Multiple choice rewards pattern matching. A model can pick the right letter from elimination heuristics without ever forming the underlying reasoning.
- Public benchmarks leak. If the test set is on the internet, assume it is in pretraining.
- Scores do not compose. 92% on MMLU and 90% on HumanEval do not predict 80% on your specific extraction or routing task.
- Reasoning models change the shape. OpenAI o-series and similar test-time-compute models can spend more tokens to buy a higher score, so a comparison is only fair when token or wall-clock budgets are matched and reported.
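Because public scores do not predict performance on a specific task, the practical counterweight is a small internal evaluation set scored the same way on every release. The sketch below assumes a routing task with exact-match scoring. The `model_fn` callable, the example prompts, and the always-"billing" stand-in model are all hypothetical placeholders for a real model call and real labelled data.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model_fn: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Share of examples where the model output equals the expected label,
    compared after trimming whitespace and lowercasing."""
    examples = list(examples)
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return correct / len(examples)


# Hypothetical internal routing set: (prompt, expected queue).
internal_set = [
    ("Route this ticket: 'My invoice is wrong'", "billing"),
    ("Route this ticket: 'App crashes on login'", "technical"),
    ("Route this ticket: 'Cancel my plan'", "billing"),
    ("Route this ticket: 'Reset my password'", "technical"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers 'billing'."""
    return "Billing"


score = exact_match_accuracy(toy_model, internal_set)
print(f"internal routing accuracy: {score:.2f}")
```

Swapping `toy_model` for a real client call keeps the scoring fixed across model versions, which is what makes the number comparable over time.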