Evaluation & MLOps

HELM and Holistic Evaluation

Why a single accuracy number is gameable, and how Stanford's HELM, Google's BIG-bench, and EleutherAI's lm-evaluation-harness push evaluation toward a multi-axis picture.

A single leaderboard number compresses everything you might care about - accuracy, robustness, calibration, bias, latency - into one scalar. That scalar is what gets optimised, gamed, and screenshotted. Holistic evaluation pushes back by insisting the model card show the whole vector.

The HELM framing

Stanford's Center for Research on Foundation Models released HELM (Holistic Evaluation of Language Models) in late 2022. The framing is deliberately broad: for each core scenario, measure seven axes wherever they apply.

| Axis | What it captures |
| --- | --- |
| Accuracy | Standard task performance |
| Calibration | Does the model's confidence match its correctness? |
| Robustness | Do paraphrases, typos, or other perturbations collapse the score? |
| Fairness | Does performance differ across demographic groups? |
| Bias | Does the output reflect stereotype patterns? |
| Toxicity | Does the model emit harmful content under benign prompts? |
| Efficiency | Tokens, latency, and energy per inference |
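
The robustness axis listed above can be probed with a simple perturbation test: score the model on clean prompts, score it again on noised copies, and report the gap. The sketch below is illustrative only. The adjacent-letter-swap perturbation, the function names, and the keyword-matching toy model are all assumptions made for this example, not HELM's actual perturbation suite.

```python
import random


def add_typos(text, rate, rng):
    """Swap adjacent letters with probability `rate` (a hypothetical, very crude perturbation)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def accuracy(model, examples):
    """Fraction of (prompt, label) pairs the model answers exactly right."""
    return sum(model(prompt) == label for prompt, label in examples) / len(examples)


def robustness_gap(model, examples, rate=0.1, seed=0):
    """Return (clean accuracy, perturbed accuracy, drop) for one perturbation rate."""
    rng = random.Random(seed)  # fixed seed so the perturbed set is reproducible
    perturbed = [(add_typos(prompt, rate, rng), label) for prompt, label in examples]
    clean = accuracy(model, examples)
    noisy = accuracy(model, perturbed)
    return clean, noisy, clean - noisy


def toy_model(prompt):
    """Stand-in for a real model: brittle keyword matching."""
    return "positive" if "good" in prompt else "negative"


examples = [("good movie", "positive"), ("bad movie", "negative")]
print(robustness_gap(toy_model, examples, rate=1.0))
```

Reporting the clean score, the perturbed score, and the drop together keeps a model that is accurate but brittle from hiding behind its clean number.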

The original HELM v1 evaluated 30 models on 42 scenarios, 16 of them designated core scenarios. Before HELM, models had on average been evaluated on just 17.9% of the core scenarios, with little overlap between papers; HELM raised that coverage to 96.0%. The framework's biggest contribution was forcing apples-to-apples comparison across a common matrix.

Why multi-axis matters in production

A model that wins the accuracy column can lose the calibration column badly, which is exactly the failure mode that triggers production incidents - the model is confidently wrong on a tail slice. Robustness reveals which models break under realistic input noise (typos, low-resource phrasing, prompt drift). Efficiency separates "frontier capability you can afford" from "frontier capability you can demo."
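
One common way to put a number on "confidently wrong" is expected calibration error (ECE): bucket predictions by stated confidence, then take the weighted average gap between confidence and observed accuracy in each bucket. The sketch below uses equal-width bins and made-up inputs. The function name and the choice of ten bins are assumptions for illustration, not a claim about how any particular framework computes the metric.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(acc - mean_conf)
    return ece


# Two hypothetical models with identical accuracy (50%) but different confidence.
hedged = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [True, True, False, False])
overconfident = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [True, True, False, False])
print(hedged, overconfident)
```

Both toy models would tie on an accuracy leaderboard; only the calibration column separates the one that knows it is guessing from the one that does not.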

Single-number leaderboards are gameable in obvious ways: train on the test format, prompt-tune to the eval harness, sample at high temperature with majority vote. A seven-axis report card is much harder to fake because optimising any one axis tends to cost you on the others.
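
To see how much the majority-vote trick alone can move a headline number, here is a small worked calculation. It rests on a strong simplifying assumption: each sample is independently correct with probability p, and the vote counts as correct whenever a strict majority of the k samples is correct. Real samples from one model are correlated, so treat the result as an illustration of the direction of the effect rather than a prediction.

```python
from math import comb


def majority_vote_accuracy(p, k):
    """P(strict majority of k independent samples is correct), each correct with probability p."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so ties cannot occur")
    needed = k // 2 + 1
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(needed, k + 1))


for k in (1, 5, 15):
    print(k, round(majority_vote_accuracy(0.6, k), 4))
```

Under these assumptions a model that is right 60% of the time per sample reports roughly 68% with a five-way vote, with no change to the underlying model. A report that omits the sampling setup cannot be compared with one that uses a single greedy decode.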

BIG-bench

BIG-bench (Beyond the Imitation Game) is the sibling effort, led by Google and built as a community collaboration: 204 tasks contributed by 450 authors across 132 institutions, deliberately weighted toward problems believed to be beyond contemporary model capabilities. The "hard" subset (BIG-bench Hard, 23 tasks) became the workhorse - smaller, harder, and used as a standard reasoning probe through 2023-2024.

The de-facto eval runner: lm-evaluation-harness

EleutherAI's lm-evaluation-harness is the framework most labs actually run. It supports 60+ standard academic benchmarks with hundreds of subtasks, works against HuggingFace transformers, vLLM, and the OpenAI and Anthropic APIs, and is the backend for Hugging Face's Open LLM Leaderboard. If you are building an internal eval pipeline and want your numbers to be comparable to published ones, extend this framework rather than rolling your own.
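
As a rough sketch of what driving the harness from Python can look like, the snippet below builds the arguments for `lm_eval.simple_evaluate` and calls it lazily. The helper names `build_eval_kwargs` and `run_eval`, the model `EleutherAI/pythia-160m`, and the task `hellaswag` are placeholders chosen for this example. Argument names can shift between harness versions, so check them against the version you have installed.

```python
from typing import List, Optional


def build_eval_kwargs(pretrained: str, tasks: List[str], limit: Optional[int] = None) -> dict:
    """Collect keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",                            # HuggingFace transformers backend
        "model_args": f"pretrained={pretrained}",
        "tasks": tasks,
        "num_fewshot": 0,
        "batch_size": 8,
        "limit": limit,                           # cap examples per task; for smoke tests only
    }


def run_eval(kwargs: dict) -> dict:
    """Run the harness and return its per-task results table."""
    import lm_eval  # imported lazily; requires `pip install lm-eval`

    output = lm_eval.simple_evaluate(**kwargs)
    return output["results"]  # keyed by task name


kwargs = build_eval_kwargs("EleutherAI/pythia-160m", ["hellaswag"], limit=20)
# scores = run_eval(kwargs)  # uncomment once lm-eval and the model weights are available
```

Numbers produced with `limit` set are only useful for checking that the pipeline runs; leave it unset for anything you intend to compare with published results.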

Trade-offs

  • HELM is expensive. Running the full matrix on a single model costs real money in API credits and GPU-hours. Most teams run a curated subset.
  • Coverage drifts. A benchmark added in 2022 to probe an emergent capability often saturates within 18 months. HELM and lm-eval-harness both have to keep retiring and adding scenarios.
  • Axes interact. A safety-tuned model can score lower on accuracy because it refuses borderline tasks. Without the joint view you misattribute the regression.
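
The last point can be made concrete by splitting one accuracy number into a refusal rate and an accuracy on the prompts the model actually answered. The record format, function name, and counts below are invented for illustration; the sketch assumes a refusal is scored as incorrect in the headline number.

```python
def decompose_accuracy(records):
    """records: list of (refused, correct) booleans, one pair per prompt."""
    n = len(records)
    answered = [correct for refused, correct in records if not refused]
    n_correct = sum(answered)  # refusals never count as correct
    return {
        "overall_accuracy": n_correct / n,
        "refusal_rate": (n - len(answered)) / n,
        "accuracy_when_answered": n_correct / len(answered) if answered else 0.0,
    }


# Hypothetical base model: answers all 10 prompts, 8 correctly.
base = decompose_accuracy([(False, True)] * 8 + [(False, False)] * 2)
# Hypothetical safety-tuned model: refuses 2, answers 8, gets 7 of those right.
tuned = decompose_accuracy([(True, False)] * 2 + [(False, True)] * 7 + [(False, False)] * 1)
print(base)
print(tuned)
```

In this toy case the tuned model's overall accuracy falls from 0.8 to 0.7, yet its accuracy on answered prompts rises from 0.8 to 0.875. The apparent regression comes entirely from refusals, which is the misattribution the joint view is meant to prevent.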

Further reading