When the Judge Is Also a Player: LLM-as-Judge, Contamination, and Why Leaderboards Drift

In June 2023, a team at LMSYS published a paper with an uncomfortable proposal: stop trying to write reference answers for open-ended questions, and instead let a strong language model decide which of two responses is better. They measured how often GPT-4, acting as a judge, agreed with human annotators, and found agreement above 80%, the same rate at which two humans agreed with each other (Zheng et al., 2023, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, arXiv:2306.05685). That single result reshaped how the field evaluates models. Within two years, almost every alignment paper, every model card, and every leaderboard leaned on an LLM judge somewhere in its pipeline.

The same paper, in the same breath, listed the reasons not to trust it: position bias, verbosity bias, self-enhancement bias, and limited reasoning. Those caveats were quieter than the headline. They should not have been. A judge that can be swayed by which answer comes first, by which answer is longer, or by which model produced it is not measuring quality alone. Layer on a second problem, that the questions themselves may have leaked into training data, and you get the central puzzle of modern evaluation: a leaderboard number can move by several points while nothing about the model's actual capability has changed.

Why this matters: Most teams now ship models partly on the strength of an automatic eval, a single score from a judge model or a public benchmark. If that score reflects answer length and prior exposure as much as reasoning, you are optimizing the proxy, not the thing. Knowing exactly how the proxy lies is the difference between a real improvement and a number that fooled you.

TL;DR

LLM-as-judge replaced reference-based scoring for open-ended tasks because it correlates well with human preference (over 80% agreement for GPT-4 on MT-Bench), but it inherits systematic, measurable biases that controlled studies have since quantified.
Position bias is real and not random: in a study of 15 judges over 150,000+ comparisons, judges frequently changed their verdict when the two answers were swapped, and the effect grows as the quality gap shrinks (Shi et al., 2024, arXiv:2406.07791).
Verbosity bias is large enough that controlling for length raised AlpacaEval's Spearman correlation with Chatbot Arena's human rankings from 0.94 to 0.98, and made the metric far harder to game by padding answers (Dubois et al., 2024, arXiv:2404.04475).
Self-preference is a genuine confound: models recognize their own outputs, and the strength of that recognition correlates linearly with how much they over-score themselves (Panickssery et al., 2024, arXiv:2404.13076).
Contamination is the second leak. Simple n-gram decontamination misses paraphrased and translated test items, and a 13B model trained on rephrased benchmarks reached GPT-4-level scores it had not earned (Yang et al., 2023, arXiv:2311.04850).
A clean re-test tells the truth: when Scale AI built GSM1k to mirror GSM8k, some model families dropped by up to roughly 8% (up to 13% in an earlier version of the analysis), and the drop tracked how often a model could reproduce GSM8k verbatim (Zhang et al., 2024, arXiv:2405.00332).
Mitigations exist (swap-and-average, length control, held-out judges, fresh test sets), but each trades cost or coverage for trust. There is no free, contamination-proof, bias-free automatic eval.

At a Glance

A modern evaluation result is the product of two pipelines stacked on top of each other: the benchmark that supplies the questions, and the judge that scores the answers. Either can quietly distort the final number.

flowchart LR
  Q[Benchmark questions] --> M[Model under test]
  M --> A[Candidate answers]
  A --> J[LLM judge]
  J --> S[Leaderboard score]
  C[Training data leak] -.distorts.-> M
  B[Judge biases] -.distorts.-> J
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  class Q,A blue
  class M,J purple
  class S teal
  class C,B rose

The two dotted arrows are the whole story. The solid path looks rigorous; the dotted path is where the score detaches from capability.

Before Automatic Judges

For most of NLP's history, evaluation meant matching a model's output against a fixed reference. BLEU scored translations by n-gram overlap with human translations; ROUGE did the same for summaries; exact-match and F1 graded extractive question answering. These metrics were cheap, deterministic, and reproducible, and they were fine when the right answer was a short span of text.

[IMAGE: Before/after comparison panel, left shows BLEU scoring a paraphrased-but-correct answer near zero, right shows an LLM judge correctly rating it high, annotated with the n-gram overlap count]

They broke as soon as models started producing open-ended prose. There are thousands of good ways to answer "explain why the sky is blue at a sixth-grade level," and almost none of them share many n-grams with any single reference. Overlap metrics punish valid paraphrases and reward shallow keyword stuffing. By 2022, the gap between what BLEU measured and what users actually preferred had become embarrassing.

timeline
  title Evolution of open-ended LLM evaluation
  2002 : BLEU and overlap metrics
  2020 : GPT-3 popularizes 13-gram contamination checks
  2023 : MT-Bench and Chatbot Arena, LLM-as-judge
  2024 : Bias and contamination studies quantify the failure modes
  2026 : Contamination-resistant and held-out evaluation becomes standard practice

Two responses to this gap emerged in 2023, and they are still the two poles of the field. Chatbot Arena collected live human votes between anonymous model pairs and aggregated them with a Bradley-Terry model, the maximum-likelihood version of Elo, which estimates each model's latent strength from pairwise win-loss records (Chiang et al., 2024, Chatbot Arena, arXiv:2403.04132). MT-Bench took the cheaper route: replace the human voter with GPT-4. The Arena is slow and expensive but grounded in real preference; the judge is fast and scalable but inherits a model's quirks. Most of the field reached for the judge.
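
To make the aggregation step concrete, here is a minimal sketch of a Bradley-Terry fit over pairwise votes, using the classic minorization-maximization iteration. The function name, the data layout, and the toy votes are illustrative assumptions; this is not Chatbot Arena's production pipeline, which adds details such as ties and confidence intervals.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict model -> strength, normalized to sum to 1, so that
    P(i beats j) is estimated as p[i] / (p[i] + p[j]).
    Assumes winner != loser and that every model has at least one win.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # games played per unordered pair
    for winner, loser in matches:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    models = {m for pair in games for m in pair}
    p = {m: 1.0 / len(models) for m in models}

    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: p_i <- wins_i / sum_j [ n_ij / (p_i + p_j) ]
            denom = 0.0
            for pair, n in games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (p[i] + p[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        p = {m: v / total for m, v in updated.items()}
    return p


# Hypothetical votes: A beats B three times, B beats A once.
votes = [("A", "B")] * 3 + [("B", "A")]
strengths = fit_bradley_terry(votes)
print(strengths["A"] / (strengths["A"] + strengths["B"]))
```

With three wins out of four, the fitted probability that A beats B comes out at 0.75, matching the empirical rate. The fit uses only the win counts, so shuffling the order of the votes leaves the result unchanged.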

How LLM-as-Judge Actually Works

Stripped to its mechanics, an LLM judge is a prompt. You give a capable model a question, one or two candidate answers, and a rubric, and you ask it to either score each answer on a scale or pick the winner. The two dominant formats are pointwise scoring (rate this answer 1 to 10) and pairwise comparison (which of A or B is better). Pairwise tends to be more reliable, because relative judgments are easier than absolute calibration, but pairwise is exactly where position bias lives.
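
The two formats can be sketched as a pair of prompt templates plus verdict parsers. Everything here is a hypothetical illustration: the template wording, the double-bracket verdict convention, and the function names are assumptions, not the prompts used by any particular benchmark.

```python
import re

# Hypothetical templates; the wording is illustrative only.
PAIRWISE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question.\n"
    "Rubric: {rubric}\n\n"
    "[Question]\n{question}\n\n"
    "[Answer A]\n{answer_a}\n\n"
    "[Answer B]\n{answer_b}\n\n"
    "End your reply with [[A]], [[B]], or [[TIE]]."
)

POINTWISE_TEMPLATE = (
    "You are an impartial judge. Rate the answer to the question.\n"
    "Rubric: {rubric}\n\n"
    "[Question]\n{question}\n\n"
    "[Answer]\n{answer}\n\n"
    "End your reply with a rating from 1 to 10 in the form [[7]]."
)


def build_pairwise_prompt(question, answer_a, answer_b, rubric="helpfulness and accuracy"):
    """Fill the pairwise template; answer_a always occupies the first slot."""
    return PAIRWISE_TEMPLATE.format(
        rubric=rubric, question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_pairwise(reply):
    """Return 'A', 'B', or 'TIE', or None if no verdict tag is present."""
    match = re.search(r"\[\[(A|B|TIE)\]\]", reply)
    return match.group(1) if match else None


def parse_pointwise(reply):
    """Return an integer score in 1..10, or None if missing or out of range."""
    match = re.search(r"\[\[(\d+)\]\]", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

Note that `build_pairwise_prompt` fixes which answer sits in which slot. That ordering choice is the entry point for position bias, which is why a harness needs to call it twice with the answers exchanged.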

[IMAGE: Side-by-side schematic of pointwise vs pairwise judging prompts, annotated to show where position bias (pairwise) and calibration drift (pointwise) enter]

The judging loop

sequenceDiagram
  participant U as Eval harness
  participant J as Judge model
  participant Agg as Aggregator
  U->>J: Question plus answer A then answer B
  J-->>U: Verdict, A wins
  U->>J: Same question, answer B then answer A
  J-->>U: Verdict, B wins
  Note over U,Agg: Disagreement signals position bias
  U->>Agg: Average both orderings
  Agg-->>U: Debiased pairwise result

[IMAGE: Heatmap of judge verdicts by answer position (A vs B) across several judge models, cells colored by win rate, revealing the off-diagonal asymmetry that signals position bias]

That swap-and-average step is the cheapest defense against position bias and the single most important line in a serious eval harness. If a judge picks the first answer both times, regardless of which model produced it, the judge is reading position, not quality. The MT-Bench authors documented exactly this and recommended running each comparison in both orders (Zheng et al., 2023, arXiv:2306.05685).
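
A minimal sketch of that step, assuming the judge is wrapped as a plain function that takes a question and two answers in slot order and returns "first", "second", or "tie". That interface and the two toy judges are hypothetical stand-ins for a real model call.

```python
def swap_and_average(judge, question, answer_1, answer_2):
    """Score answer_1 against answer_2 in both orderings.

    Returns (score, consistent): score is in [0, 1] from answer_1's point of
    view, and consistent is False when the verdict changed with the ordering.
    """
    to_score = {"first": 1.0, "second": 0.0, "tie": 0.5}
    # Ordering 1: answer_1 sits in the first slot.
    score_forward = to_score[judge(question, answer_1, answer_2)]
    # Ordering 2: answer_1 sits in the second slot, so flip the verdict.
    score_swapped = 1.0 - to_score[judge(question, answer_2, answer_1)]
    return (score_forward + score_swapped) / 2, score_forward == score_swapped


def first_slot_judge(question, first, second):
    """Toy judge with pure position bias: always picks the first slot."""
    return "first"


def longer_answer_judge(question, first, second):
    """Toy judge with pure verbosity bias: always picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(swap_and_average(first_slot_judge, "q", "x", "y"))
print(swap_and_average(longer_answer_judge, "q", "short", "a much longer answer"))
```

The position-biased judge averages out to 0.5 and is flagged as inconsistent. The verbosity-biased judge gives a consistent verdict in both orderings, so swapping alone does not expose it.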

Why the biases exist, not just that they do

Position bias is partly an artifact of how transformers attend to context. Models do not treat all positions in a prompt equally; the well-known "lost in the middle" effect shows accuracy sagging for information buried between the start and end of a long context (Liu et al., 2023, Lost in the Middle, arXiv:2307.03172). When two candidate answers occupy different slots, the judge's attention profile is not symmetric across them, so its prior over "which slot wins" is already tilted away from 50/50 before a single token of content is read.

Verbosity bias has a simpler root. Instruction-tuned models are trained on human preference data, and human raters, on average, reward thorough-looking answers. The judge learned that correlation and now applies it even when the extra length adds nothing. The fix is not to tell the judge "ignore length"; that barely works. It is to model the length effect statistically and subtract it, which is what length-controlled AlpacaEval does (Dubois et al., 2024, arXiv:2404.04475).
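
The idea of modeling the length effect and subtracting it can be sketched with a two-parameter logistic regression. This is a deliberately simplified stand-in, not the published AlpacaEval implementation, which uses a richer regression. The tanh length feature, the `scale` constant, the plain gradient-descent fit, and the demo data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def length_controlled_win_rate(records, scale=100.0, lr=0.5, steps=5000):
    """records: (len_model, len_baseline, pref), pref = 1.0 if the judge
    preferred the model and 0.0 if it preferred the baseline.

    Fits logit P(model preferred) = b + w * tanh((len_model - len_baseline) / scale)
    by gradient descent, then reports the prediction with the length term
    set to zero, i.e. sigmoid(b). Returns (raw_win_rate, controlled_win_rate).
    """
    xs = [math.tanh((lm - lb) / scale) for lm, lb, _ in records]
    ys = [pref for _, _, pref in records]
    n = len(records)
    b = w = 0.0
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b + w * x) - y
            grad_b += err
            grad_w += err * x
        b -= lr * grad_b / n
        w -= lr * grad_w / n
    return sum(ys) / n, sigmoid(b)


# Hypothetical verdicts: the model is usually 200 characters longer, and it
# wins 75% of the time when longer but only 25% of the time when shorter.
demo_records = (
    [(600, 400, 1.0)] * 6 + [(600, 400, 0.0)] * 2
    + [(400, 600, 1.0)] * 1 + [(400, 600, 0.0)] * 3
)
raw, controlled = length_controlled_win_rate(demo_records)
print(round(raw, 3), round(controlled, 3))
```

On this toy data the raw win rate is about 0.58, but the whole edge is explained by the length feature, so the length-controlled estimate lands close to 0.5. The estimate is only as good as the assumed functional form of the length effect.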

Self-preference is the subtlest. Panickssery and colleagues showed that GPT-4 and Llama 2 can distinguish their own generations from other models' and from human text at well-above-chance rates, and that fine-tuning to strengthen this self-recognition strengthened self-preference in lockstep, a near-linear relationship (Panickssery et al., 2024, arXiv:2404.13076). A complementary study tied the effect to perplexity: judges score low-perplexity, familiar-looking text higher than humans do, and a model's own outputs are exactly the text it finds least surprising, with GPT-4o assigning roughly 10% higher scores to its own outputs (Wataoka et al., 2024, Self-Preference Bias in LLM-as-a-Judge, arXiv:2410.21819). If you generate candidates and judge them with the same model family, part of your score is measuring familiarity.

[IMAGE: Scatter plot of self-recognition accuracy (x) vs self-preference strength (y) showing the near-linear trend, with GPT-4 and Llama 2 points labeled]

The Second Leak: Contamination

Bias corrupts the judge. Contamination corrupts the questions. The two are independent, and a result can suffer from both.

Contamination is simple to state: the test items, or close variants, appeared in the model's training data, so the model is partly recalling rather than reasoning. The GPT-3 paper already took this seriously, defining contamination as a 13-gram overlap between a test instance and the training corpus and reporting decontaminated scores (Brown et al., 2020, Language Models are Few-Shot Learners, arXiv:2005.14165). The trouble is that string matching is brittle. Paraphrase a question, translate it and translate it back, or swap the numbers in a math problem, and a 13-gram filter sees nothing while the model still benefits from having seen the original.
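
The brittleness is easy to reproduce. Below is a minimal sketch of a word-level 13-gram overlap check. The whitespace-and-punctuation tokenization is an assumption made for brevity and is not the GPT-3 paper's exact procedure, and the test item and training snippets are invented.

```python
import re


def ngrams(text, n=13):
    """Set of word-level n-grams, lowercased with punctuation dropped."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_flagged(test_item, training_doc, n=13):
    """True if the test item shares at least one n-gram with the document."""
    return bool(ngrams(test_item, n) & ngrams(training_doc, n))


# Hypothetical test item and training snippets, invented for illustration.
test_item = ("A farmer has 12 crates and each crate holds 8 apples, "
             "so how many apples does the farmer have in total?")
verbatim_leak = "Homework help thread. " + test_item + " Answer: 96."
paraphrased_leak = ("If one crate contains 8 apples and a farmer owns 12 such "
                    "crates, what is the total number of apples the farmer owns? "
                    "Answer: 96.")

print(is_flagged(test_item, verbatim_leak))
print(is_flagged(test_item, paraphrased_leak))
```

The verbatim copy is flagged and the paraphrase is not, even though both hand the model the same problem and the same answer.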

flowchart TD
  T[Test item] --> O{n-gram overlap with training data}
  O -->|exact match found| D[Flagged and removed]
  O -->|no match| P[Assumed clean]
  P --> R[But a paraphrase was in training]
  R --> L[Score inflated, leak undetected]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  class T blue
  class O purple
  class D emerald
  class P,R amber
  class L rose

Yang and colleagues demonstrated the failure end to end. They took benchmark test sets, rephrased them so no n-gram filter would catch the overlap, trained a 13B model on the rephrased versions, and watched it reach scores comparable to GPT-4 on those benchmarks, scores it plainly had not earned through capability. Their proposed fix replaces string matching with an LLM-based decontaminator that flags semantic, not lexical, overlap (Yang et al., 2023, Rethinking Benchmark and Contamination for Language Models with Rephrased Samples, arXiv:2311.04850).

[IMAGE: Paired dot plot, each model's GSM8k score connected by a line to its GSM1k score, with the steepest drops highlighted to show which families overfit]

The cleanest evidence that contamination inflates public scores came from building a fresh test set. Scale AI commissioned GSM1k, 1,000 new grade-school math problems written to match the style and difficulty of the widely used GSM8k. Re-evaluating frontier and open models on the unseen mirror revealed accuracy drops of up to roughly 8% in the later analysis, with earlier versions of the paper reporting up to 13% for some families. Crucially, the size of each model's drop correlated with how readily it could reproduce GSM8k examples verbatim, the fingerprint of memorization rather than skill (Zhang et al., 2024, A Careful Examination of LLM Performance on Grade School Arithmetic, arXiv:2405.00332). Many frontier models showed minimal drop, which is itself a useful result: contamination is real but not universal, and a clean re-test is how you tell the difference.

By the Numbers

The figures below come from the cited studies. Treat them as the published values for those specific setups, not universal constants; bias magnitude depends heavily on judge, task, and the quality gap between candidates.

Phenomenon	Measured effect	Source
GPT-4 judge vs human agreement (MT-Bench)	Over 80%, matching human-human agreement	Zheng et al., 2023, arXiv:2306.05685
Position bias study scale	15 judges, 22 tasks, ~40 models, 150,000+ comparisons	Shi et al., 2024, arXiv:2406.07791
Length control on AlpacaEval	Spearman correlation with Chatbot Arena 0.94 to 0.98	Dubois et al., 2024, arXiv:2404.04475
Self-preference (GPT-4o on own outputs)	~10% higher scores than warranted	Wataoka et al., 2024, arXiv:2410.21819
Rephrased-contamination overfit	13B model reaches GPT-4-level scores it did not earn	Yang et al., 2023, arXiv:2311.04850
Clean re-test drop (GSM8k to GSM1k)	Up to ~8% in the later analysis; up to ~13% in an earlier version, for some model families	Zhang et al., 2024, arXiv:2405.00332
Contamination flag definition (GPT-3)	13-gram overlap	Brown et al., 2020, arXiv:2005.14165

Two of these numbers deserve a second look. The 0.94 to 0.98 jump from length control sounds modest until you remember what it implies: the gap to a perfect correlation of 1 shrank from 0.06 to 0.02, so roughly two thirds of the residual disagreement between the automatic metric and the human ranking was attributable to length alone. And the position-bias study's headline is not a single percentage but a finding about structure, that bias strengthens as the two answers get closer in quality, which is precisely the regime where you most need the judge to be right (Shi et al., 2024, arXiv:2406.07791).

[IMAGE: Grouped bar chart of position-consistency rate by judge model, ordered, with a reference line at 100% consistency to show how far each falls short]

A Concrete Example

Walk one comparison through the full pipeline. You are evaluating two models, M1 and M2, on the prompt: "Explain why merge sort is \(O(n \log n)\)." You will use a third model, J, as judge in pairwise mode.

M1 answers in four tight sentences: it states the recurrence \(T(n) = 2T(n/2) + O(n)\), applies the master theorem, and concludes \(O(n \log n)\). Correct and complete.

M2 answers in three paragraphs: the same correct argument, plus a restatement of what Big-O means, an analogy about splitting a deck of cards, and a closing summary. Also correct, but padded.

Run the judge naively, M1 in position A and M2 in position B:

Step	State
Prompt to J	"Question. Answer A (M1, short). Answer B (M2, long). Which is better?"
J verdict	B wins, cites "more thorough and complete"
Hidden driver	Verbosity bias, a position-B advantage, or both (not yet separable)

Now apply the two cheap defenses. First, swap and re-ask:

Step	State
Prompt to J	"Question. Answer A (M2, long). Answer B (M1, short)."
J verdict	A wins, again the long answer
Reading	Consistent across swap, so this is verbosity, not position

Position bias is ruled out, because the judge agreed across orderings. But verbosity bias is still live: the judge prefers M2 in both orderings purely because it is longer. If you stopped at swap-and-average you would still hand M2 the win. Apply length control, that is, estimate the preference the judge would express if the two answers were the same length, and the spurious advantage collapses; the judge now sees two equivalent correct arguments and the comparison is a near-tie, which is the honest answer. The lesson is that no single fix is sufficient. Swap-and-average kills position bias and does nothing for length; length control does the reverse.

[IMAGE: Annotated trace showing the same comparison under naive judging, swap-and-average, and length-controlled judging, with the verdict flipping from "M2 wins" to "tie"]

Now add contamination on top. Suppose the merge-sort question happened to appear, verbatim, in M2's fine-tuning data. M2's "reasoning" is then partly recall. A semantic decontaminator scanning M2's training set would flag the overlap; an n-gram filter, if the question were lightly paraphrased, would not. The only fully reliable check is to ask a question you are certain neither model has seen, which is why fresh, held-out test sets remain the gold standard.

Where It Breaks

The failure modes compound, and the worst ones are invisible from the final number alone.

Swap-and-average doubles your judging cost and only addresses ordering. It says nothing about whether the judge understood the question. On hard reasoning tasks, judges are weak in exactly the way the candidates are weak, so a judge can confidently endorse a wrong answer that "looks" rigorous.

Length control assumes length is a confound you can regress out cleanly. When length is genuinely informative (a thorough answer to a complex question really is better), over-correcting penalizes the right model. The method shifts the question from "is the judge fooled by length" to "did you model the length effect correctly," which is progress but not a solved problem.

Self-preference cannot be swapped or regressed away. If your judge and your candidate share a model family, the bias is baked into the comparison. The only structural fix is a judge from a different lineage, or a panel of judges, which raises cost and introduces disagreement you then have to adjudicate.

Contamination has a recursive trap: the moment a benchmark becomes popular enough to matter, it is scraped, discussed, and reposted across the web, so the next training run ingests it. A benchmark's usefulness has a half-life. The famous ones (MMLU, GSM8k, HumanEval) are the most likely to be contaminated precisely because they are famous. MMLU contamination in scraped corpora was reported rising sharply over a three-year span, a direct consequence of its own success.

[IMAGE: Line plot of estimated benchmark contamination rate over time for a popular benchmark, showing the upward creep as the benchmark ages and spreads]

Finally, there is the meta-failure: optimizing against the judge. Once a team trains or selects models to win a particular automatic eval, the eval stops measuring capability and starts measuring eval-gaming, an instance of Goodhart's law. Length padding to beat a verbosity-prone judge is the textbook case, and it is exactly what length-controlled AlpacaEval was built to neutralize.

Alternative Designs

No single evaluation method dominates. The realistic choice is a portfolio, weighted by what you can afford and how much you need to trust the result.

Approach	Strengths	Weaknesses	Best when
Reference-overlap (BLEU, ROUGE, exact match)	Cheap, deterministic, reproducible	Fails on open-ended text, rewards surface overlap	Short, closed-form answers
LLM-as-judge (single)	Scalable, correlates with human preference, explainable	Position, verbosity, self-preference biases	Fast iteration, large answer volume
LLM-as-judge (swap plus length control plus cross-family)	Neutralizes the known biases	2x or more cost, still weak on hard reasoning	High-stakes automatic eval
Human pairwise (Chatbot Arena style)	Grounded in real preference, hard to game	Slow, expensive, noisy, low coverage per dollar	Final model ranking, ground truth
Fresh held-out benchmark (GSM1k style)	Contamination-proof by construction	Expensive to build, one-time use, narrow scope	Auditing suspected contamination

The pattern is that trust costs money. The cheapest methods (overlap, single judge) are the easiest to fool; the most trustworthy (human votes, fresh test sets) are the slowest and least scalable. Most production teams sit in the middle: a debiased LLM judge for the fast loop, periodically calibrated against a smaller human-labeled set, with a fresh held-out set held in reserve for when a result looks too good.

graph TD
  subgraph FastLoop[Fast iteration loop]
    J1[Debiased LLM judge] --> Score1[Daily eval score]
  end
  subgraph Calibration[Periodic trust check]
    H[Human-labeled set] --> Cal[Calibrate judge]
    Fresh[Fresh held-out set] --> Audit[Contamination audit]
  end
  Cal -.adjusts.-> J1
  Audit -.validates.-> Score1
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  class J1,Score1 purple
  class H,Fresh blue
  class Cal,Audit emerald

How It Is Used in Practice

The major leaderboards have converged on layered designs that acknowledge these failures rather than hiding them. Chatbot Arena, run by LMArena (formerly LMSYS), keeps humans in the loop and aggregates their pairwise votes with a Bradley-Terry model, which produces transitive, stable rankings and confidence intervals rather than a raw win rate (Chiang et al., 2024, arXiv:2403.04132). Because the model is fit to all votes jointly rather than updated one match at a time, the result does not depend on the order in which votes arrived, which removes the noise that online Elo accumulates from match ordering. The cost is throughput: human votes arrive slowly, so the Arena lags fast iteration.
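
The order sensitivity of online Elo shows up even on three votes. The sketch below is a standard Elo update with an assumed K-factor of 32 and starting rating of 1000; both are illustrative choices, not Chatbot Arena's settings.

```python
def online_elo(matches, k=32.0, start=1000.0):
    """Sequential Elo over (winner, loser) pairs, processed in list order."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        # Expected score of the winner under the logistic Elo curve.
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# The same three votes (A wins twice, B wins once) in two different orders.
order_1 = [("A", "B"), ("B", "A"), ("A", "B")]
order_2 = [("A", "B"), ("A", "B"), ("B", "A")]
print(online_elo(order_1))
print(online_elo(order_2))
```

The two runs end with different ratings for A and B although the win counts are identical. A jointly fitted model sees only those counts (A over B twice, B over A once), so both orderings yield the same estimate.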

AlpacaEval ships the length-controlled win rate as its headline metric, not the raw one, an explicit admission that the raw automatic score was gameable by padding (Dubois et al., 2024, arXiv:2404.04475). Internal eval harnesses at frontier labs typically run judges in both orderings by default, use a judge from a different family than the model under test when feasible, and reserve a private, never-published test set for final sign-off, the institutional version of GSM1k.

The contamination-resistant movement is the newest layer. Rather than chasing leaks after the fact, the proposal is to design benchmarks that are hard to contaminate in the first place: generate fresh items on demand, keep canonical answers private, and rotate the public sample. The principle is that a benchmark's integrity should not depend on the test set staying secret, since secrecy never holds.

[IMAGE: Architecture diagram of a production eval stack, with layers for fast judge loop, human calibration, contamination audit, and private final-sign-off set, annotated with cost and latency per layer]

Insights Worth Remembering

A judge that agrees with humans 80% of the time is not a judge that is right 80% of the time. Agreement concentrates on the easy cases where one answer is clearly better, and the remaining 20% tends to cluster on the hard, close cases you most care about.

Position bias and verbosity bias require opposite fixes, and neither fixes the other. A harness that only swaps orderings has solved half the problem and may not know it.

Self-preference is the one bias you cannot subtract after the fact. The defense is architectural: judge with a different lineage than you generated with.

Contamination and capability produce the same high score. The only way to tell them apart is to ask a question you are certain is new, which is why a fresh test set is worth more than a clever post-hoc filter.

A benchmark's fame is its expiry date. The more a test set matters, the faster it leaks into training corpora, so today's hardest benchmark is tomorrow's memorized trivia.

Every reliability gain in evaluation is bought with cost or coverage. There is no free, unbiased, contamination-proof automatic metric, and a vendor claiming one is selling the proxy, not the property.

The moment you optimize a model to win an eval, the eval stops measuring what it measured. Treat any metric you train against as compromised the day you start training against it.

Open Questions

How far can debiased automatic judges be pushed before they hit a hard ceiling on reasoning-heavy tasks? The evidence shows judges match humans on preference-style comparisons, but it is an open question, not a settled one, whether a judge can reliably grade a proof or a subtle factual claim it could not itself produce.

Can contamination be detected from the outside, without access to training data? Methods based on output behavior (memorization probes, perplexity gaps between original and perturbed items, in-context detection) are an active research area, but none is yet a reliable, model-agnostic audit. For closed models whose training data is undisclosed, this remains largely unsolved.

Will contamination-resistant, dynamically generated benchmarks hold up, or will models simply learn the generator's distribution? The approach is promising and increasingly adopted, but whether a model can overfit to the style of a procedural benchmark, the way it once overfit to specific items, is not yet known.

Is a multi-judge panel meaningfully more reliable than a single debiased judge, or does it just average correlated errors? Panels reduce some idiosyncratic bias, but if the judges share training data and architecture, their mistakes correlate, and an average of correlated errors is still an error. Quantifying that residual correlation is open work.
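
One way to make the "average of correlated errors" point concrete, under a simplifying assumption that does not come from any of the cited studies: suppose each of \(k\) judges has a scoring error \(e_i\) with the same variance \(\sigma^2\), and every pair of judges has error correlation \(\rho \ge 0\). Then the variance of the panel's mean error is

\[
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} e_i\right)
= \frac{\sigma^2}{k} + \frac{k-1}{k}\,\rho\,\sigma^2
\;\xrightarrow{\;k \to \infty\;}\; \rho\,\sigma^2 .
\]

With \(\rho = 0\) the error variance shrinks as \(1/k\); with \(\rho > 0\) it never falls below \(\rho\sigma^2\), no matter how many judges are added. A bias that all judges share in the same direction is not reduced at all by averaging.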

How should a leaderboard report uncertainty so that a 1-point gap is not read as a real difference? The Bradley-Terry confidence intervals on Chatbot Arena are a step, but most published comparisons still report point estimates that invite over-interpretation of differences well within noise.
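
As a small illustration of reporting an interval instead of a point estimate, here is a generic percentile bootstrap over pairwise outcomes. It is a sketch under assumed names and made-up data, not the procedure any specific leaderboard uses.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """outcomes: list of 1 (win) / 0 (loss) for one model against another.

    Returns (point_estimate, lower, upper) for a percentile bootstrap
    interval at confidence level 1 - alpha.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical result: 52 wins in 100 comparisons.
point, lower, upper = bootstrap_win_rate_ci([1] * 52 + [0] * 48)
print(point, lower, upper)
```

A 52% win rate over 100 comparisons looks like a lead, but the interval comfortably straddles 50%, so the data do not distinguish the two models.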