Benchmark Decontamination

When GPT-3 was released, OpenAI reported that several evaluation datasets suffered from "methodological issues related to training on large web corpora." The paper flagged the problem honestly, but the decontamination procedure they used - removing 13-gram overlaps between the training set and each benchmark - still left intact any contaminated example that had been paraphrased, translated, or slightly reformatted. That gap between intent and implementation is exactly what benchmark decontamination tries to close.

What contamination actually means

A benchmark tests whether a model can generalise to unseen inputs. Contamination violates that assumption: the model has, during pretraining, processed text that is identical or near-identical to a test question or its answer. Because LLMs are trained on internet-scale corpora assembled from Common Crawl, GitHub, StackExchange, and thousands of other sources, any publicly released dataset is a candidate for inclusion.

Contamination exists on a spectrum:

Level	What leaked	Effect on score
Exact match	Verbatim question + answer	Severe; model recalls answer
Near-duplicate	Minor edits, different punctuation	Moderate; partial recall
Rephrased	Semantically equivalent restatement	Subtle; style changed, content intact
Indirect	Explanatory text referencing the answer	Hard to detect; inflates reasoning scores

The practical damage is asymmetric: a 13B model trained on contaminated data for a specific benchmark can match GPT-4 on that benchmark while performing normally elsewhere. Yang et al. (2023) demonstrated this by injecting rephrased HumanEval and MMLU examples into a fine-tuning set and observing benchmark scores jump to levels inconsistent with the model's general capability.

The standard pipeline: n-gram deduplication

The classical approach, used in GPT-3 and adopted by many subsequent models, works as follows:

Tokenise every benchmark example (question, optionally answer) into a sequence of tokens or words.
Extract all contiguous n-grams of length k (commonly k=8 or k=13).
Build a lookup structure (a Bloom filter or an exact hash set) over these n-grams.
Scan every document in the training corpus; flag any document that contains at least one matching n-gram.
Remove (or quarantine) flagged documents entirely, or excise the contaminating spans.

The choice of k matters. Small k (e.g., k=5) over-flags common phrases and removes benign content; large k (e.g., k=20) misses paraphrases. A value around k=13 is a rough community consensus, though no principled derivation backs that number.

Pseudocode for the n-gram matching step:

def build_ngram_set(examples: list[str], k: int) -> set[tuple]:
    ngrams = set()
    for ex in examples:
        tokens = ex.lower().split()
        for i in range(len(tokens) - k + 1):
            ngrams.add(tuple(tokens[i:i+k]))
    return ngrams

def is_contaminated(doc: str, ngram_set: set, k: int) -> bool:
    tokens = doc.lower().split()
    for i in range(len(tokens) - k + 1):
        if tuple(tokens[i:i+k]) in ngram_set:
            return True
    return False

This runs in O(N) per document and scales to trillion-token corpora with a Bloom filter front-end. The cost is false negatives on paraphrase contamination and false positives on boilerplate phrases.

LLM-based semantic decontamination

String matching cannot catch semantically equivalent reformulations. Yang et al. (2023) showed that 8 to 18 percent of HumanEval problems overlap with RedPajama and StarCoder pretraining corpora even after standard n-gram decontamination - because the contaminating documents rephrase the problem rather than copy it verbatim.

A stronger approach prompts a language model to judge whether a candidate training document could reveal the answer to a benchmark question:

System: You are a data auditor.
User: Benchmark question: "What does the following Python function return
      for input 3? [code snippet]"
      Training document: [candidate text]
      Does reading this document give away the answer? Reply YES or NO with
      one sentence of justification.

This catches paraphrase and translation variants that n-gram methods miss. The cost is substantial: running an LLM judge over billions of candidate pairs requires embedding-based pre-filtering to reduce comparisons to a tractable set (e.g., retrieve the top-k most similar documents per benchmark example using approximate nearest-neighbour search, then invoke the LLM judge only on that shortlist).

The two-stage pipeline looks like:

Benchmark examples
       |
  Embed with a fast encoder (e.g., sentence-transformers)
       |
  ANN search over corpus embeddings
       |
  Top-k candidate documents per example
       |
  LLM judge: contaminated? (YES/NO)
       |
  Remove flagged documents from training set

Practical considerations in corpus construction

Several decisions made upstream of decontamination affect how much contamination there is to remove.

Snapshot timing. If you crawl the web after a benchmark is released (most are public on Hugging Face or Papers With Code), you will pick up pages that reproduce the benchmark. Crawling before release is not always feasible, but tracking which benchmarks existed at crawl time is.

Answer-only vs. question-answer removal. Some practitioners remove documents containing the answer text only, preserving documents that contain the question without the answer. This is defensible for multiple-choice benchmarks (knowing the question does not reveal the correct option) but questionable for open-ended generation tasks.

Benchmark-specific thresholds. Code benchmarks (HumanEval, MBPP) require more aggressive decontamination because exact solutions are widely posted on GitHub and StackOverflow. Mathematical benchmarks (GSM8K, MATH) are slightly more robust because solution steps vary, but step-by-step walkthrough posts still pose a risk.

Held-out validation. After decontamination, measure the fraction of benchmark n-grams that still appear in the corpus. A residual rate above roughly 1 percent warrants investigation. Publishing this figure alongside benchmark results is now considered good practice; Sainz et al. (2023) argue it should be mandatory.

When it falls down

Semantic contamination is invisible to n-gram methods. A Stack Overflow post explaining the exact reasoning needed to solve a GSM8K problem, in different words, will pass every string-matching filter. Without an LLM judge or human review, the contamination is undetected.

Decontamination is not idempotent across versions. A corpus decontaminated against BIG-Bench v1 is not decontaminated against BIG-Bench v2, which adds new tasks. Benchmark suites evolve; the corpus must be re-audited when evaluation sets change.

Synthetic data self-contaminates. Corpora that include model-generated text risk inheriting contamination from the generator. If GPT-4 was used to produce synthetic training data, and GPT-4's training included benchmark answers, those answers may appear in your synthetic documents. Yang et al. (2023) noted exactly this risk when examining synthetically generated datasets.

Decontamination removes signal, not just noise. Aggressive removal can strip legitimate educational content - worked examples of algorithms, chemistry problem solutions, historical reasoning chains - that genuinely helps pretraining. The trade-off between contamination risk and data quality loss is not zero-cost.

Post-hoc detection is unreliable. Membership inference attacks and log-probability probing can sometimes detect whether a specific example was in training data, but false-negative rates are high for models trained on very large corpora. You cannot reverse-engineer a clean benchmark result from a model trained on a contaminated corpus.

What contamination actually means

The standard pipeline: n-gram deduplication

LLM-based semantic decontamination

Practical considerations in corpus construction

When it falls down

Further reading