Auditing Synthetic Data

A typical synthetic data pipeline can generate a million instruction pairs overnight. The speed is intoxicating, and the failure mode is correspondingly quiet: the model trains, the loss curves look fine, and weeks later someone discovers the eval benchmarks were in the training set, or every "diverse" question is a paraphrase of eleven seed examples, or a teacher model's systematic errors have been faithfully reproduced at scale. Auditing is the practice of catching these problems before they cost you a training run.

Why synthetic data needs different checks than curated data

Human-curated data fails in random, idiosyncratic ways. Synthetic data fails in systematic ways, because it is produced by a generator with its own biases, blind spots, and failure modes. Those patterns repeat across every sample the generator touches.

Three categories of failure dominate:

Category	What goes wrong	Observable symptom
Factual corruption	Teacher model hallucinates; distilled answers contain confident errors	Downstream model scores well on style evals, poorly on fact-check evals
Diversity collapse	Generator reuses surface patterns; near-duplicate instructions proliferate	High ROUGE overlap across the set; n-gram diversity falls
Benchmark contamination	Evaluation examples appear verbatim or near-verbatim in training data	Suspiciously high eval scores that do not generalise

Each category calls for a distinct check, and skipping any one can invalidate the others. A dataset can be highly diverse and completely contaminated; it can be factually clean and entirely made of near-duplicates.

The four audit dimensions

1. Factuality signals. For datasets built by distilling from a teacher, you need some ground-truth reference to verify against. Where a reference exists (maths problems, code, verifiable facts), run a symbolic checker or a separate verifier model. Where no reference exists, at minimum compare teacher confidence: low log-probability generations are disproportionately likely to be hallucinated. The paper "Best Practices and Lessons Learned on Synthetic Data" (Liu et al., COLM 2024) frames factuality, fidelity, and unbiasedness as the three pillars any synthetic corpus must satisfy before use.
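The low-confidence heuristic can be sketched as follows, assuming the teacher's API returns per-token log-probabilities with each generation. The record layout (`id`, `token_logprobs`), the function name, and the 10% cutoff are illustrative assumptions rather than part of any particular library.

```python
from statistics import mean

def flag_low_confidence(samples, fraction=0.10):
    """Return ids of the `fraction` of samples with the lowest mean token log-prob.

    `samples` is a list of dicts with hypothetical keys "id" and "token_logprobs".
    Samples with no log-probs are skipped rather than scored.
    """
    scored = [(mean(s["token_logprobs"]), s["id"])
              for s in samples if s["token_logprobs"]]
    if not scored:
        return []
    scored.sort()  # most negative mean (least confident) first
    k = max(1, int(len(scored) * fraction))
    return [sample_id for _, sample_id in scored[:k]]
```

The flagged ids are candidates for a verifier pass or human review, not automatic removals: low confidence is only a signal that correlates with hallucination.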

2. Diversity and coverage. Combine token-level statistics with embedding-based clustering; for the latter, embed all instructions with a small encoder (sentence-transformers is adequate). Measure:

Vocabulary entropy across the instruction set
Cluster analysis: k-means or DBSCAN on embeddings reveals if most samples collapse into a handful of semantic clusters
ROUGE-L self-similarity: the average maximum ROUGE-L score of each sample against the rest of the set; values above roughly 0.7 indicate near-duplication at scale

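A concrete heuristic: a 100k-sample instruction set should support at least 5,000 distinct semantic clusters. Because k-means takes the cluster count as an input, the practical test is to cluster at k of about 5,000 and check that samples spread across the clusters rather than piling into a few. If they do not, that is a red flag.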
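Two of the token-level metrics listed above can be sketched with the standard library alone. This is a minimal illustration that assumes whitespace tokenisation and ROUGE-L computed as an F1 score; the function names are made up for the example, and the pairwise loop is quadratic, so on a 100k set you would run it on a random subsample.

```python
import math
from collections import Counter

def vocabulary_entropy(texts):
    """Shannon entropy (in bits) of the whitespace-token distribution."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def rouge_l_f1(a, b):
    """ROUGE-L F1 between two token lists (0.0 if nothing is shared)."""
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)

def rouge_l_self_similarity(texts):
    """Average over samples of the maximum ROUGE-L F1 against every other sample."""
    toks = [t.lower().split() for t in texts]
    if len(toks) < 2:
        return 0.0
    best = [max(rouge_l_f1(a, b) for j, b in enumerate(toks) if j != i)
            for i, a in enumerate(toks)]
    return sum(best) / len(best)
```

A self-similarity near 1.0 means almost every sample has a near-twin somewhere in the set, which is the pattern the 0.7 guideline is meant to catch.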

3. Deduplication. Near-deduplication at the instruction level matters independently of embedding-space diversity, because exact or near-exact duplicates inflate gradient updates for specific patterns. MinHash locality-sensitive hashing (as used in RefinedWeb and Dolma) is standard: Jaccard similarity above 0.8 on 13-gram shingles typically warrants removal. Instructions shorter than 13 tokens contain no 13-gram at all, so short-instruction sets need shorter shingles. After deduplication the diversity metrics in step 2 often drop sharply, revealing that apparent diversity was hiding behind surface paraphrases.
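To make the MinHash idea concrete, here is a small standard-library sketch of shingling, exact Jaccard similarity, and a MinHash signature whose slot-agreement rate estimates that similarity. It is an illustration under assumptions: whitespace tokens, salted BLAKE2b as the hash family, and 128 hash functions are choices made for the example, and the banding step that turns signatures into a locality-sensitive index is omitted.

```python
import hashlib

def shingles(text, n=13):
    """Set of word n-gram shingles; shorter texts yield a single whole-text shingle."""
    toks = text.lower().split()
    if not toks:
        return set()
    if len(toks) <= n:
        return {" ".join(toks)}
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, num_perm=128):
    """For each of `num_perm` salted hash functions, keep the minimum hash value."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs whose estimated similarity exceeds the 0.8 guideline are removal candidates; confirming them with the exact `jaccard` value is cheap once the candidate list is short.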

4. Benchmark contamination. This is the most consequential check to get right. The procedure:

for each evaluation example e:
    e_tokens = normalise(e)
    for each training candidate t:
        score = lcs_length(e_tokens, normalise(t))
        if score / len(e_tokens) > threshold:
            flag t for removal

Thresholds vary by domain; 0.6 on token-normalised LCS is a reasonable starting point for open-domain QA. For code and maths, even short snippet matches warrant investigation. The stakes are illustrated by "The False Promise of Imitating Proprietary LLMs" (Gudibande et al., 2023), which found that crowd workers rated imitation models as competitive with ChatGPT while the models closed little of the gap on tasks not well covered by the imitation data. Surface evaluations already flatter a distilled model; a benchmark that has leaked into the distilled dataset flatters it further.
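The scanning procedure above can be written as a short runnable sketch. The normalisation rule (lowercase, punctuation stripped, whitespace tokens) and the function names are assumptions made for illustration; the 0.6 default mirrors the open-domain QA threshold mentioned in the text.

```python
import re

def normalise(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def find_contaminated(eval_examples, train_candidates, threshold=0.6):
    """Indices of training candidates whose LCS with any evaluation example
    covers more than `threshold` of that example's tokens."""
    eval_tokens = [toks for toks in map(normalise, eval_examples) if toks]
    flagged = []
    for idx, candidate in enumerate(train_candidates):
        cand_tokens = normalise(candidate)
        if any(lcs_length(e, cand_tokens) / len(e) > threshold for e in eval_tokens):
            flagged.append(idx)
    return flagged
```

The double loop costs time proportional to the product of the two set sizes and the token lengths, so at scale it is usually run only on candidates that first share some n-gram with an evaluation example.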

Auditing the constitutional and rejection-sampling pipelines

Self-play and constitutional AI loops (see the companion concepts on those topics) introduce a subtlety: the generator and the auditor can share the same model weights. A model that systematically hallucinates a type of fact will also fail to catch that hallucination in its own critic role. Two mitigations apply:

Cross-model verification. Use a different model family for the verifier than for the generator. If Llama generated the dataset, audit with a Mistral-based critic, not another Llama variant.
Held-out human spot-checks. Even at 0.5% sampling, a human reviewing 500 items from a 100k set will catch systematic failure modes that automated checks miss. The goal is not full coverage but calibration: if 12% of the spot-check sample is factually wrong, the automated checker is missing something.
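To read a spot-check as calibration rather than a point estimate, it helps to put an interval around the observed error rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; picking it here, along with the function name, is an assumption of this example and not something the pipeline prescribes.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

low, high = wilson_interval(errors=60, n=500)
```

For 60 errors in 500 reviewed items (the 12% case), the interval runs from roughly 9% to 15%. If the automated checker reports an error rate well below that range, the two are measuring different things and the checker deserves a closer look.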

For rejection-sampling pipelines specifically, check the rejection rate distribution. If 98% of generations are accepted, either the reward model has low discrimination or the task is so easy it produces no learning signal. Either way the resulting data is suspect.
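A minimal sketch of that check, assuming the pipeline logs one `(prompt_id, accepted)` pair per generation. The log format, the function name, and the returned keys are hypothetical; the 0.98 default mirrors the figure in the text.

```python
from collections import defaultdict

def acceptance_summary(records, high=0.98):
    """Summarise acceptance overall and per prompt from (prompt_id, accepted) pairs."""
    totals = defaultdict(lambda: [0, 0])  # prompt_id -> [accepted, seen]
    for prompt_id, accepted in records:
        totals[prompt_id][0] += int(accepted)
        totals[prompt_id][1] += 1
    per_prompt = {pid: acc / seen for pid, (acc, seen) in totals.items()}
    seen_total = sum(seen for _, seen in totals.values())
    accepted_total = sum(acc for acc, _ in totals.values())
    overall = accepted_total / seen_total if seen_total else 0.0
    return {
        "overall_rate": overall,
        "per_prompt": per_prompt,
        "suspiciously_high": overall >= high,
    }
```

The per-prompt rates matter as much as the overall figure: a moderate average can hide a split between prompts where everything is accepted and prompts where nothing is.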

Contamination from the teacher model's training data

A subtlety that trips up many practitioners: the teacher model used for distillation has very likely seen public evaluation benchmarks during its own pretraining. When you ask it to generate examples "like MMLU" or "similar to GSM8K", it can reproduce benchmark items verbatim or in close paraphrase even without explicit access to the benchmark files. This indirect contamination is harder to detect with exact-match scanning, because paraphrased items share structure rather than strings. The practical defence is to audit not just against benchmark questions but against common benchmark formats and templates: distinctive multi-choice lettering schemes, characteristic problem structures, or recurrent proper nouns that flag GSM8K-style word problems.
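A template scan of this kind can be as simple as a handful of regular expressions. The two patterns below are hypothetical examples of format signatures (four-option A to D lettering and a trailing "Answer: X" line); they are not taken from any benchmark's official specification and would need extending after inspecting the formats you care about.

```python
import re

# Hypothetical format signatures; a hit means "review this", not "contaminated".
TEMPLATE_PATTERNS = {
    "multiple_choice_lettering": re.compile(
        r"(?s)\bA[.)]\s.+?\bB[.)]\s.+?\bC[.)]\s.+?\bD[.)]\s"
    ),
    "answer_marker": re.compile(r"(?im)^\s*answer\s*:\s*\(?[A-D]\)?\s*$"),
}

def template_hits(text):
    """Names of the template patterns that match `text`, sorted alphabetically."""
    return sorted(name for name, pattern in TEMPLATE_PATTERNS.items()
                  if pattern.search(text))
```

Multiple-choice formatting is common outside benchmarks too, so the useful signal is the rate of hits across the dataset and the specific items behind them, which can then go through the LCS scan or a human check.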

When it falls down

The verifier is the generator. Constitutional AI loops and RLAIF critics trained on the same base model will share blind spots. If the base model has a systematic bias, the critic inherits it. No deduplication or diversity check catches this.

Scale defeats human checks. A fixed-size random spot-check still estimates the overall error rate of a 1B-sample corpus, but no reasonable sampling rate will surface failure modes confined to small slices of it. Automated checks at that scale are necessary but have their own systematic gaps (LCS thresholds are heuristics, embedding models have their own biases). There is no substitute for starting with a smaller, thoroughly audited seed set.

Self-consuming loops accumulate errors. Alemohammad et al. (2023) showed empirically that iterative training on synthetic data without enough fresh real data in each generation progressively degrades quality (precision) or diversity (recall); they term this Model Autophagy Disorder (MAD). Auditing a single generation's data is insufficient; you need a longitudinal view across pipeline iterations. Track aggregate diversity metrics and factuality rates across versions.
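Longitudinal tracking can start as a simple comparison of per-version metric snapshots. The sketch below assumes each pipeline iteration stores a dict of metrics where higher is better; the metric names, the function name, and the 5% relative tolerance are illustrative assumptions.

```python
def flag_regressions(history, tolerance=0.05):
    """Flag metrics that drop by more than `tolerance` (relative) between
    consecutive versions.

    `history` is a list of {metric_name: value} dicts, oldest first.
    Returns sorted (version_index, metric, previous, current) tuples.
    """
    flags = []
    for i in range(1, len(history)):
        prev, cur = history[i - 1], history[i]
        for metric in prev.keys() & cur.keys():
            if prev[metric] > 0 and (prev[metric] - cur[metric]) / prev[metric] > tolerance:
                flags.append((i, metric, prev[metric], cur[metric]))
    return sorted(flags)
```

Comparing only consecutive versions can miss slow drift, so it is worth also running the same comparison between the first snapshot and the latest one.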

Style audit passes, capability audit fails. Human raters and LLM-as-judge evaluators score stylistic quality reliably but miss factual errors and capability gaps. Gudibande et al. (2023) demonstrated this explicitly: imitation models scored comparably to ChatGPT in crowdsourced human evals while failing substantially on knowledge-intensive tasks. An audit that only asks "does this look good?" will greenlight a corrupt dataset.

Contamination scanners are not recall-complete. Any threshold-based contamination scanner has recall gaps. The safe default is to treat the audit as a lower bound on contamination and build margins into your evaluation design (hold out evaluation sets that were never available to the generator, including indirectly via the teacher's training).
