Quality Filtering of Synthetic Data

Generating a million synthetic instruction pairs is fast and cheap with a capable teacher model. Training on all of them indiscriminately often makes your model worse, not better. The problem is not data volume; it is signal density. A synthetic dataset is only as good as the proportion of its examples that are genuinely informative, diverse, and free from the teacher's own failure modes.

This section covers the principal filtering techniques, why each one works, and the failure modes that no filter fully eliminates.

Why Raw Synthetic Output Needs Filtering

A language model asked to generate training data will do so fluently, but fluency is not correctness. Three pathologies are endemic to synthetic pipelines:

Duplication. Auto-regressive generation reuses phrasings and sentence patterns. Near-duplicate pairs add no new information and skew training toward a narrow slice of the distribution. Lee et al. (2022, arXiv:2107.06499) showed that deduplicating training corpora reduces verbatim memorisation by roughly 10x and lets models reach the same or better accuracy in fewer training steps.
Teacher-ceiling contamination. Distilled datasets inherit the teacher's systematic errors. If GPT-4 confidently hallucinates a citation on a topic, every student trained on that example learns to hallucinate the same citation.
Format collapse. Instruction generators often over-produce one response style (bullet lists, JSON, numbered steps) because that style had high reward in the teacher's training. The student learns surface formatting more than underlying capability.

Filtering does not eliminate these problems, but it bounds their impact.

Heuristic Filters: Fast and Coarse

The cheapest filters are rule-based and run before any model inference:

Filter type	What it catches	Typical threshold
Length cutoff	One-word answers; truncated responses	min 20 tokens, max 2048
ROUGE-L deduplication	Near-duplicate (instruction, response) pairs	ROUGE-L > 0.7 treated as duplicate
Keyword/regex blocklist	Refusals ("As an AI, I cannot..."), boilerplate	Exact-match list
Language-ID	Non-target-language outputs	Keep if fastText LangID confidence > 0.9
Format ratio	Responses that are >80% markdown syntax	Empirical threshold
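
As a rough illustration, three of the rules in the table could be combined as below. This is a minimal sketch, not a reference implementation: the refusal patterns, the set of characters counted as markdown syntax, and the use of whitespace splitting as a stand-in for real tokenisation are all assumptions made for the example.

```python
import re

# Hypothetical refusal patterns; a production blocklist would be longer
# and tuned to the teacher model's actual refusal phrasing.
REFUSAL_PATTERNS = [
    re.compile(r"\bas an ai\b", re.IGNORECASE),
    re.compile(r"\bi (?:cannot|can't|am unable to)\b", re.IGNORECASE),
]

# Characters treated as markdown syntax for the format-ratio check (an assumption).
MARKDOWN_CHARS = set("#*_`|>-[]()")


def markdown_ratio(text):
    """Fraction of non-whitespace characters that are markdown syntax."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 1.0
    return sum(c in MARKDOWN_CHARS for c in visible) / len(visible)


def passes_heuristics(response, min_tokens=20, max_tokens=2048,
                      max_markdown_ratio=0.8):
    """Return True if the response survives all three rule-based checks."""
    # Whitespace split is a crude proxy for a real tokenizer's count.
    n_tokens = len(response.split())
    if not (min_tokens <= n_tokens <= max_tokens):
        return False
    if any(p.search(response) for p in REFUSAL_PATTERNS):
        return False
    if markdown_ratio(response) > max_markdown_ratio:
        return False
    return True
```

Patterns this broad will also reject some legitimate answers (for instance one containing "I cannot stress enough"), so it is worth spot-checking what the blocklist removes before trusting it.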

Self-Instruct (Wang et al., 2023, arXiv:2212.10560) used ROUGE-L overlap against the existing pool to discard instructions too similar to ones already retained, keeping diversity high with a single cheap metric.
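
A ROUGE-L novelty check of this kind can be sketched in pure Python. The version below is an illustrative assumption rather than the Self-Instruct code: it scores lowercased, whitespace-split tokens with the F-measure form of ROUGE-L, and the function names are invented for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_novel(candidates, pool, threshold=0.7):
    """Keep candidates whose ROUGE-L against everything retained so far is below threshold."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(cand, seen) < threshold for seen in pool + kept):
            kept.append(cand)
    return kept
```

Each candidate is compared against the whole pool, so the cost grows quadratically with pool size; that is acceptable for tens of thousands of instructions but not for millions.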

The limitation is obvious: heuristics do not measure correctness. A confident, well-formatted, on-topic answer that is factually wrong passes every rule-based filter.

Reward Model and LLM-as-Judge Scoring

For instruction-following datasets, a trained reward model (RM) can score each (instruction, response) pair so that the lowest-scoring fraction can be discarded. This is how rejection-sampling fine-tuning (RFT) works in practice: generate K candidates per prompt, score all K, and keep either every candidate above a threshold or only the top-scoring one.
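
The generate-score-select loop might look like the sketch below. Both `generate` and `score` are hypothetical callables standing in for a sampling call and a reward model; neither name comes from a real library.

```python
def rejection_sample(prompt, generate, score, k=4, threshold=None):
    """Sample k candidates for one prompt and keep the best ones.

    generate(prompt) -> str and score(prompt, response) -> float are
    placeholders for your own sampling and reward-model calls.
    With threshold=None only the top-1 candidate is returned; otherwise
    every candidate scoring at least `threshold` is returned, best first.
    """
    candidates = [generate(prompt) for _ in range(k)]
    scored = sorted(
        ((score(prompt, c), c) for c in candidates),
        key=lambda sc: sc[0],
        reverse=True,
    )
    if threshold is None:
        return [scored[0][1]]
    return [c for s, c in scored if s >= threshold]
```

Top-1 selection always yields one example per prompt, even when every candidate is poor; the threshold variant can return nothing for a hard prompt, which is often the safer behaviour.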

The scoring signal can come from:

A dedicated RM trained on human preference comparisons (the standard RLHF approach from Ouyang et al., arXiv:2203.02155).
LLM-as-judge, where a strong model evaluates each response on a rubric and returns a numerical rating. Magpie (arXiv:2406.08464) used this pattern: 4 million synthetic instructions were generated from Llama-3-Instruct, then scored and filtered down to 300K for fine-tuning - the smaller curated set outperformed the full unfiltered one on standard benchmarks.
Self-curation, where the same model that generated the data rates its own outputs. Instruction backtranslation (Li et al., arXiv:2308.06259) is the cleanest example: the model scores each (web document, synthetic instruction) pair on a 1-5 scale and only the highest-rated examples enter training.
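
For the judge-based and self-curation variants, most of the plumbing is parsing a rating out of free-form model output. The sketch below assumes a hypothetical convention in which the judge ends its answer with a line such as "Score: 4"; the `judge` callable and the output format are assumptions, not part of any of the cited pipelines.

```python
import re

# Assumed judge output convention: "... Score: <1-5>".
SCORE_RE = re.compile(r"score:\s*([1-5])\b", re.IGNORECASE)


def parse_rating(judge_output):
    """Return the 1-5 rating found in the judge's text, or None if absent or out of range."""
    match = SCORE_RE.search(judge_output)
    return int(match.group(1)) if match else None


def keep_top_rated(examples, judge, min_rating=5):
    """Keep examples the judge rates at or above min_rating.

    judge(example) -> str is a placeholder for an LLM call that returns
    rubric-based feedback ending in a score line.
    """
    kept = []
    for example in examples:
        rating = parse_rating(judge(example))
        # Unparseable judge output is treated as a rejection, not a pass.
        if rating is not None and rating >= min_rating:
            kept.append(example)
    return kept
```

Treating an unparseable rating as a rejection keeps malformed judge outputs from leaking unvetted examples into the training set.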

Pseudo-code for a threshold filter using an RM:

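def filter_by_rm(pairs, reward_model, threshold=0.6):
    """Keep (instruction, response) pairs whose RM score is at least `threshold`."""
    kept = []
    for instruction, response in pairs:
        # reward_model.score is a placeholder for whatever scoring call your RM exposes.
        score = reward_model.score(instruction, response)
        if score >= threshold:
            kept.append((instruction, response))
    return kept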

The key design choice is whether to use a fixed threshold or a percentile cutoff. A fixed threshold fails when the RM's score distribution shifts across topics; a percentile cutoff (e.g., top 30%) is more robust to such shifts but discards a fixed fraction regardless of absolute quality, so it drops good examples from a strong batch and keeps weak ones from a poor batch.
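
One way to get the robustness of a percentile cutoff is to apply it within each topic rather than globally. The sketch below assumes each example already carries a topic label, which is a hypothetical field that would have to come from a classifier or from the generation prompt.

```python
import math
from collections import defaultdict


def filter_top_fraction(scored, keep_fraction=0.3):
    """Keep the top `keep_fraction` of examples within each topic.

    scored: iterable of (topic, rm_score, pair) tuples. The topic label
    is an assumed field, not something a reward model provides.
    """
    by_topic = defaultdict(list)
    for topic, score, pair in scored:
        by_topic[topic].append((score, pair))

    kept = []
    for items in by_topic.values():
        items.sort(key=lambda sp: sp[0], reverse=True)
        # Always keep at least one example so no topic disappears entirely.
        n_keep = max(1, math.ceil(keep_fraction * len(items)))
        kept.extend(pair for _, pair in items[:n_keep])
    return kept
```

With a fixed threshold of 0.6, a topic whose scores all sit around 0.3 would vanish from the dataset; the per-topic cutoff keeps its best examples instead, at the cost of admitting lower absolute scores.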

Difficulty-Aware Selection: IFD Scoring

A subtler issue is that easy examples waste gradient steps. If the student model already answers a prompt correctly, training on that pair moves the weights very little. Filtering for difficulty - keeping examples the current student finds hard - improves data efficiency.

The Instruction Following Difficulty (IFD) score operationalises this idea. For each pair (x, y), compute:

IFD(x, y) = PPL_model(y | x) / PPL_model(y)

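Here PPL is perplexity under the current student. Because the instruction normally makes the response easier to predict, the ratio usually falls below 1. A ratio close to 1 means the instruction gives the model little help in predicting the response, which suggests the instruction carries conditioning the model has not yet learned to use. Low-IFD pairs, where the instruction already makes the response easy to predict, are filtered out. A ratio above 1 means the instruction actually made the response harder to predict, which more often signals a mismatched pair than a hard one, so such pairs are usually discarded as well.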
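
Given per-token log-probabilities for the response, the ratio above is a few lines of arithmetic. The sketch assumes you have already run the student twice per pair (once with the instruction in context, once without) and collected natural-log probabilities for the response tokens only; how those are obtained depends on your inference stack and is not shown.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(logprobs_given_instruction, logprobs_alone):
    """IFD(x, y) = PPL(y | x) / PPL(y), both computed over the same response tokens."""
    return perplexity(logprobs_given_instruction) / perplexity(logprobs_alone)


def select_by_ifd(scored_pairs, min_ifd):
    """Keep pairs whose IFD is at least min_ifd.

    scored_pairs: iterable of (pair, ifd) tuples; min_ifd is an
    empirical cutoff you would tune on your own data.
    """
    return [pair for pair, ifd in scored_pairs if ifd >= min_ifd]
```

For example, if every response token has probability 0.5 with the instruction and 0.25 without it, the perplexities are 2 and 4 and the IFD is 0.5.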

This is consistent with the LIMA finding (Zhou et al., arXiv:2305.11206) that a model fine-tuned on 1,000 carefully curated, diverse, high-quality examples can match or beat models trained on far larger instruction datasets. LIMA did not select examples by difficulty, so it supports the general principle that a small, well-chosen set can carry most of the useful training signal rather than IFD scoring specifically.

Deduplication at Scale

Near-duplicate removal is not just about efficiency; it is about preventing the model from memorising surface forms. The standard approach is MinHash locality-sensitive hashing (LSH), which approximates Jaccard similarity across character n-grams:

1. Tokenise each document into character 5-grams.
2. Compute a MinHash signature (typically 128 hash functions).
3. Use LSH banding to find candidate pairs with signature similarity above a threshold (e.g., 0.8).
4. For each duplicate cluster, retain one representative.
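
The four steps can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not the pipeline from any cited paper: it uses salted BLAKE2b as the hash family, 32 bands of 4 rows (chosen for high recall, with candidates then verified against the full signature), and keeps the first document of each cluster. At real scale a dedicated library would be the more usual choice.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128
BANDS = 32                   # assumed banding: 32 bands x 4 rows
ROWS = NUM_HASHES // BANDS


def shingles(text, n=5):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(shingle, seed):
    # One salted hash function per seed, truncated to 64 bits.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """Minimum hash value over all shingles, for each of NUM_HASHES hash functions."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures):
    """Pairs of document indices that share at least one identical band."""
    pairs = set()
    for b in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[b * ROWS:(b + 1) * ROWS])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Cluster near-duplicates and return the first document of each cluster."""
    sigs = [minhash_signature(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in candidate_pairs(sigs):
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
            ri, rj = find(i), find(j)
            parent[max(ri, rj)] = min(ri, rj)   # lowest index becomes the root
    return [d for i, d in enumerate(docs) if find(i) == i]
```

Banding only proposes candidates; the explicit `estimated_jaccard` check is what enforces the 0.8 threshold, so looser banding costs extra comparisons rather than false merges.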

Lee et al. (arXiv:2107.06499) reported that deduplicating C4 and other corpora removes a substantial amount of repeated text (one 61-word sentence appears more than 60,000 times in C4, for instance) and produces models that memorise less, with no loss in downstream accuracy.

For synthetic data specifically, deduplication should happen at the instruction level, the response level, and across (instruction, response) pairs jointly, because the same prompt can appear with distinct responses that are themselves duplicates.

When It Falls Down

Reward hacking against the filter. If you train on RM-filtered data and then iterate (train a new RM on the updated model's outputs), the student quickly learns to maximise RM scores rather than task quality. This is the core instability of Constitutional-AI-style loops and other self-improvement pipelines: each filtering step is only as good as the alignment between the filter's objective and the actual task objective.

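Distribution shift between teacher and target. Phi-1 (Gunasekar et al., arXiv:2306.11644) demonstrated that filtering code data for "textbook quality" with a classifier can yield a strong 1.3B-parameter coding model from about 6B tokens of filtered data plus roughly 1B tokens of synthetic textbooks and exercises. But the classifier was trained on GPT-4 annotations of educational value for a small sample of files, so it inherits the annotator's blind spots and cannot be expected to generalise reliably to domains far from its training distribution.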

No filter catches all factual errors. LLM-as-judge ratings track fluency and instruction-following more closely than they track factual accuracy, especially in specialised domains where the judge itself lacks the knowledge to check a claim. A hallucinated statistic in a well-structured response can easily outscore a correct but awkwardly phrased one.

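Filtering can introduce bias. Dodge et al. (arXiv:2104.08758) found that the blocklist filter used to build C4 disproportionately removed text written in African American English and text about minority identities. Learned quality classifiers trained mostly on majority-web data carry a similar risk of scoring minority-dialect text as "low quality". Applying either kind of filter to synthetic data without demographic auditing silently narrows the style distribution.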

Collapse risk in iterative loops. When a model's own filtered outputs become the next round's training data, the filtering criteria determine what gets amplified. Small biases in the filter compound across generations, eventually producing a model that confidently generates a narrow style even when it is wrong - the synthetic data equivalent of mode collapse in GANs.