Quality Filtering with Classifiers

Much of the text in a raw Common Crawl dump is noise: scraped navigation menus, garbled OCR, templated boilerplate, and gibberish. Heuristic rules catch the obvious cases, but the hard problem is distinguishing a coherent, informative paragraph from one that merely passes surface-level checks. Classifier-based quality filtering addresses this directly: train a model to discriminate between high-quality reference text and raw web text, then use its scores as a continuous quality signal at pipeline scale.

The Core Idea: Reference Corpora as Proxy Labels

Every quality classifier needs a notion of "good". The standard construction borrows it from corpora that already went through human editorial processes: Wikipedia, books, or peer-reviewed papers. Documents from these sources become positive examples; documents sampled randomly from the web become negative examples. A classifier trained on this binary task learns to approximate the latent quality signal those sources embody.

CCNet (Wenzek et al., 2020) popularised this pattern. It trained a 5-gram language model (KenLM) on Wikipedia, scored each web document by its perplexity under that model, and bucketed documents by score: low-perplexity documents, the ones that read most like Wikipedia, were retained, and high-perplexity ones were discarded. (fastText appears in CCNet as well, but for language identification rather than quality scoring.) The same principle underlies the quality filter in GPT-3's data pipeline: a binary classifier trained on curated text (WebText, built from outbound Reddit links with a minimum karma score as a rough quality proxy, alongside Wikipedia and a books corpus) versus raw Common Crawl, with documents retained according to their classifier score.
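The perplexity-bucketing idea can be sketched in a few lines. This is a toy illustration only: it swaps in an add-one-smoothed unigram model where CCNet used a 5-gram model, and the function names and the equal-thirds split are assumptions made for the example.

```python
import math
from collections import Counter


def train_unigram(reference_docs):
    """Count words in the reference corpus (a stand-in for a real n-gram LM)."""
    counts = Counter(w for doc in reference_docs for w in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(doc, counts, total):
    """Per-word perplexity under the add-one-smoothed unigram model."""
    words = doc.lower().split()
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))


def bucket_by_perplexity(docs, counts, total):
    """Split documents into head / middle / tail thirds, lowest perplexity first."""
    ranked = sorted(docs, key=lambda d: perplexity(d, counts, total))
    third = math.ceil(len(ranked) / 3)
    return {
        "head": ranked[:third],
        "middle": ranked[third:2 * third],
        "tail": ranked[2 * third:],
    }


reference = ["the cat sat on the mat", "the dog sat on the rug"]
web_docs = [
    "zxqv lorem click subscribe",
    "the dog likes the mat",
    "the cat sat on the rug",
]
counts, total = train_unigram(reference)
buckets = bucket_by_perplexity(web_docs, counts, total)
```

In this toy run the document built entirely from reference vocabulary lands in the head bucket and the gibberish line lands in the tail.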

The key insight is that you never directly label web documents as good or bad. You use editorial proxies, and the classifier generalises from those proxies to the rest of the web.
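A minimal sketch of the proxy-label construction, assuming a hashed bag-of-n-grams logistic regression trained with plain SGD. The bucket count, learning rate, and the tiny inline corpora are illustrative assumptions, not settings from any published pipeline.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 18  # hashed feature space; size chosen for illustration


def featurize(text, n=2):
    """Hash word n-grams (orders 1..n) into a sparse count vector."""
    words = text.lower().split()
    counts = Counter()
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            gram = " ".join(words[i:i + size])
            counts[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS] += 1
    return counts


def predict(weights, bias, feats):
    """Probability that the document resembles the reference corpus."""
    z = bias + sum(weights[j] * c for j, c in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=50, lr=0.1):
    """SGD logistic regression; examples are (text, label) pairs."""
    weights = [0.0] * NUM_BUCKETS
    bias = 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        for feats, label in data:
            err = predict(weights, bias, feats) - label
            bias -= lr * err
            for j, c in feats.items():
                weights[j] -= lr * err * c
    return weights, bias


# Positives come from an edited reference source, negatives from the raw web.
reference_docs = [
    "the river flows north through the valley before reaching the sea",
    "the treaty was signed in 1648 and ended the long war",
]
raw_web_docs = [
    "click here to subscribe login register home contact",
    "buy cheap best price free shipping click here now",
]
examples = [(d, 1) for d in reference_docs] + [(d, 0) for d in raw_web_docs]
weights, bias = train(examples)

score_good = predict(weights, bias, featurize("the war ended when the treaty was signed"))
score_bad = predict(weights, bias, featurize("click here for free shipping login now"))
```

No web document is ever hand-labelled here: the labels come entirely from which corpus a document was drawn from, which is the point of the construction.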

Lightweight Architectures Matter at Scale

Processing trillions of tokens means the classifier must be fast. Three approaches dominate:

Family	Example	Throughput	Typical use
N-gram models (KenLM language model, fastText or logistic-regression classifier)	CCNet perplexity filter	Millions of docs/minute on CPU	First-pass, language-aware filtering
Small neural classifier	BERT-base trained on quality labels	~50k docs/min on GPU	Second-pass, higher precision
Importance resampling (DSIR)	Hashed n-gram features + KL weighting	100M docs in ~4.5 hours	Distribution-matching to a target

For a 15-trillion-token corpus, per-document cost is what matters: every document must be scored, so even a 20% slowdown in the classifier adds 20% to the wall-clock time and compute bill of the entire filtering pass. FastText and n-gram logistic classifiers run on CPU clusters with no GPU cost, making them the practical default for a first-pass filter. BERT-scale classifiers are reserved for high-value subsets where precision matters more than throughput.
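A back-of-envelope calculation makes the gap concrete. The average document length (1,000 tokens) and the specific CPU figure (2 million docs/minute) are assumptions picked for illustration; the GPU figure reuses the rough throughput from the table above.

```python
def filter_hours(total_tokens, tokens_per_doc, docs_per_minute, workers=1):
    """Wall-clock hours to score a corpus at a given per-worker throughput."""
    docs = total_tokens / tokens_per_doc
    return docs / (docs_per_minute * workers) / 60


TOTAL_TOKENS = 15e12
TOKENS_PER_DOC = 1_000  # assumed average document length, purely illustrative

# Single-worker estimates under the assumed throughputs.
gpu_hours = filter_hours(TOTAL_TOKENS, TOKENS_PER_DOC, docs_per_minute=50_000)
cpu_hours = filter_hours(TOTAL_TOKENS, TOKENS_PER_DOC, docs_per_minute=2_000_000)

# A 20% slower classifier adds 20% to the whole pass.
gpu_hours_slower = filter_hours(TOTAL_TOKENS, TOKENS_PER_DOC, docs_per_minute=50_000 * 0.8)
```

Under these assumptions the neural classifier needs about 5,000 single-GPU hours against roughly 125 hours for the CPU classifier, and the 20% slowdown costs a further 1,250 GPU hours.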

Scoring Strategies: Hard Cutoff vs. Soft Resampling

Once you have per-document scores, you must decide how to use them. Two strategies exist:

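Hard threshold filtering. Documents scoring below a fixed cutoff are dropped. CCNet partitioned documents into three perplexity buckets (head, middle, tail), and downstream users commonly keep only the head. GPT-3's pipeline used a softened variant: a document was kept when a Pareto-distributed random draw exceeded one minus its classifier score, so most high-scoring documents survived along with a small number of low-scoring ones. A pure hard cutoff is simple and fast, but binary: a document scoring just below the cutoff is treated identically to one at the 1st percentile.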
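The two retention rules can be sketched side by side. The helper names are hypothetical; the stochastic rule follows the form given in the GPT-3 paper's appendix (keep when `np.random.pareto(alpha) > 1 - score`, with alpha = 9), rewritten here with the standard library.

```python
import random


def hard_threshold(scored_docs, keep_fraction):
    """Keep the top `keep_fraction` of (doc, score) pairs by score."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    keep = int(len(ranked) * keep_fraction)
    return [doc for doc, _ in ranked[:keep]]


def pareto_keep(score, alpha=9.0, rng=random):
    """Stochastic retention: high scores nearly always pass, low scores rarely."""
    # random.paretovariate draws start at 1; subtracting 1 gives the
    # zero-based (Lomax) form that numpy.random.pareto returns.
    return rng.paretovariate(alpha) - 1.0 > 1.0 - score


scored = [("a", 0.9), ("b", 0.2), ("c", 0.6), ("d", 0.4)]
top_half = hard_threshold(scored, keep_fraction=0.5)

rng = random.Random(0)
kept_high = sum(pareto_keep(0.95, rng=rng) for _ in range(2000))
kept_low = sum(pareto_keep(0.05, rng=rng) for _ in range(2000))
```

With alpha = 9, a document scoring 0.95 is kept roughly two times in three, while one scoring 0.05 survives well under 1% of the time, so the stochastic rule lets a trickle of low-scoring text through instead of cutting it off entirely.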

Importance resampling (DSIR). Rather than a hard cut, DSIR (Xie et al., NeurIPS 2023) assigns each document an importance weight: the ratio of its likelihood under a target distribution (for example Wikipedia and books) to its likelihood under the raw web distribution. Documents are then sampled without replacement with probability proportional to their weights. This preserves more of the web's diversity while biasing the mixture toward the target distribution, and avoids the cliff-edge behaviour of hard thresholds.

A rough sketch of the DSIR importance weight for document \(x\):

\[w(x) = \frac{p_{\text{target}}(x)}{p_{\text{web}}(x)}\]

In practice both densities are approximated using hashed n-gram feature counts, making the ratio tractable without a neural forward pass. KL reduction - the decrease in KL divergence between the sampled corpus and the target - serves as a downstream-correlated quality metric.
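A minimal sketch of the weight computation and resampling step, assuming bag-of-hashed-n-grams models with add-one smoothing and Gumbel top-k selection. The bucket count, the word-level unigram and bigram features, and the function names are illustrative assumptions rather than the reference implementation.

```python
import math
import random
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # illustrative hashed feature space


def hashed_counts(text):
    """Hash word unigrams and bigrams into bucket counts."""
    words = text.lower().split()
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def fit_bucket_log_probs(docs):
    """Add-one-smoothed log-probability of each hash bucket over a corpus."""
    totals = Counter()
    for doc in docs:
        totals.update(hashed_counts(doc))
    denom = sum(totals.values()) + NUM_BUCKETS
    return [math.log((totals[j] + 1) / denom) for j in range(NUM_BUCKETS)]


def log_importance_weight(text, log_p_target, log_p_web):
    """log w(x) = log p_target(x) - log p_web(x) under the bag-of-buckets model."""
    return sum(c * (log_p_target[j] - log_p_web[j])
               for j, c in hashed_counts(text).items())


def gumbel_top_k(docs, log_weights, k, rng):
    """Draw k documents without replacement, favouring high weights."""
    keyed = []
    for doc, lw in zip(docs, log_weights):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        keyed.append((lw - math.log(-math.log(u)), doc))
    keyed.sort(reverse=True)
    return [doc for _, doc in keyed[:k]]


target = ["the treaty was signed by the king", "the river flows to the sea"]
web = [
    "click here buy now free shipping",
    "the king signed the treaty",
    "login register subscribe click here",
]
log_p_target = fit_bucket_log_probs(target)
log_p_web = fit_bucket_log_probs(web)
log_w = [log_importance_weight(d, log_p_target, log_p_web) for d in web]
sample = gumbel_top_k(web, log_w, k=1, rng=random.Random(0))
```

Working in log space keeps the ratio numerically stable for long documents, and the Gumbel noise turns a deterministic top-k into a weighted draw, which is what preserves some lower-weight documents in the sample.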

LLM-Assisted Annotation for Richer Labels

A newer pattern, exemplified by FineWeb-Edu (Penedo et al., 2024), extends the binary reference-corpus framing into a multi-score regime. Instead of "Wikipedia vs. web", an LLM (typically a capable chat model; FineWeb-Edu used Llama-3-70B-Instruct) is prompted to rate a sample of documents on a fine-grained scale - for example, 0 to 5 for educational value. A small classifier is then trained on that annotated sample to reproduce the LLM's scores and applied across the full corpus, because running the LLM on every document is prohibitively expensive. FineWeb-Edu used a regression head on top of a text-embedding model; a fine-tuned DistilBERT or a gradient-boosted model on text features would fill the same role.

The result is a classifier that approximates a much richer quality signal than binary labels allow. FineWeb-Edu filtered 15 trillion FineWeb tokens down to 1.3 trillion high-educational-value tokens; models trained on the filtered subset showed substantially improved performance on knowledge-intensive benchmarks like MMLU and ARC.

The practical pipeline looks like this:

1. Sample ~400k documents from the web corpus
2. Score each with a capable LLM (e.g. GPT-4 or Claude, scoring 0-5)
3. Train a small fast classifier on these (doc_features -> quality_score)
4. Run the fast classifier over all 15T tokens
5. Retain documents scoring >= threshold (e.g. >= 3)

The LLM annotation step is a one-time cost; the small classifier amortises it across the full corpus.
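The five steps above can be sketched end to end on toy data. Everything here is a stand-in: `annotate_with_llm` is a hypothetical placeholder that returns hard-coded scores where a real pipeline would call a chat model, and the student is a small linear regressor on hashed word features rather than an embedding-based classifier.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 16  # illustrative hashed feature space


def annotate_with_llm(docs):
    """Hypothetical placeholder for steps 1-2: returns fixed 0-5 scores."""
    canned = {
        "photosynthesis converts light energy into chemical energy in plants": 5,
        "a prime number has exactly two positive divisors": 4,
        "click here to win a free prize today": 0,
        "best deals on shoes buy now": 1,
    }
    return [(doc, canned[doc]) for doc in docs]


def features(text):
    """L2-normalised hashed bag of distinct words."""
    buckets = {zlib.crc32(w.encode("utf-8")) % NUM_BUCKETS
               for w in text.lower().split()}
    scale = 1.0 / math.sqrt(max(len(buckets), 1))
    return {j: scale for j in buckets}


def predict(model, text):
    weights, bias = model
    return bias + sum(weights.get(j, 0.0) * v for j, v in features(text).items())


def train_student(annotated, epochs=200, lr=0.3):
    """Step 3: fit a linear regressor to the LLM scores with SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, score in annotated:
            feats = features(text)
            pred = bias + sum(weights.get(j, 0.0) * v for j, v in feats.items())
            err = pred - score
            bias -= lr * err
            for j, v in feats.items():
                weights[j] = weights.get(j, 0.0) - lr * err * v
    return weights, bias


sample_docs = [
    "photosynthesis converts light energy into chemical energy in plants",
    "a prime number has exactly two positive divisors",
    "click here to win a free prize today",
    "best deals on shoes buy now",
]
annotated = annotate_with_llm(sample_docs)
student = train_student(annotated)

# Steps 4-5: score documents with the cheap student and apply the threshold.
THRESHOLD = 3
kept = [doc for doc in sample_docs if predict(student, doc) >= THRESHOLD]
```

In this toy the student is only evaluated on the documents it was trained on; a real pipeline would hold out part of the annotated sample to check how well the student tracks the LLM before running it over the full corpus.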

When It Falls Down

Reference corpus bias. Wikipedia skews toward encyclopaedic prose, formal register, and English-dominant content. A classifier trained against it will penalise informal technical writing, code-heavy documents, non-Western cultural references, and low-resource languages even when those documents are genuinely high quality. The filter encodes the biases of whoever curates the positive set.
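One way to make this bias visible is to compare retention rates across slices of the corpus. The sketch below assumes each document already carries a classifier score and a slice label (domain, register, or language); the field names and the example values are hypothetical.

```python
from collections import defaultdict


def retention_by_slice(docs, threshold):
    """Fraction of documents at or above the threshold, per slice."""
    kept, total = defaultdict(int), defaultdict(int)
    for doc in docs:
        total[doc["slice"]] += 1
        kept[doc["slice"]] += int(doc["score"] >= threshold)
    return {name: kept[name] / total[name] for name in total}


scored_docs = [
    {"slice": "encyclopedic", "score": 0.9},
    {"slice": "encyclopedic", "score": 0.7},
    {"slice": "forum", "score": 0.6},
    {"slice": "forum", "score": 0.2},
    {"slice": "code", "score": 0.3},
    {"slice": "code", "score": 0.1},
]
rates = retention_by_slice(scored_docs, threshold=0.5)
```

A large gap between slices does not prove the filter is wrong, but it flags where a manual spot-check of dropped documents is worth the time.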

Distribution shift in the tail. The classifier is trained on a sample of the web. For highly unusual document types (legal contracts, poetry, source code interleaved with prose), its predictions are extrapolations, not interpolations. Hard thresholds applied to out-of-distribution content produce unpredictable retention rates.

Quality and relevance are not the same. A polished marketing brochure may score highly on fluency and coherence while contributing nothing useful to an LLM's world model. Conversely, a rough but accurate description of a niche technical process may score low. Classifiers trained on surface-level quality signals systematically under-retain specialised knowledge.

Score instability under deduplication ordering. In pipelines where filtering runs before deduplication, near-duplicate clusters tend to produce similar scores, so all copies survive or all get dropped together. When filtering runs after deduplication, a document's survival may depend on which representative of its cluster happened to be retained. The ordering matters and is rarely documented.
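A toy illustration of the ordering effect, assuming near-duplicate clusters have already been identified and that deduplication keeps the first document seen in each cluster. The cluster labels, scores, and helper names are made up for the example.

```python
def dedup(docs):
    """Keep the first document seen in each near-duplicate cluster."""
    seen, kept = set(), []
    for doc in docs:
        if doc["cluster"] not in seen:
            seen.add(doc["cluster"])
            kept.append(doc)
    return kept


def quality_filter(docs, threshold):
    """Keep documents whose classifier score meets the threshold."""
    return [doc for doc in docs if doc["score"] >= threshold]


docs = [
    {"id": "a1", "cluster": "A", "score": 0.48},  # near-duplicates whose scores
    {"id": "a2", "cluster": "A", "score": 0.52},  # straddle the threshold
    {"id": "b1", "cluster": "B", "score": 0.90},
]

filter_then_dedup = [d["id"] for d in dedup(quality_filter(docs, 0.5))]
dedup_then_filter = [d["id"] for d in quality_filter(dedup(docs), 0.5)]
```

Filtering first keeps a copy of cluster A (`a2`); deduplicating first retains `a1` as the representative, which then falls below the threshold, so the cluster disappears entirely.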

Threshold sensitivity. A 10-percentile shift in the hard cutoff can change the retained corpus size by 20-40% and noticeably alter downstream benchmark distributions. Most published pipelines report a single threshold chosen post-hoc against held-out benchmarks, without sensitivity analysis. Treating the threshold as a hyperparameter with principled selection (using a small language model trained on the filtered data as a proxy) is better practice but rarely done.
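A sensitivity sweep costs very little once scores exist. The sketch below assumes a list of documents with a score and a token count, and reports how many tokens survive at each candidate cutoff; the synthetic corpus and the function name are illustrative.

```python
def retained_tokens(docs, cutoff_percentile):
    """Tokens kept when documents below the given score percentile are dropped."""
    ranked = sorted(docs, key=lambda d: d["score"])
    drop = int(len(ranked) * cutoff_percentile / 100)
    return sum(d["tokens"] for d in ranked[drop:])


# Synthetic corpus: ten documents of 100 tokens with evenly spaced scores.
corpus = [{"score": i / 10 + 0.05, "tokens": 100} for i in range(10)]

sweep = {p: retained_tokens(corpus, p) for p in (40, 50, 60)}
relative_change = (sweep[50] - sweep[60]) / sweep[50]
```

Moving the cutoff from the 50th to the 60th percentile here shrinks the retained corpus from 500 to 400 tokens, a 20% relative change from a 10-point shift, and the relative change grows as the cutoff moves higher because the retained base gets smaller.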
