Quality Filtering with Classifiers
Classifier-based quality filtering uses lightweight models trained on curated reference corpora to score and discard low-quality web documents before LLM pretraining.
Roughly 80% of the tokens in a raw Common Crawl dump are noise: scraped navigation menus, garbled OCR, templated boilerplate, and gibberish that no heuristic rule reliably catches. Heuristic filters help at the margins, but the hard problem is distinguishing a coherent, informative paragraph from one that merely passes surface-level checks. Classifier-based quality filtering addresses this directly: train a model to discriminate between high-quality reference text and raw web text, then use its scores as a continuous quality signal at pipeline scale.
The Core Idea: Reference Corpora as Proxy Labels
Every quality classifier needs a notion of "good". The standard construction borrows it from corpora that already went through human editorial processes: Wikipedia, books, or peer-reviewed papers. Documents from these sources become positive examples; documents sampled randomly from the web become negative examples. A classifier trained on this binary task learns to approximate the latent quality signal those sources embody.
CCNet (Wenzek et al., 2020) popularised this pattern. It uses fastText for language identification, then scores each document with a 5-gram KenLM language model trained on Wikipedia in the same language. Perplexity under that model estimates how closely a document resembles Wikipedia-quality writing: documents are bucketed by perplexity, with low-perplexity (Wikipedia-like) documents retained and high-perplexity ones discarded. The same principle underlies the quality filter in GPT-3's data pipeline: a logistic-regression classifier trained on curated text (WebText, built from outbound links in upvoted Reddit posts, alongside Wikipedia and books) versus raw Common Crawl, with documents retained preferentially according to their classifier score.
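The perplexity-bucketing idea can be sketched in a few lines. This is a minimal illustration, not the CCNet implementation: a toy add-one-smoothed unigram model stands in for the Wikipedia-trained language model a real pipeline would use, and the names `UnigramLM` and `bucket_by_perplexity`, the regex tokenizer, and the sample texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class UnigramLM:
    """Add-one-smoothed unigram model; a toy stand-in for a real n-gram LM."""

    def __init__(self, reference_texts):
        self.counts = Counter(t for doc in reference_texts for t in tokenize(doc))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def perplexity(self, text):
        tokens = tokenize(text)
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def bucket_by_perplexity(lm, docs):
    """Split docs into head/middle/tail terciles, lowest perplexity first."""
    ranked = sorted(docs, key=lm.perplexity)
    third = math.ceil(len(ranked) / 3)
    return {
        "head": ranked[:third],
        "middle": ranked[third:2 * third],
        "tail": ranked[2 * third:],
    }


# Hypothetical reference corpus and web sample.
reference = [
    "the cell is the basic unit of life",
    "the war ended when the treaty was signed",
]
web_docs = [
    "the treaty ended the war",
    "buy cheap pills online now",
    "the cell is a unit",
]
buckets = bucket_by_perplexity(UnigramLM(reference), web_docs)
```

On this toy data the spam-like document shares no vocabulary with the reference text, so it receives the highest perplexity and lands in the tail bucket.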
The key insight is that you never directly label web documents as good or bad. You use editorial proxies, and the classifier generalises from those proxies to the rest of the web.
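A minimal sketch of the proxy-label setup follows, assuming a pure-Python logistic regression over hashed word unigrams and bigrams. The feature-space size, learning rate, epoch count, and sample documents are illustrative assumptions, not values from any published pipeline.

```python
import math
import re
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 12  # assumed feature-space size for this toy example


def featurize(text):
    """Map text to sparse counts of hashed word unigrams and bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def predict(weights, bias, feats):
    """Probability that a document resembles the reference corpus."""
    z = bias + sum(weights[i] * c for i, c in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(reference_docs, web_docs, epochs=50, lr=0.1):
    """Fit logistic regression: reference docs are label 1, web docs label 0."""
    data = [(featurize(d), 1.0) for d in reference_docs]
    data += [(featurize(d), 0.0) for d in web_docs]
    weights, bias = [0.0] * NUM_BUCKETS, 0.0
    for _ in range(epochs):
        for feats, label in data:
            err = predict(weights, bias, feats) - label
            bias -= lr * err
            for i, c in feats.items():
                weights[i] -= lr * err * c
    return weights, bias


# Hypothetical proxy-labelled data: no web document is labelled by hand.
reference_docs = [
    "The mitochondrion is an organelle that produces energy for the cell.",
    "The treaty was signed in 1648 and ended the long war in Europe.",
]
web_docs = [
    "click here to login sign up free free free",
    "home about contact menu login cart checkout click here",
]
weights, bias = train(reference_docs, web_docs)

good_score = predict(weights, bias, featurize("The cell produces energy in an organelle"))
junk_score = predict(weights, bias, featurize("click here free login menu"))
```

The labels come entirely from which corpus a document was drawn from; the classifier's score on unseen text is then used as the quality signal.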
Lightweight Architectures Matter at Scale
Processing trillions of tokens means the classifier must be fast. Three approaches cover most pipelines:
| Family | Example | Throughput | Typical use |
|---|---|---|---|
| Bag-of-n-grams (logistic regression / fastText) | GPT-3 quality classifier (logistic regression on hashed token features) | Millions of docs/minute on CPU | First-pass, language-aware filtering |
| Small neural classifier | BERT-base trained on quality labels | ~50k docs/min on GPU | Second-pass, higher precision |
| Importance resampling (DSIR) | Hashed n-gram features + importance weighting | Selects 100M docs from the full Pile in ~4.5 hours | Distribution-matching to a target |
For a 15-trillion-token corpus, per-document cost is multiplied across billions of documents, so even a 20% drop in throughput adds a large absolute amount of cluster time. FastText and n-gram logistic classifiers run on CPU clusters with no GPU cost, making them the practical default for a first-pass filter. BERT-scale classifiers are reserved for high-value subsets where precision matters more than throughput.
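A back-of-envelope calculation makes the scale concrete. The figures below are illustrative assumptions rather than measurements: 500 tokens per document and 2 million docs/minute for a CPU classifier are invented for the example, while 50k docs/minute is the GPU figure from the table above.

```python
def processing_hours(total_tokens, tokens_per_doc, docs_per_minute):
    """Wall-clock hours for one worker pool at the given throughput."""
    docs = total_tokens / tokens_per_doc
    return docs / docs_per_minute / 60


TOTAL_TOKENS = 15e12
TOKENS_PER_DOC = 500  # assumption: average document length

cpu_hours = processing_hours(TOTAL_TOKENS, TOKENS_PER_DOC, 2_000_000)
gpu_hours = processing_hours(TOTAL_TOKENS, TOKENS_PER_DOC, 50_000)

# Extra hours if the GPU classifier runs 20% slower than assumed.
slowdown_penalty = (
    processing_hours(TOTAL_TOKENS, TOKENS_PER_DOC, 50_000 * 0.8) - gpu_hours
)
```

Under these assumptions the CPU pass takes about 250 hours, the GPU pass about 10,000 hours, and a 20% throughput drop on the GPU pass adds about 2,500 hours.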
Scoring Strategies: Hard Cutoff vs. Soft Resampling
Once you have per-document scores, you must decide how to use them. Two strategies exist:
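Hard threshold filtering. Documents scoring below a fixed cutoff are dropped. CCNet partitioned each language into three perplexity buckets (head, middle, tail), with the head treated as the high-quality subset. GPT-3's filter is often described as a threshold, but it was a stochastic variant: a document was kept when a Pareto-distributed random draw exceeded one minus its classifier score, so most high-scoring documents and a small share of low-scoring ones survived. A pure hard cutoff is simple and fast, but binary: a document just below the cutoff gets treated identically to one scoring at the 1st percentile.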
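A hard cutoff reduces to a sort and a slice. The sketch below assumes scores are already attached to document IDs; `keep_top_fraction` and the sample scores are hypothetical names and data for illustration.

```python
def keep_top_fraction(scored_docs, fraction):
    """Keep the highest-scoring `fraction` of (doc_id, score) pairs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    n_keep = round(len(ranked) * fraction)
    return ranked[:n_keep]


# Ten hypothetical documents with scores 0.0, 0.1, ..., 0.9.
scored = [(f"doc{i}", i / 10) for i in range(10)]
kept_ids = [doc_id for doc_id, _ in keep_top_fraction(scored, 0.6)]
```

Here `doc3`, one rank below the cutoff, is discarded along with `doc0`, which is the cliff-edge behaviour in miniature.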
Importance resampling (DSIR). Rather than a hard cut, DSIR (Xie et al., NeurIPS 2023) assigns each document an importance weight: the ratio of its likelihood under a target distribution (for example, Wikipedia plus books) to its likelihood under the raw web distribution. Documents are then sampled without replacement with probability proportional to their weights. This preserves more of the web's diversity while biasing the mixture toward the target distribution, and avoids the cliff-edge behaviour of hard thresholds.
A rough sketch of the DSIR importance weight for document \(x\):
\[w(x) = \frac{p_{\text{target}}(x)}{p_{\text{web}}(x)}\]
In practice both densities are approximated using hashed n-gram feature counts, making the ratio tractable without a neural forward pass. KL reduction - the decrease in KL divergence between the sampled corpus and the target - serves as a downstream-correlated quality metric.
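The weight computation and resampling step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each density is modelled as a smoothed categorical distribution over hash buckets, the bucket count is arbitrary, and the Gumbel top-k trick is used as one standard way to sample without replacement in proportion to the weights.

```python
import math
import random
import re
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # assumed number of hash buckets


def hashed_ngrams(text):
    """Hash word unigrams and bigrams into a fixed number of buckets."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = words + [" ".join(p) for p in zip(words, words[1:])]
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams]


def fit_bucket_log_probs(docs):
    """Estimate an add-one-smoothed categorical distribution over buckets."""
    counts = Counter(b for d in docs for b in hashed_ngrams(d))
    total = sum(counts.values()) + NUM_BUCKETS
    return [math.log((counts[b] + 1) / total) for b in range(NUM_BUCKETS)]


def log_importance_weight(text, log_p_target, log_p_web):
    """log w(x) = log p_target(x) - log p_web(x) under the bag-of-buckets model."""
    return sum(log_p_target[b] - log_p_web[b] for b in hashed_ngrams(text))


def resample(docs, log_weights, k, rng):
    """Sample k docs without replacement via the Gumbel top-k trick."""
    keyed = []
    for doc, lw in zip(docs, log_weights):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        keyed.append((lw - math.log(-math.log(u)), doc))
    keyed.sort(reverse=True)
    return [doc for _, doc in keyed[:k]]


# Hypothetical target corpus and raw web pool.
target = ["the cell is the basic unit of life", "the treaty ended the war in europe"]
pool = ["click here to buy cheap pills", "login menu click here free offer"]
log_p_target = fit_bucket_log_probs(target)
log_p_web = fit_bucket_log_probs(pool)

lw_good = log_importance_weight("the treaty ended the war", log_p_target, log_p_web)
lw_junk = log_importance_weight("click here free offer", log_p_target, log_p_web)

log_weights = [log_importance_weight(d, log_p_target, log_p_web) for d in pool]
sample = resample(pool, log_weights, k=1, rng=random.Random(0))
```

Working in log space avoids underflow on long documents, and adding Gumbel noise before taking the top k means low-weight documents still have a nonzero chance of selection.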
LLM-Assisted Annotation for Richer Labels
A newer pattern, exemplified by FineWeb-Edu (Penedo et al., 2024), extends the binary reference-corpus framing into a graded-score regime. Instead of "Wikipedia vs. web", an LLM (Llama-3-70B-Instruct in FineWeb-Edu) is prompted to rate a sample of documents on a fine-grained scale - for example, 0 to 5 for educational value. A small classifier (FineWeb-Edu used a regression head on top of a compact text-embedding model; a fine-tuned DistilBERT or a gradient-boosted model on text features could play the same role) is then trained to replicate the LLM's scores and applied across the full corpus, because running the LLM on every document is prohibitively expensive.
The result is a classifier that approximates a much richer quality signal than binary labels allow. FineWeb-Edu filtered 15 trillion FineWeb tokens down to 1.3 trillion high-educational-value tokens; models trained on the filtered subset showed substantially improved performance on knowledge-intensive benchmarks like MMLU and ARC.
The practical pipeline looks like this:
1. Sample ~400k documents from the web corpus
2. Score each with a capable LLM (e.g. GPT-4 or Claude, scoring 0-5)
3. Train a small fast classifier on these (doc_features -> quality_score)
4. Run the fast classifier over all 15T tokens
5. Retain documents scoring >= threshold (e.g. >= 3)
The LLM annotation step is a one-time cost; the small classifier amortises it across the full corpus.
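The five steps above can be sketched end to end on toy data. Everything here is a stand-in: the `annotated` pairs are hypothetical scores of the kind an LLM annotator might return (no LLM is called), and the student is a plain linear regressor over hashed tokens rather than the embedding-based or transformer classifiers a real pipeline would use.

```python
import math
import re
import zlib

NUM_BUCKETS = 2 ** 12  # assumed feature-space size


def featurize(text):
    """L2-normalised indicator features over hashed word tokens."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    buckets = {zlib.crc32(t.encode("utf-8")) % NUM_BUCKETS for t in tokens}
    norm = math.sqrt(len(buckets)) or 1.0
    return {b: 1.0 / norm for b in buckets}


def predict(model, text):
    """Predicted quality score for one document."""
    weights, bias = model
    return bias + sum(weights[b] * v for b, v in featurize(text).items())


def train_student(annotated, epochs=200, lr=0.1):
    """Fit a linear regressor to (text, llm_score) pairs with plain SGD."""
    weights, bias = [0.0] * NUM_BUCKETS, 0.0
    for _ in range(epochs):
        for text, score in annotated:
            feats = featurize(text)
            err = bias + sum(weights[b] * v for b, v in feats.items()) - score
            bias -= lr * err
            for b, v in feats.items():
                weights[b] -= lr * err * v
    return weights, bias


# Steps 1-2: hypothetical (text, score) pairs standing in for LLM annotations.
annotated = [
    ("Photosynthesis converts light energy into chemical energy in plants.", 5),
    ("A prime number has exactly two positive divisors, one and itself.", 4),
    ("Best deals click now limited offer free shipping", 0),
    ("Login register cart checkout newsletter subscribe", 1),
]

# Step 3: train the small student model on the annotations.
model = train_student(annotated)

# Steps 4-5: score the full corpus and keep documents at or above the threshold.
corpus = [text for text, _ in annotated] + ["Subscribe now for free shipping deals"]
THRESHOLD = 3
kept = [doc for doc in corpus if predict(model, doc) >= THRESHOLD]
```

The student only needs to be good enough to rank documents the way the annotator would; its cost per document is a handful of multiplications rather than an LLM forward pass.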
When It Falls Down
Reference corpus bias. Wikipedia skews toward encyclopaedic prose, formal register, and English-dominant content. A classifier trained against it will penalise informal technical writing, code-heavy documents, non-Western cultural references, and low-resource languages even when those documents are genuinely high quality. The filter encodes the biases of whoever curates the positive set.
Distribution shift in the tail. The classifier is trained on a sample of the web. For highly unusual document types (legal contracts, poetry, source code interleaved with prose), its predictions are extrapolations, not interpolations. Hard thresholds applied to out-of-distribution content produce unpredictable retention rates.
Quality and relevance are not the same. A polished marketing brochure may score highly on fluency and coherence while contributing nothing useful to an LLM's world model. Conversely, a rough but accurate description of a niche technical process may score low. Classifiers trained on surface-level quality signals systematically under-retain specialised knowledge.
Score instability under deduplication ordering. In pipelines where filtering runs before deduplication, near-duplicate clusters tend to produce similar scores, so all copies usually survive or get dropped together. When filtering runs after deduplication, a cluster's survival depends on which representative happened to be retained, because only that copy is scored. The ordering matters and is rarely documented.
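The ordering effect shows up even on a three-document toy corpus. In this sketch the `cluster` field is assumed to come from an upstream near-duplicate detector, and the scores and the 0.5 threshold are invented for illustration.

```python
def filter_docs(docs, threshold):
    """Keep documents whose quality score meets the threshold."""
    return [d for d in docs if d["score"] >= threshold]


def dedup(docs):
    """Keep the first document seen from each near-duplicate cluster."""
    seen, kept = set(), []
    for d in docs:
        if d["cluster"] not in seen:
            seen.add(d["cluster"])
            kept.append(d)
    return kept


# Cluster A holds two near-duplicates whose scores straddle the threshold.
docs = [
    {"id": "a1", "cluster": "A", "score": 0.48},
    {"id": "a2", "cluster": "A", "score": 0.52},
    {"id": "b1", "cluster": "B", "score": 0.90},
]

filter_then_dedup = [d["id"] for d in dedup(filter_docs(docs, 0.5))]
dedup_then_filter = [d["id"] for d in filter_docs(dedup(docs), 0.5)]
```

Filtering first keeps `a2` and `b1`; deduplicating first keeps `a1` as cluster A's representative, which then fails the threshold, so the content of cluster A disappears entirely.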
Threshold sensitivity. A 10-percentile shift in the hard cutoff can change the retained corpus size by 20-40% in relative terms (tightening from the top 50% to the top 40% removes a fifth of the retained documents; from the top 25% to the top 15%, two fifths) and noticeably alter downstream benchmark results. Most published pipelines report a single threshold chosen post-hoc against held-out benchmarks, without sensitivity analysis. Treating the threshold as a hyperparameter with principled selection (for example, training a small proxy language model on each candidate filtered set and comparing benchmark scores) is better practice but rarely done.
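A small sweep shows how the same 10-point shift has a different relative effect depending on where the cutoff already sits. The sketch assumes one hypothetical document per percentile rank; `retained_count` and `relative_change` are illustrative helper names.

```python
def retained_count(percentile_ranks, cutoff):
    """Number of docs whose percentile rank is at or above the cutoff."""
    return sum(rank >= cutoff for rank in percentile_ranks)


def relative_change(percentile_ranks, old_cutoff, new_cutoff):
    """Fractional change in retained corpus size when the cutoff moves."""
    old = retained_count(percentile_ranks, old_cutoff)
    new = retained_count(percentile_ranks, new_cutoff)
    return (new - old) / old


percentile_ranks = list(range(100))  # one hypothetical doc per percentile

# Tightening by 10 points from two different starting cutoffs.
change_from_50 = relative_change(percentile_ranks, 50, 60)
change_from_75 = relative_change(percentile_ranks, 75, 85)
```

Moving the cutoff from the 50th to the 60th percentile shrinks the retained set by 20%, while the same shift from the 75th to the 85th shrinks it by 40%, which is why a sweep over several cutoffs is more informative than a single reported threshold.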
Further Reading
- Wenzek et al. (2020), "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data" - the foundational perplexity-filtering pipeline: https://arxiv.org/abs/1911.00359
- Xie et al. (2023), "Data Selection for Language Models via Importance Resampling" (DSIR, NeurIPS 2023) - principled distribution-matching alternative to hard thresholds: https://arxiv.org/abs/2302.03169
- Penedo et al. (2024), "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale" - LLM-annotated educational classifier and ablation study at 15T-token scale: https://arxiv.org/abs/2406.17557
- Penedo et al. (2023), "The RefinedWeb Dataset for Falcon LLM" - demonstrates web-only data can match curated corpora when filtered rigorously: https://arxiv.org/abs/2306.01116