Data Pipelines at Scale

Roughly 99% of the raw bytes scraped from a Common Crawl snapshot never make it into a model's training set. The Falcon team's RefinedWeb pipeline kept five trillion tokens from Common Crawl after filtering and deduplication, and publicly released a 600-billion-token extract of that set (Penedo et al., 2023). That discard rate is not waste; it is the core engineering problem of pretraining data.

From the Crawl to Clean Text

Most large-scale web pretraining pipelines start from the same commodity source: Common Crawl's petabyte-scale WARC archives, with crawls dating back to 2008 and new snapshots published roughly monthly. The raw bytes are mostly HTML markup and boilerplate, encoded in at least a dozen character sets, and riddled with exact and near-duplicate copies of the same article published across thousands of mirror sites.

The extraction stage transforms WARC records into plain Unicode text. Tools such as trafilatura and resiliparse, along with custom rule-based extractors, strip nav bars, cookie banners, and boilerplate markup. Language identification (typically a fastText language-ID classifier) follows immediately, because later quality heuristics are language-specific.
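The rule-based end of that extraction step can be sketched with the Python standard library alone. This is a minimal illustration, not how trafilatura or resiliparse work internally: the `SKIP_TAGS` and `BLOCK_TAGS` sets, the `TextExtractor` class, and the `extract_text` helper are all hypothetical names chosen for this example, and real extractors use far richer heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption,
# not the rule set of any particular extractor).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}
# Tags that start or end a visual block, so we emit a line break for them.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}


class TextExtractor(HTMLParser):
    """Collects visible text while skipping boilerplate subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Decode raw HTML bytes and return whitespace-normalised visible text."""
    try:
        html = raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset label: fall back to lossy UTF-8.
        html = raw.decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)
```

The charset fallback matters in practice because declared encodings in crawled pages are often wrong; a production pipeline would typically add charset detection rather than rely on a single fallback.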

Quality filtering then applies a cascade of cheap signal checks before any expensive model inference:

Filter type	Typical signal	Effect
Length	Fewer than 50 words	Drop short fragments
Character ratio	Non-alphanumeric > 20%	Catch encoding artefacts
Perplexity	KenLM score above threshold	Remove incoherent text
Stop-word density	Very low ratio	Catch keyword-stuffed pages
Line deduplication	Repeated lines within doc	Remove footer/nav artefacts
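The cascade in the table can be sketched as a chain of cheap predicates, with the perplexity check plugged in last because it is the most expensive. The 50-word and 20% thresholds come from the table; everything else here is an assumption made for illustration, including the tiny `STOP_WORDS` list, the 5% stop-word floor, the perplexity ceiling, and the function names.

```python
import re
from typing import Callable, Optional

# A deliberately tiny English stop-word list (illustrative only).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that",
              "it", "for", "on", "with", "as", "was"}


def too_short(text: str, min_words: int = 50) -> bool:
    return len(text.split()) < min_words


def symbol_heavy(text: str, max_ratio: float = 0.20) -> bool:
    # Ratio of non-alphanumeric characters, ignoring whitespace.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    non_alnum = sum(1 for c in chars if not c.isalnum())
    return non_alnum / len(chars) > max_ratio


def keyword_stuffed(text: str, min_stop_ratio: float = 0.05) -> bool:
    # Natural prose contains many stop words; keyword lists contain few.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return True
    stops = sum(1 for w in words if w in STOP_WORDS)
    return stops / len(words) < min_stop_ratio


def drop_repeated_lines(text: str) -> str:
    # Keep the first occurrence of each non-empty line within a document.
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def clean_and_filter(
    text: str,
    perplexity_fn: Optional[Callable[[str], float]] = None,
    max_perplexity: float = 1000.0,
) -> Optional[str]:
    """Return the cleaned document, or None if any filter rejects it."""
    text = drop_repeated_lines(text)
    if too_short(text) or symbol_heavy(text) or keyword_stuffed(text):
        return None
    # The expensive model-based check runs only on survivors.
    if perplexity_fn is not None and perplexity_fn(text) > max_perplexity:
        return None
    return text
```

Ordering the checks from cheapest to most expensive means most rejected documents never reach the language model at all.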

Perplexity filtering with a small n-gram LM (for example, a KenLM 5-gram model trained on Wikipedia) is especially powerful: a document whose perplexity sits far above the range the model assigns to clean prose is likely to contain garbled OCR, machine-translated spam, or auto-generated affiliate content.
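To make the scoring concrete, here is a toy stand-in for the n-gram model: a bigram LM with add-one smoothing, written in plain Python. It is not KenLM and does not use KenLM's API or its smoothing method; the `BigramLM` class and the two-sentence training corpus are assumptions for illustration only. The perplexity it computes is the usual one, the exponential of the negative mean log-probability per token.

```python
import math
from collections import Counter
from typing import Iterable


class BigramLM:
    """Add-one-smoothed bigram model (a toy substitute for a real n-gram LM)."""

    def __init__(self, corpus: Iterable[str]):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
            vocab.update(tokens[1:])
        # One extra slot stands in for every unseen word.
        self.vocab_size = len(vocab) + 1

    def log_prob(self, prev: str, word: str) -> float:
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + self.vocab_size
        return math.log(numerator / denominator)

    def perplexity(self, text: str) -> float:
        tokens = ["<s>"] + text.lower().split()
        if len(tokens) < 2:
            return float("inf")  # empty text: nothing to score
        total = sum(self.log_prob(p, w) for p, w in zip(tokens, tokens[1:]))
        return math.exp(-total / (len(tokens) - 1))


lm = BigramLM(["the cat sat on the mat", "the dog sat on the rug"])
fluent = lm.perplexity("the cat sat on the mat")
scrambled = lm.perplexity("mat the on sat cat the")
```

Here `scrambled` comes out higher than `fluent`, because none of its word pairs were seen in training. A filter would then drop documents whose score exceeds a cutoff chosen from the score distribution of known-good text.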

Deduplication: Why One Copy Is Enough

Identical or near-identical content at scale causes two separate harms. First, a model that repeatedly sees the same passage memorises it verbatim; Lee et al. (2022) showed that models trained on deduplicated data emitted memorised text ten times less frequently and needed fewer training steps to reach the same or better accuracy. Second, test-set contamination (covered below) becomes harder to audit when the corpus is full of near-duplicates that differ only in whitespace.

Two algorithms dominate at scale:

MinHash LSH (Locality-Sensitive Hashing). Each document is represented as a set of n-gram shingles; MinHash sketches approximate the Jaccard similarity between those sets, and LSH buckets the sketches so that only likely matches are compared. Documents exceeding a similarity threshold (commonly 0.8) are collapsed to a single representative. MinHash scales near-linearly and works well for document-level fuzzy duplicates.
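The sketching step can be illustrated in a few lines of standard-library Python. This is a minimal sketch under stated assumptions: word 3-gram shingles, 128 hash functions simulated by salting BLAKE2b with a seed, and helper names (`shingles`, `minhash_signature`, `estimate_jaccard`) invented for this example. It omits the LSH bucketing that makes the method scale and compares signatures directly.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    # A seeded 64-bit hash; each seed acts as one hash function.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set, num_perm: int = 128) -> list:
    """For each hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn tonight"
estimated = estimate_jaccard(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
```

The two example documents differ only in their last word, so their exact Jaccard similarity over 3-gram shingles is 14/16 = 0.875, and `estimated` should land near that value. The estimate's error shrinks as the number of hash functions grows.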

Suffix-array exact deduplication. The entire corpus is concatenated and indexed with a suffix array, which sorts every suffix so that repeated substrings end up adjacent. Shared substrings longer than some threshold (e.g., 100 tokens) are identified and the duplicate spans removed from all but one document. This catches passages repeated verbatim inside otherwise different documents, which MinHash misses because the surrounding text pulls the overall similarity below the threshold.
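A small-scale sketch of the idea follows. It builds the suffix array naively by sorting token-list slices, which is far too slow for a real corpus, and it treats whitespace-separated words as tokens. The function name `dedup_exact`, the per-document sentinel tokens, and the policy of keeping the earliest occurrence are all assumptions for this example, not the procedure of any published pipeline.

```python
def dedup_exact(docs: list, min_len: int = 5) -> list:
    """Remove token spans of length >= min_len that already occurred earlier.

    docs is a list of token lists; the earliest occurrence of each span is kept.
    Assumes no real token starts with the NUL character used for sentinels.
    """
    corpus = []
    owner = []  # doc index for each position, or None for a sentinel
    for d, tokens in enumerate(docs):
        for tok in tokens:
            corpus.append(tok)
            owner.append(d)
        # A unique end-of-document sentinel stops matches crossing documents.
        corpus.append(f"\x00{d}")
        owner.append(None)

    # Naive suffix array: positions sorted by the suffix that starts there.
    suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

    def same_prefix(i: int, j: int) -> bool:
        # Equal slices must both be full length, since i != j.
        return corpus[i:i + min_len] == corpus[j:j + min_len]

    duplicate = [False] * len(corpus)
    run_start = 0
    for r in range(1, len(suffix_array) + 1):
        if r < len(suffix_array) and same_prefix(suffix_array[r - 1], suffix_array[r]):
            continue  # still inside a run of suffixes sharing a prefix
        run = suffix_array[run_start:r]
        if len(run) > 1:
            first = min(run)  # keep the earliest occurrence
            for p in run:
                if p != first:
                    for q in range(p, p + min_len):
                        duplicate[q] = True
        run_start = r

    cleaned = [[] for _ in docs]
    for pos, d in enumerate(owner):
        if d is not None and not duplicate[pos]:
            cleaned[d].append(corpus[pos])
    return cleaned


shared = "alpha beta gamma delta epsilon zeta".split()
docs = [
    "intro one".split() + shared,
    "other two words".split() + shared + ["tail"],
]
cleaned = dedup_exact(docs, min_len=5)
```

In this example the six shared tokens survive in the first document and are cut from the second, leaving `["other", "two", "words", "tail"]`. Longer repeats are covered automatically, because every window of `min_len` tokens inside them is itself a duplicate.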

Both are applied in the RefinedWeb pipeline. The choice of threshold matters: a Jaccard cutoff set too low deletes legitimately similar documents, such as separate news summaries of the same event; set too high, it leaves near-duplicates intact.
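One way to reason about the cutoff is the standard LSH banding analysis, which the text above does not spell out, so treat this as supplementary. If a signature is split into b bands of r rows, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve; the 20-band, 5-row configuration is an arbitrary assumption, not a setting taken from any named pipeline.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approx_threshold(bands: int, rows: int) -> float:
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


# Illustrative configuration: 20 bands of 5 rows (100 hash values in total).
for s in (0.3, 0.5, 0.8):
    p = candidate_probability(s, bands=20, rows=5)
    print(f"similarity {s:.1f} -> candidate probability {p:.4f}")
```

With these settings a pair at similarity 0.8 is almost always flagged, while a pair at 0.3 rarely is. Fewer, wider bands shift the curve to the right and make the filter more conservative.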
