Web-Scale Corpus Construction

The Common Crawl portion of GPT-3's training data came to roughly 570 GB of filtered text, distilled from about 45 TB of compressed raw crawl data. That ratio, with well over 90% of the raw material thrown away before a single gradient step, is not waste. It is the job. The quality of a pretraining corpus is arguably the strongest single lever on downstream model capability, yet it receives far less systematic attention than architecture or optimisation.

This concept walks through the five major stages of corpus construction at web scale: extraction, quality filtering, deduplication, mixture design, and decontamination. Tokeniser training is addressed as a downstream consequence of the corpus, not a separate pipeline.

Extraction: from raw crawl to usable text

Common Crawl is the canonical starting point. Its archive holds petabytes of WARC (Web ARChive) files accumulated over regular crawls; a single crawl snapshot contains billions of web pages. The extraction step converts HTML into clean, structured plain text.

A widely used tool is trafilatura (alongside older extractors such as jusText and newspaper3k). These libraries strip boilerplate - navigation bars, cookie banners, footer links - and retain the main body text. The challenge is that "main body" is a heuristic notion. A single HTML page might interleave a 200-word article with 1,000 words of sidebar ads; no extractor gets this right universally.
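
To make the boilerplate problem concrete, here is a minimal sketch that uses only Python's standard library. It is not how trafilatura works internally; it is a toy stand-in that drops text inside a hand-picked set of tags. The tag list `SKIP_TAGS` and the function name `extract_main_text` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents this toy extractor treats as boilerplate (an assumption,
# not a list taken from any real extraction library).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the visible text of a page with tagged boilerplate removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A sidebar wrapped in a plain `<div class="sidebar">` passes straight through this filter, which is exactly the failure mode described above: tag names alone do not identify the main body.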

Language identification follows immediately. fastText's language ID model (lid.176.bin) classifies each document into one of 176 languages with high throughput. For English-only corpora, documents whose top prediction is not English, or whose confidence falls below roughly 0.65, are discarded. For multilingual corpora the threshold becomes a policy decision per language, because lower-resource languages have noisier crawl coverage.
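
The thresholding step can be sketched as below. To keep the example runnable without a model file, the classifier is passed in as a function; `filter_by_language`, `Predictor`, and the parameter names are hypothetical. The commented wrapper shows one way the fastText model might be plugged in, assuming the `fasttext` Python package and a local copy of lid.176.bin.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor returns (language_code, confidence) for its top guess,
# for example ("en", 0.97).
Predictor = Callable[[str], Tuple[str, float]]


def filter_by_language(
    docs: Iterable[str],
    predict: Predictor,
    target_lang: str = "en",
    min_confidence: float = 0.65,
) -> Iterator[str]:
    """Yield documents whose top language matches and clears the threshold."""
    for doc in docs:
        # fastText's predict() expects a single line, so newlines are flattened.
        label, confidence = predict(doc.replace("\n", " "))
        if label == target_lang and confidence >= min_confidence:
            yield doc


# Possible fastText wrapper (assumes the fasttext package and the model file):
#
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#
#   def fasttext_predict(text):
#       labels, probs = model.predict(text, k=1)
#       return labels[0].replace("__label__", ""), float(probs[0])
```

For a multilingual corpus, `min_confidence` would more plausibly be a per-language lookup than a single number.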

At this point, the raw corpus is still enormous and largely unusable. Extraction reduces token count by a factor of three to five; the next stages reduce it further.

Quality Filtering

No single definition of "quality" exists. In practice, filtering pipelines combine several families of signal:

Heuristic rules operate on surface statistics. Common examples:

Rule	Rationale
Discard if fewer than 100 words	Stub pages, error pages
Discard if more than 30% punctuation	Encoded binary, markup residue
Discard if more than 90% of lines start with #	Code/config files mistaken for prose
Discard if mean word length is below 3 or above 10	Tokenisation artefacts
Discard if alphabetic character fraction is below 0.7	Tables, SEO spam
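
The rules in the table translate almost directly into code. The sketch below is one possible reading: the table does not say what the fractions are measured against, so computing them over non-whitespace characters and non-empty lines is an assumption, as is the function name `passes_heuristics`.

```python
import string


def passes_heuristics(text: str) -> bool:
    """Return True if a document survives all five surface-statistic rules."""
    words = text.split()
    if len(words) < 100:
        return False  # stub or error page

    # Assumption: character fractions ignore whitespace.
    chars = [c for c in text if not c.isspace()]

    punct_fraction = sum(c in string.punctuation for c in chars) / len(chars)
    if punct_fraction > 0.30:
        return False  # encoded binary or markup residue

    # Assumption: blank lines do not count towards the line statistics.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    hash_fraction = sum(ln.lstrip().startswith("#") for ln in lines) / len(lines)
    if hash_fraction > 0.90:
        return False  # code or config file mistaken for prose

    mean_word_length = sum(len(w) for w in words) / len(words)
    if mean_word_length < 3 or mean_word_length > 10:
        return False  # tokenisation artefacts

    alpha_fraction = sum(c.isalpha() for c in chars) / len(chars)
    if alpha_fraction < 0.70:
        return False  # tables or SEO spam

    return True
```

The word-count rule runs first, so the later divisions never see an empty document.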

The C4 dataset (used for T5 training) popularised an influential ruleset: keep only lines that end in a terminal punctuation mark, discard pages with fewer than three sentences, discard any page containing the phrase "lorem ipsum", and so on. These rules are cheap and surprisingly effective at removing machine-generated spam.
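
A minimal sketch of C4-style cleaning follows. It covers only the rules named above, applies the terminal-punctuation check per line (an assumption about how that rule is implemented), and counts sentences with a crude regular expression. The function name `c4_style_clean` and the `min_sentences` parameter are hypothetical.

```python
import re
from typing import Optional

# Characters accepted as the end of a "finished" line.
TERMINAL_PUNCT = (".", "!", "?", '"')

# Crude sentence boundary: terminal punctuation, an optional closing quote,
# then whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+[\"']?(?:\s|$)")


def c4_style_clean(text: str, min_sentences: int = 3) -> Optional[str]:
    """Return the cleaned document, or None if the page should be dropped."""
    if "lorem ipsum" in text.lower():
        return None  # placeholder text signals an unfinished or template page

    # Keep only lines that end like a sentence; menus and buttons usually do not.
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)
    ]
    cleaned = "\n".join(kept)

    if len(SENTENCE_END.findall(cleaned)) < min_sentences:
        return None  # too little running prose left
    return cleaned
```

Note that the function both edits documents (dropping lines) and rejects them outright, so it belongs before any document-level statistics are computed.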

Perplexity filtering scores each document against a small reference language model (typically a KenLM n-gram model trained on Wikipedia or a high-quality seed corpus). Documents with perplexity above a threshold - say, the 30th percentile of perplexity scores across the crawl, which keeps only the lowest-scoring 30% of documents - are discarded. The intuition is that very high perplexity relative to clean text signals garbled language. CCNet introduced this approach, and a number of later pipelines have adopted it.
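
The mechanics can be illustrated without KenLM. The sketch below swaps in a toy unigram model with add-one smoothing as the reference model, which is far weaker than an n-gram model but exposes the same two steps: score every document, then cut at a percentile. `UnigramLM` and `perplexity_filter` are hypothetical names, and the nearest-rank percentile is an arbitrary choice.

```python
import math
from collections import Counter
from typing import List


class UnigramLM:
    """Toy stand-in for a reference language model (not KenLM)."""

    def __init__(self, reference_texts: List[str]):
        self.counts = Counter(
            word for text in reference_texts for word in text.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def perplexity(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return float("inf")
        # Add-one smoothing keeps unseen words from yielding log(0).
        log_prob = sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in words
        )
        return math.exp(-log_prob / len(words))


def perplexity_filter(
    docs: List[str], lm: UnigramLM, percentile: float = 30.0
) -> List[str]:
    """Keep documents at or below the given percentile of perplexity."""
    if not docs:
        return []
    scores = [lm.perplexity(d) for d in docs]
    ranked = sorted(scores)
    # Nearest-rank percentile: the smallest score covering `percentile` % of docs.
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    threshold = ranked[rank - 1]
    return [d for d, s in zip(docs, scores) if s <= threshold]
```

Because the threshold is a percentile of the corpus being filtered, the fraction kept stays fixed even if the crawl as a whole is unusually clean or unusually noisy.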
