Web-Scale Corpus Construction
Building a pretraining corpus at web scale requires five tightly coupled stages - extraction, quality filtering, deduplication, source mixing, and decontamination - each of which silently determines what a model can and cannot know.
The Common Crawl portion of GPT-3's training data came to roughly 570 GB of filtered text, distilled from about 45 TB of compressed raw crawl data. That ratio - roughly 99% thrown away before a single gradient step - is not waste. It is the job. The quality of a pretraining corpus is arguably the strongest single lever on downstream model capability, yet it receives far less systematic attention than architecture or optimisation.
This concept walks through the five major stages of corpus construction at web scale: extraction, quality filtering, deduplication, mixture design, and decontamination. Tokeniser training is addressed as a downstream consequence of the corpus, not a separate pipeline.
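To make the order of the five stages concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the function names, the word-count quality heuristic, exact-hash deduplication, quota-based mixing, and word n-gram decontamination are simple stand-ins, not the methods any production pipeline is known to use.

```python
import hashlib
import html
import re


def extract_text(raw_html):
    """Stage 1 (toy): strip tags, unescape entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def passes_quality(text, min_words=5):
    """Stage 2 (toy heuristic): keep documents with at least min_words words."""
    return len(text.split()) >= min_words


def deduplicate(docs):
    """Stage 3 (toy): exact deduplication on a hash of the lowercased text."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def mix_sources(docs_by_source, weights, budget):
    """Stage 4 (toy): take up to weight * budget documents from each source."""
    mixed = []
    for source, docs in docs_by_source.items():
        quota = int(weights.get(source, 0.0) * budget)
        mixed.extend(docs[:quota])
    return mixed


def word_ngrams(text, n):
    """Return the set of lowercased word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(docs, benchmark_texts, n=5):
    """Stage 5 (toy): drop documents sharing any word n-gram with a benchmark."""
    banned = set()
    for item in benchmark_texts:
        banned |= word_ngrams(item, n)
    return [doc for doc in docs if not (word_ngrams(doc, n) & banned)]


def build_corpus(raw_by_source, weights, budget, benchmark_texts):
    """Run the five stages in order over small in-memory inputs."""
    cleaned = {}
    for source, pages in raw_by_source.items():
        texts = [extract_text(page) for page in pages]
        texts = [text for text in texts if passes_quality(text)]
        cleaned[source] = deduplicate(texts)
    mixed = mix_sources(cleaned, weights, budget)
    return decontaminate(mixed, benchmark_texts)


if __name__ == "__main__":
    raw = {
        "web": [
            "<p>The quick brown fox jumps over the lazy dog.</p>",
            "<div>The quick brown fox jumps over the lazy dog.</div>",
            "<p>Too short</p>",
        ],
        "code": ["<pre>def add(a, b): return a + b  # adds two numbers</pre>"],
    }
    for doc in build_corpus(raw, {"web": 0.5, "code": 0.5}, 4, []):
        print(doc)
```

Each stage here only ever removes or reorders documents, which is why decisions made early (what extraction keeps, what the quality filter rejects) constrain everything downstream.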
Extraction: from raw crawl to usable text