
Architectures & Scaling

Data Pipelines at Scale

Building a pretraining corpus requires extracting, filtering, deduplicating, and mixing hundreds of billions of tokens from heterogeneous sources while keeping benchmark contamination out.

advanced · 8 min read

Roughly 99% of the raw bytes scraped from a Common Crawl snapshot never make it into a model's training set. Even with extensive filtering and deduplication, the Falcon team obtained five trillion tokens from Common Crawl for RefinedWeb, of which they publicly released a 600-billion-token extract (Penedo et al., 2023). The discarded bulk is not waste; deciding what to discard is the core engineering problem of pretraining data.

From the Crawl to Clean Text

Most large-scale pretraining pipelines start from the same commodity source: Common Crawl's petabyte-scale WARC archives, which go back to 2008 and are now refreshed with a new crawl roughly every month. The raw bytes are mostly HTML markup and boilerplate, encoded in at least a dozen character sets, and riddled with exact and near-duplicate copies of the same article published across thousands of mirror sites.
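To make those three problems concrete, here is a minimal sketch using only the Python standard library. It decodes raw page bytes with a charset fallback, strips markup down to visible text, and drops exact duplicates by hashing the normalized text. All function names (`decode_bytes`, `html_to_text`, `dedup_exact`) are hypothetical, the input is assumed to be already-extracted page bodies rather than real WARC records, and this is not the method of any particular pipeline; production systems typically use dedicated extractors and fuzzy deduplication.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def decode_bytes(raw, declared=None):
    """Try the declared charset, then UTF-8, then Latin-1 (which never fails)."""
    for enc in (declared, "utf-8"):
        if enc:
            try:
                return raw.decode(enc)
            except (LookupError, UnicodeDecodeError):
                continue
    return raw.decode("latin-1")


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def content_key(text):
    """Hash of the case-folded text; equal keys mean exact duplicates."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()


def dedup_exact(pages):
    """pages: iterable of (raw_bytes, declared_charset_or_None).

    Returns the unique extracted texts in first-seen order.
    """
    seen = set()
    unique = []
    for raw, charset in pages:
        text = html_to_text(decode_bytes(raw, charset))
        key = content_key(text)
        if text and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Exact hashing only catches copies that are identical after normalization. Mirror sites that change a header, a date, or an ad block produce near-duplicates, which need a fuzzy method such as MinHash over word shingles; that step is not shown here.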
