Code Data Curation

A systematic account of how raw source code from the internet is transformed into a deduplicated, filtered, mixed, and decontaminated pretraining corpus for code-focused language models.

Roughly 86% of GitHub repositories are forks, and a substantial fraction of the remainder contain auto-generated boilerplate, minified JavaScript, and vendored dependencies. Feed that raw crawl into a language model and you train primarily on noise. The Stack v2 (used for StarCoder2) grew to 4× the size of its predecessor not by relaxing quality standards but by finding better signals for what "quality" means in code. The discipline that bridges raw crawl and a useful pretraining corpus is code data curation, and its decisions propagate directly into downstream coding ability.

Extraction and Language Detection

The first bottleneck is getting source code out of large web crawls or version-control archives in a reproducible way. Software Heritage provides persistent, content-addressed snapshots of public repositories under stable identifiers (SWHIDs), which StarCoder2 adopted to make its corpus fully auditable. Common alternatives start from GitHub archive dumps, BigQuery public datasets, or Common Crawl (an independent nonprofit crawl) filtered for code-like MIME types.
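To make "content-addressed" concrete, here is a minimal sketch of how a SWHID for a single file's content can be computed. A content SWHID has the form `swh:1:cnt:<hex>`, where the hex digest is the same SHA-1 that Git assigns to a blob. The helper names `content_swhid` and `dedupe_exact` are hypothetical, chosen for illustration; this is not StarCoder2's pipeline, only a sketch of the identifier scheme and of the exact-duplicate collapse it enables.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID for raw file content (object type 'cnt').

    The intrinsic hash matches a Git blob hash: SHA-1 over the header
    b"blob <size>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def dedupe_exact(files: dict[str, bytes]) -> dict[str, str]:
    """Map each distinct content SWHID to the first path seen with it.

    Hypothetical helper: byte-identical files (for example, copies that
    live in forks) collapse to a single entry.
    """
    seen: dict[str, str] = {}
    for path, data in files.items():
        seen.setdefault(content_swhid(data), path)
    return seen


if __name__ == "__main__":
    sample = {
        "repo_a/util.py": b"def add(a, b):\n    return a + b\n",
        "fork_of_a/util.py": b"def add(a, b):\n    return a + b\n",
        "repo_b/main.py": b"print('hi')\n",
    }
    for swhid, path in dedupe_exact(sample).items():
        print(swhid, path)
```

Because the identifier depends only on the bytes, the same file gets the same ID in every fork that carries it. That lets a corpus be audited file by file, and it catches exact duplicates before any fuzzier near-duplicate detection runs.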
