Code Data Curation

Roughly 86 % of GitHub repositories are forks, and a substantial fraction of the remainder contain auto-generated boilerplate, minified JavaScript, and vendored dependencies. Feed that raw crawl into a language model and you train primarily on noise. The Stack v2 (used for StarCoder2) grew to 4× the size of its predecessor not by relaxing quality standards but by finding better signals for what "quality" means in code. The discipline that bridges raw crawl and a useful pretraining corpus is code data curation, and its decisions propagate directly into downstream coding ability.

Extraction and Language Detection

The first bottleneck is getting source code out of large web crawls or version-control archives in a reproducible way. Software Heritage provides persistent, content-addressed snapshots of public repositories under stable identifiers (SWHIDs), which StarCoder2 adopted to make its corpus fully auditable. Common alternatives start from GitHub archive dumps, Google BigQuery public datasets, or Common Crawl (an independent non-profit crawl, not a Google product) filtered for code-like MIME types.
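To make "content-addressed" concrete, here is a minimal sketch of how the identifier for a single file's contents can be derived. It assumes the content-object form of a SWHID (`swh:1:cnt:` followed by the Git-blob-compatible SHA-1 of the bytes); the function name `content_swhid` is illustrative and not part of any Software Heritage library.

```python
import hashlib


def content_swhid(content: bytes) -> str:
    """Return the SWHID for raw file contents (object type 'cnt')."""
    # Content SWHIDs reuse the Git blob hash: SHA-1 over a
    # "blob <size>\0" header followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    digest = hashlib.sha1(header + content).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

Because the identifier depends only on the bytes, two pipelines that ingest the same file independently arrive at the same ID, which is what makes a corpus built this way auditable.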

Once files are obtained, language identification is non-trivial. File extensions cover the common cases (.py, .rs, .java) but miss polyglot files, template languages, and configuration DSLs. Downstream tools typically combine extension heuristics with libraries like linguist (GitHub's language detector) and a byte-level trigram classifier as a fallback.
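The layered approach can be sketched as follows. This is an illustration under stated assumptions rather than how linguist works: the two lookup tables are deliberately tiny and hypothetical, and a shebang check stands in for the first content-based step. Returning `None` marks the point where a pipeline would hand the file to a content classifier such as the trigram model mentioned above.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, deliberately small tables; a real pipeline needs far larger ones.
EXTENSION_TO_LANGUAGE = {".py": "Python", ".rs": "Rust", ".java": "Java"}
SHEBANG_TO_LANGUAGE = {"python": "Python", "python3": "Python",
                       "bash": "Shell", "sh": "Shell"}


def detect_language(path: str, content: bytes) -> Optional[str]:
    """Guess a file's language from its extension, then from its shebang line."""
    # Step 1: the file extension covers the common cases cheaply.
    ext = PurePosixPath(path).suffix.lower()
    if ext in EXTENSION_TO_LANGUAGE:
        return EXTENSION_TO_LANGUAGE[ext]

    # Step 2: extensionless scripts often declare an interpreter on line one.
    first_line = content.split(b"\n", 1)[0].decode("utf-8", errors="replace").strip()
    if first_line.startswith("#!"):
        tokens = first_line[2:].split()
        if tokens:
            interpreter = tokens[0].rsplit("/", 1)[-1]
            # "#!/usr/bin/env python3" names the interpreter in the second token.
            if interpreter == "env" and len(tokens) > 1:
                interpreter = tokens[1]
            return SHEBANG_TO_LANGUAGE.get(interpreter)

    # Step 3: unknown; defer to a content-based classifier.
    return None
```

Ordering the checks from cheapest to most expensive keeps the classifier off the hot path for the large majority of files whose extension is unambiguous.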

A useful early filter is file size. Files below roughly 100 bytes are usually empty or stubs; files above about a megabyte are usually minified assets or auto-generated serialised data. Both categories have near-zero signal-to-noise ratio and are cheap to remove:

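MIN_BYTES = 100        # below this, files are usually empty or stubs
MAX_BYTES = 1_048_576  # 1 MiB; larger files are usually minified or generated

def passes_size_filter(content: bytes) -> bool:
    """Return True if the file's size falls within the accepted byte range."""
    return MIN_BYTES <= len(content) <= MAX_BYTES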

Quality Filtering

Quality filtering for code differs structurally from natural-language filtering because "quality" is partly objective: code either compiles or it does not, has well-formed syntax or it does not, follows a consistent style or it does not.

Heuristic filters that the BigCode and DeepSeek-Coder teams converged on include:

Signal	Typical threshold	Rationale
Average line length	< 100 chars	Minified JS/CSS has very long lines
Alphabetic character ratio	> 0.25	Rejects encoded binary blobs
XML/HTML fraction	< 0.2	Rejects data disguised as code
Number of lines	5 to 100 000	Removes stubs and huge generated files
Comment-to-code ratio	configurable	Too high = template; too low = obfuscated
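Three of the signals in the table can be sketched in a few lines. This is a minimal illustration using the thresholds listed above, not the exact implementation of any of the named projects; the function name `passes_heuristics` is hypothetical, and the XML/HTML and comment-ratio checks are omitted because they need language-aware parsing.

```python
def passes_heuristics(text: str) -> bool:
    """Apply line-count, average-line-length and alphabetic-ratio filters."""
    lines = text.splitlines()
    n_lines = len(lines)

    # Number of lines: drop stubs and huge generated files.
    if not (5 <= n_lines <= 100_000):
        return False

    # Average line length: minified JS/CSS has very long lines.
    avg_line_length = sum(len(line) for line in lines) / n_lines
    if avg_line_length >= 100:
        return False

    # Alphabetic character ratio: encoded binary blobs score low.
    # (text is non-empty here because it has at least five lines.)
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    if alpha_ratio <= 0.25:
        return False

    return True
```

Running the cheapest check first means most rejected files never reach the per-character scan.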

Star-count filtering is an attractive proxy for quality (popular repositories presumably contain better code) but is empirically unreliable. SantaCoder found that restricting to repositories with 5+ GitHub stars degraded performance, likely because star counts correlate with project age and novelty rather than code quality, and because obscure but well-written codebases are eliminated. This is a consistent lesson: social signals are weak quality proxies in code.
