Heuristic Quality Filters

Roughly 40% of tokens in a raw Common Crawl dump are not natural language in any useful sense: they are JavaScript fragments, cookie-consent boilerplate, auto-generated spam, repeated navigation menus, and pages where every line is shorter than a tweet. A well-designed heuristic filter pass eliminates the bulk of this noise in a single CPU-bound sweep, before you spend money training a quality classifier. Get this pass wrong and the classifier never recovers; get it right and your corpus improves dramatically with zero model inference cost.

What a heuristic filter actually is

A heuristic filter is a deterministic function f(doc) -> {keep, discard} that fires on document- or line-level statistics: character counts, token counts, symbol ratios, n-gram repetition rates, keyword presence. No model weights are involved. The logic is inspectable, reproducible, and fast enough to run across trillions of tokens on a modest cluster in hours rather than days.
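As a minimal sketch of that definition, a filter can be modelled as a plain function from document text to a boolean. The names `DocFilter`, `min_words` and `apply_filters` are illustrative assumptions, not taken from any particular pipeline.

```python
from typing import Callable, Iterable, Iterator, List

# A filter is a pure function from document text to a decision.
# True means "keep", False means "discard".
DocFilter = Callable[[str], bool]

def min_words(n: int) -> DocFilter:
    """Build a filter that keeps documents with at least n whitespace-separated words."""
    return lambda doc: len(doc.split()) >= n

def apply_filters(docs: Iterable[str], filters: List[DocFilter]) -> Iterator[str]:
    """Yield only the documents that every filter keeps."""
    for doc in docs:
        if all(f(doc) for f in filters):
            yield doc

docs = [
    "Home | About | Contact",
    "This page has enough words to pass the toy length gate.",
]
kept = list(apply_filters(docs, [min_words(8)]))
```

Because each filter is stateless, the same list of functions can be mapped over shards of a crawl in parallel with no coordination between workers.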

The intuition is that quality text has a recognisable statistical signature. A well-written paragraph has sentences ending in punctuation, a vocabulary drawn from a natural language, a non-trivial mix of content words rather than repeated boilerplate phrases, and a length that encodes at least one coherent thought. Heuristics operationalise these intuitions as thresholds.

The canonical filter families

Document-length filters

The simplest gate: discard any document below a minimum word count or sentence count. The C4 dataset (the cleaned corpus behind T5) discards pages with fewer than three sentences and removes any line under five words. These thresholds sound arbitrary but they cut a large fraction of crawl noise that consists of single-sentence stub pages, error messages, and navigation menus mistakenly extracted as documents.

A complementary upper bound is less common but occasionally useful: documents that are anomalously long (millions of tokens) are often database dumps, log files, or concatenated PDFs with broken extraction.
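A rough sketch of a C4-style length gate follows. The line and sentence thresholds are the ones quoted above; the regex sentence splitter and the `MAX_WORDS` upper bound are simplifying assumptions for illustration, not what any published pipeline uses.

```python
import re
from typing import Optional

MIN_WORDS_PER_LINE = 5   # line-level threshold quoted above
MIN_SENTENCES = 3        # document-level threshold quoted above
MAX_WORDS = 1_000_000    # hypothetical upper bound for anomalously long documents

# Crude sentence splitter: break on whitespace that follows ., ! or ?
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def length_filter(doc: str) -> Optional[str]:
    """Return the cleaned text, or None if the document should be discarded."""
    # Drop short lines first, then judge what remains.
    lines = [ln for ln in doc.splitlines() if len(ln.split()) >= MIN_WORDS_PER_LINE]
    text = "\n".join(lines)
    sentences = [s for s in _SENTENCE_BREAK.split(text) if s.strip()]
    if len(sentences) < MIN_SENTENCES or len(text.split()) > MAX_WORDS:
        return None
    return text
```

Note the ordering: short lines are removed before sentences are counted, so a page that is mostly menu items cannot pass on the strength of its navigation text.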

Character- and symbol-ratio filters

Natural text has a bounded ratio of non-alphabetic characters to total characters. Heavily SEO-spammed pages, auto-generated product listings, and raw HTML bleed-through push this ratio high. A common threshold is to discard documents where that fraction exceeds roughly 0.25-0.30.

Related filters target:
- Bullet/special-character density: pages where more than a threshold fraction of lines begin with •, |, #, or similar symbols are likely navigation menus or structured tables rather than prose.
- Digit ratio: code, log files, and financial tables show a high ratio of digit characters to total characters.
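The three ratios above can be sketched as follows. This version counts whitespace as a non-alphabetic character, which is one reading of "total characters"; the marker set and the digit and bullet thresholds are illustrative assumptions, not published values.

```python
BULLET_MARKERS = ("•", "|", "#", "-", "*")  # illustrative set, extend as needed

def non_alpha_fraction(doc: str) -> float:
    """Fraction of all characters (whitespace included) that are not letters."""
    if not doc:
        return 1.0
    return sum(not c.isalpha() for c in doc) / len(doc)

def digit_fraction(doc: str) -> float:
    """Fraction of all characters that are digits."""
    if not doc:
        return 0.0
    return sum(c.isdigit() for c in doc) / len(doc)

def bullet_line_fraction(doc: str) -> float:
    """Fraction of non-empty lines that begin with a bullet-like marker."""
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith(BULLET_MARKERS) for ln in lines) / len(lines)

def passes_symbol_filters(doc: str, max_non_alpha: float = 0.30,
                          max_digits: float = 0.20, max_bullets: float = 0.90) -> bool:
    """Keep the document only if all three ratios stay under their thresholds."""
    return (non_alpha_fraction(doc) <= max_non_alpha
            and digit_fraction(doc) <= max_digits
            and bullet_line_fraction(doc) <= max_bullets)
```

Whether whitespace counts toward the denominator changes the natural baseline considerably, so a threshold should always be calibrated against the same definition it will be applied with.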

Line-terminal-punctuation filter

The most debated single rule in corpus curation. C4 keeps only lines that end with a terminal punctuation mark (period, exclamation mark, question mark, or closing quotation mark). This aggressively eliminates navigation links, list items, and code fragments.
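A minimal sketch of the line-level rule, assuming the mark set listed above. Treating the curly closing quote as terminal is an added assumption, and `char_fraction_removed` is a hypothetical helper that uses characters as a rough proxy for token loss.

```python
# Terminal marks as described above; the curly closing quote is an added assumption.
TERMINAL_MARKS = (".", "!", "?", '"', "\u201d")

def keep_terminated_lines(doc: str) -> str:
    """Keep only lines whose last non-space character is a terminal mark."""
    kept = [ln for ln in doc.splitlines() if ln.rstrip().endswith(TERMINAL_MARKS)]
    return "\n".join(kept)

def char_fraction_removed(original: str, filtered: str) -> float:
    """Share of characters the filter dropped."""
    return 1 - len(filtered) / len(original) if original else 0.0

page = "Home\nProducts\nThis sentence survives.\nprint(x)\nDoes this one?"
cleaned = keep_terminated_lines(page)
loss = char_fraction_removed(page, cleaned)
```

Tracking the loss alongside the cleaned text makes the cost of the rule visible per document, which is the number that matters when deciding whether to apply it globally.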

The FineWeb team (Hugging Face, 2024) ablated the C4 filters one at a time and found that the terminal-punctuation filter, while effective in isolation, removed around 30% of tokens when applied globally. They chose not to include it in their pipeline for that reason, accepting some noise in exchange for coverage. (The 50-plus candidate statistics they also examined were used to design their own custom rules, not to evaluate this one.) This is a canonical example of the precision-recall trade-off in heuristic design: stricter rules get cleaner text but discard real content at the margins.

Repetition filters

Low-information pages often contain repeated phrases or duplicate lines: cookie banners copied across every page of a site, auto-generated product descriptions looping the same adjectives, forum threads where a quote is pasted in full before each short reply.

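The FineWeb paper (arxiv.org/abs/2406.17557) identified three highly effective document-level rules. Only the first measures repetition directly; the other two target pages made up of short or unpunctuated lines: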

Rule	Threshold used	% tokens removed
Fraction of characters in duplicated lines	>= 0.10	~12.5%
Fraction of lines shorter than 30 characters	>= 0.67	~3.7%
Fraction of lines ending with punctuation	<= 0.12	~10.1%

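The duplicated-line-character ratio is particularly powerful because it catches boilerplate that recurs verbatim within a single page (a banner or footer block repeated once per section), even when the page as a whole is unique and would survive document-level deduplication.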
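The three statistics in the table can be sketched as below, using the thresholds listed there. How duplicates are counted (here every occurrence of a repeated line, including the first) and which characters count as punctuation are assumptions of this sketch; the FineWeb implementation may differ in those details.

```python
from collections import Counter
from typing import List

TERMINAL_MARKS = (".", "!", "?", '"')

def _lines(doc: str) -> List[str]:
    """Non-empty lines with surrounding whitespace stripped."""
    return [ln.strip() for ln in doc.splitlines() if ln.strip()]

def dup_line_char_fraction(doc: str) -> float:
    """Fraction of characters that sit in lines occurring more than once."""
    lines = _lines(doc)
    total = sum(len(ln) for ln in lines)
    if total == 0:
        return 0.0
    counts = Counter(lines)
    return sum(len(ln) for ln in lines if counts[ln] > 1) / total

def short_line_fraction(doc: str, max_len: int = 30) -> float:
    """Fraction of lines shorter than max_len characters."""
    lines = _lines(doc)
    return sum(len(ln) < max_len for ln in lines) / len(lines) if lines else 0.0

def punct_line_fraction(doc: str) -> float:
    """Fraction of lines that end with a terminal punctuation mark."""
    lines = _lines(doc)
    return sum(ln.endswith(TERMINAL_MARKS) for ln in lines) / len(lines) if lines else 0.0

def passes_table_rules(doc: str) -> bool:
    """Discard when any rule from the table fires."""
    return (dup_line_char_fraction(doc) < 0.10
            and short_line_fraction(doc) < 0.67
            and punct_line_fraction(doc) > 0.12)
```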

Keyword and pattern filters

Some signals are categorical rather than statistical:

- Lorem ipsum: presence of the standard placeholder text is a reliable indicator of a staging site or template page never replaced with real content.
- Curly braces: pages containing { or }, especially at high frequency, are often templates, code, or JSON bleed-through. C4 goes further and drops any page that contains a curly bracket at all, to keep code out of a prose corpus; if you are building a code-inclusive corpus you would drop or invert this rule.
- Policy boilerplate: lines containing "cookie policy", "privacy policy", "terms of use", or "all rights reserved" are near-certain navigation/footer text. Removing them at the line level rather than discarding the whole document is a more surgical approach.
- Bad-word lists: the C4 pipeline removes any document containing words from a publicly released list of offensive terms. This is a blunt instrument; it will discard news articles covering hate crimes as readily as it discards actual slurs. More refined approaches do sentence-level toxicity scoring instead.
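The four pattern rules above might be combined as in the sketch below. The function name, the `max_braces` knob, and the empty default `blocklist` are assumptions for illustration; supply your own term list, and raise `max_braces` if you want a frequency-based brace rule rather than the strict one.

```python
import re
from typing import FrozenSet, Optional

POLICY_PHRASES = ("cookie policy", "privacy policy", "terms of use", "all rights reserved")

def keyword_filter(doc: str, blocklist: FrozenSet[str] = frozenset(),
                   max_braces: int = 0) -> Optional[str]:
    """Return cleaned text, or None if the whole document should be discarded."""
    lowered = doc.lower()
    # Document-level rejections.
    if "lorem ipsum" in lowered:
        return None
    if doc.count("{") + doc.count("}") > max_braces:
        return None
    if blocklist and blocklist & set(re.findall(r"[a-z']+", lowered)):
        return None
    # Line-level removal: drop policy boilerplate but keep the rest of the page.
    kept = [ln for ln in doc.splitlines()
            if not any(phrase in ln.lower() for phrase in POLICY_PHRASES)]
    return "\n".join(kept)
```

Keeping document-level and line-level rules in separate stages makes it easy to log which rule fired, which helps when auditing false positives such as the news-article case mentioned above.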

Language identification

langdetect or fastText-based language identification is often grouped with heuristic filters even though it is, strictly speaking, a small statistical classifier: it is trained separately, is cheap enough to run on CPU, and is applied as a lookup rather than as part of the main quality model. C4 retains only documents where the detector assigns English with probability >= 0.99. For multilingual corpora the same logic applies per language, with thresholds relaxed for low-resource languages where overconfident filtering would leave too little data.
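A sketch of per-language thresholding is shown below. To stay independent of any specific detector, it assumes a hypothetical `predict` callable that wraps whichever library you use and returns a language code with a probability; `stub_predict` is a stand-in for the example only, and the relaxed Icelandic threshold is a made-up value.

```python
from typing import Callable, Mapping, Tuple

# Hypothetical interface: wrap your detector so that it returns
# (language_code, probability) for a document.
Predictor = Callable[[str], Tuple[str, float]]

# Per-language minimum confidence. 0.99 for English is the C4 value quoted
# above; the relaxed value for Icelandic is purely illustrative.
MIN_PROB_BY_LANG = {"en": 0.99, "is": 0.80}

def keep_by_language(doc: str, predict: Predictor,
                     min_prob: Mapping[str, float] = MIN_PROB_BY_LANG) -> bool:
    """Keep the document if its language is wanted and the detector is confident enough."""
    lang, prob = predict(doc)
    return lang in min_prob and prob >= min_prob[lang]

def stub_predict(doc: str) -> Tuple[str, float]:
    """Stand-in detector for the example only; not a real language model."""
    return ("is", 0.85) if "ð" in doc else ("en", 0.97)

icelandic_kept = keep_by_language("Það er gott veður.", stub_predict)
english_kept = keep_by_language("Plain English text.", stub_predict)
```

Here the Icelandic document passes its relaxed threshold while the English one, at 0.97, fails the stricter 0.99 gate, which is the asymmetry the paragraph above describes.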

Tuning thresholds: the practical approach

The FineWeb methodology is instructive. Rather than guessing thresholds:

1. Compute a large battery of per-document statistics over a deduplicated sample.
2. Compare the distribution of each statistic between a "high quality" reference set (Wikipedia, books) and the raw crawl.
3. Find the inflection point in the histogram where the quality-positive and quality-negative distributions diverge, and set the threshold there.
4. Measure what fraction of tokens each rule removes. Any single rule removing more than roughly 15-20% of the corpus warrants extra scrutiny; it may be accurate but costly in coverage, or it may be over-triggering on legitimate content.

This is empirical but principled: you are not inventing thresholds from intuition, you are reading them from the data.

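# Simplified pseudo-code for a threshold-calibration loop.
# candidate_stats is a list of (name, function) pairs; find_inflection,
# fraction_removed and pipeline are placeholders, not a real library API.
for stat_name, compute_fn in candidate_stats:
    # Score the same statistic on the reference set and on the raw crawl.
    values_hq = [compute_fn(doc) for doc in high_quality_sample]
    values_raw = [compute_fn(doc) for doc in raw_crawl_sample]
    # Cut where the two distributions diverge.
    threshold = find_inflection(values_hq, values_raw)
    # Share of the raw sample this rule would discard at that cut.
    removal_rate = fraction_removed(raw_crawl_sample, stat_name, threshold)
    if removal_rate < MAX_ACCEPTABLE_LOSS:
        pipeline.add_filter(stat_name, threshold)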
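The helpers in the loop above are left undefined. One possible way to make the "inflection point" concrete, offered as an assumption and not as the FineWeb procedure, is to pick the cut that maximises the gap between the share of raw documents discarded and the share of reference documents discarded. The names `find_threshold` and `fraction_removed` are illustrative.

```python
from typing import Optional, Sequence

def fraction_removed(values: Sequence[float], threshold: float,
                     discard_above: bool = True) -> float:
    """Share of documents whose statistic falls on the discard side of the cut."""
    if discard_above:
        return sum(v >= threshold for v in values) / len(values)
    return sum(v <= threshold for v in values) / len(values)

def find_threshold(values_hq: Sequence[float], values_raw: Sequence[float],
                   discard_above: bool = True) -> Optional[float]:
    """Pick the cut maximising (raw share discarded) - (reference share discarded)."""
    best_t, best_gap = None, float("-inf")
    # Every observed value is a candidate cut point.
    for t in sorted(set(values_hq) | set(values_raw)):
        gap = (fraction_removed(values_raw, t, discard_above)
               - fraction_removed(values_hq, t, discard_above))
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy values for a "fraction of characters in duplicated lines" statistic.
values_hq = [0.0, 0.01, 0.02, 0.05]
values_raw = [0.0, 0.02, 0.3, 0.5, 0.6]
threshold = find_threshold(values_hq, values_raw)
loss = fraction_removed(values_raw, threshold)
```

With these toy values the chosen cut discards none of the reference documents, but a real calibration should still be checked by reading a sample of what the rule removes.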

When it falls down

Domain-inappropriate filters. A terminal-punctuation rule calibrated on news prose will aggressively discard code documentation, mathematical writing with display equations, and dialogue-heavy text. Every heuristic embeds assumptions about what "good text" looks like, and those assumptions are genre-specific.

Language mismatch. Filters calibrated on English Common Crawl perform poorly on morphologically rich languages (Turkish, Finnish) where word lengths and symbol ratios differ, and on code-switching documents that mix languages.

Adversarial content. Spam pages designed to pass quality classifiers often also pass heuristic filters: they use well-formed sentences with terminal punctuation and acceptable symbol ratios; they simply say nothing. Heuristic filters are first-pass noise removal, not semantic quality assessment.

Correlated false positives. Multiple filters compounding on the same corpus can have surprising joint removal rates. If filter A removes 10% of documents and filter B removes 12%, and they are correlated (both fire on short pages), their union might remove only 14% rather than the naive 22%. Measuring filter overlap is essential before deploying a multi-rule pipeline.
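The arithmetic behind that example is inclusion-exclusion: union = A + B - overlap, so a 14% union with marginals of 10% and 12% implies the two filters share 8% of documents. Fully independent filters would instead remove 1 - 0.90 × 0.88 = 20.8%. A small sketch for measuring this on a sample follows; `removal_report` and the two toy filters are illustrative names.

```python
from typing import Callable, Dict, Sequence

def removal_report(docs: Sequence[str],
                   keep_a: Callable[[str], bool],
                   keep_b: Callable[[str], bool]) -> Dict[str, float]:
    """Marginal, joint, and union removal rates for two keep/discard filters."""
    n = len(docs)
    removed_a = {i for i, d in enumerate(docs) if not keep_a(d)}
    removed_b = {i for i, d in enumerate(docs) if not keep_b(d)}
    return {
        "a": len(removed_a) / n,
        "b": len(removed_b) / n,
        "both": len(removed_a & removed_b) / n,
        "union": len(removed_a | removed_b) / n,
    }

docs = [
    "short",
    "tiny 123",
    "a long enough document without digits",
    "another long document 2024 edition",
]
report = removal_report(
    docs,
    keep_a=lambda d: len(d.split()) >= 3,            # toy length filter
    keep_b=lambda d: not any(c.isdigit() for c in d) # toy digit filter
)
```

Comparing `report["union"]` with `report["a"] + report["b"]` shows directly how much of the second filter's work the first one had already done.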

Threshold brittleness across crawl snapshots. Common Crawl content distribution shifts over time. A threshold calibrated on a 2020 crawl may over-trigger on a 2024 crawl where the web itself has changed (more JavaScript-heavy SPAs, different boilerplate patterns). Pipelines that do not re-calibrate across snapshots risk quietly degrading corpus quality.
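One lightweight guard against this kind of drift is to record each rule's removal rate at calibration time and compare it against every new snapshot. The sketch below assumes rates are stored as plain dictionaries; the function name, the rule names, the rates, and the 0.05 tolerance are all hypothetical.

```python
from typing import Dict, Tuple

def drifted_rules(baseline: Dict[str, float], current: Dict[str, float],
                  tolerance: float = 0.05) -> Dict[str, Tuple[float, float]]:
    """Return rules whose removal rate moved by more than `tolerance` (absolute)."""
    return {
        name: (baseline[name], rate)
        for name, rate in current.items()
        if name in baseline and abs(rate - baseline[name]) > tolerance
    }

# Made-up removal rates for a calibration snapshot and a later one.
baseline = {"dup_lines": 0.12, "short_lines": 0.04}
current = {"dup_lines": 0.125, "short_lines": 0.11}
flagged = drifted_rules(baseline, current)
```

A flagged rule is a prompt to re-run calibration on the new snapshot, not proof that the threshold is wrong; the web may simply contain more of what the rule was designed to remove.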