Language Identification and Filtering

A 2019 Common Crawl dump spans roughly 2.5 billion web pages. At least 40% of those pages are not in English. If you are training an English-only model and skip language filtering, you will inject noise that no perplexity filter can cleanly undo, because a French or Russian document may score perfectly well under an English n-gram model applied naively. Language identification is therefore not a courtesy step; it is the first hard gate in any serious pretraining pipeline.

What Language Identification Actually Does

The task is deceptively simple to state: given a text string, return a language code such as en, zh-Hans, or pt. In practice, the classifier must handle documents that are mostly one language but contain code, URLs, boilerplate navigation text in a different language, or transliterated proper nouns. Three approaches dominate production pipelines:

Character n-gram models (fastText lid.176). Facebook Research's fastText language-identification model covers 176 languages; its compressed variant (lid.176.ftz) fits in under 1 MB, while the full lid.176.bin is roughly 126 MB. It operates on character n-grams hashed into a fixed embedding table, which makes it fast (on the order of a thousand web documents per second per CPU core, so it parallelises cheaply across a cluster) and robust to out-of-vocabulary words. The model outputs a label and a confidence score.

Byte-pair encoding baselines. Some pipelines use the fraction of a document's bytes covered by a language-specific BPE vocabulary as a proxy for language purity. If only 30% of a document's bytes are covered by the English BPE vocabulary, it is probably not English, regardless of what the classifier says. This is crude but surprisingly effective as a second-pass filter.
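The vocabulary-coverage idea can be illustrated with a minimal sketch. It is a simplification: instead of running real BPE merges, it does greedy longest-match against a set of subword pieces, and the function name, the toy vocabulary, and the max_piece_len parameter are all assumptions made for illustration.

```python
def vocab_byte_coverage(text, vocab, max_piece_len=8):
    """Fraction of UTF-8 bytes covered by multi-character pieces from `vocab`.

    Simplified stand-in for a BPE-based check: greedy longest-match,
    ignoring single-character pieces (which would match almost anything).
    """
    covered = 0
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to pieces of length 2.
        for j in range(min(len(text), i + max_piece_len), i + 1, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is not None:
            covered += len(match.encode("utf-8"))
            i += len(match)
        else:
            i += 1  # uncovered character, move on
    total = len(text.encode("utf-8"))
    return covered / total if total else 0.0


# Toy vocabulary; a real one would come from a trained English tokenizer.
toy_vocab = {"the", "ing", "and", "lang", "uage", "is", "en", "gl", "ish"}
english_cov = vocab_byte_coverage("the language is english", toy_vocab)
russian_cov = vocab_byte_coverage("это русский текст", toy_vocab)
```

With a realistic vocabulary, the cutoff would need tuning on held-out data; the 30% figure mentioned above is only a starting point.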

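Neural sequence classifiers. Transformer-based classifiers claim higher accuracy on short texts and closely related language pairs, at the cost of an order-of-magnitude higher inference time. For corpora measured in trillions of tokens, that compute overhead matters. (Lingua and GlotLID, which are sometimes grouped here, are not transformer-based: Lingua is a statistical n-gram library and GlotLID is itself a fastText model with much wider language coverage.)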

In practice, nearly every large-scale pipeline (CCNet, RefinedWeb, the Llama 2 data pipeline, DataComp-LM) uses fastText as the primary classifier and then applies a confidence threshold rather than taking the top-1 prediction unconditionally.

The Confidence Threshold Problem

FastText returns a score in (0, 1). A common choice is to keep documents where score > 0.65 for the target language. But this single number hides a lot of nuance:

Threshold	Effect on English corpus
0.5	Keeps many borderline cases; substantial non-English noise
0.65	Reasonable default; drops ~5-8% of borderline English pages
0.85	Much cleaner; but drops significant amounts of dialectal English, English with heavy code-switching, and legitimate bilingual pages
0.95	Very high precision; meaningful recall loss on informal web text
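Before fixing a threshold, it can help to measure how much of a sample survives at each candidate value. The following is a minimal sketch; the function name and the sample predictions are made up for illustration, and in practice the (label, score) pairs would come from running the classifier over a few thousand sampled documents.

```python
def retention_by_threshold(predictions, target, thresholds):
    """Return {threshold: fraction of documents kept} for a target language.

    `predictions` is an iterable of (label, score) pairs, one per document.
    A document is kept when its label is `target` and its score exceeds the threshold.
    """
    preds = list(predictions)
    if not preds:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for label, score in preds if label == target and score > t) / len(preds)
        for t in thresholds
    }


# Hypothetical classifier outputs for five documents.
sample = [("en", 0.99), ("en", 0.90), ("en", 0.70), ("en", 0.55), ("fr", 0.97)]
retention = retention_by_threshold(sample, "en", [0.5, 0.65, 0.85, 0.95])
```

Reading a handful of documents from each band between adjacent thresholds is usually more informative than the retention numbers alone.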

The right threshold depends on your target language and domain. For high-resource languages like English or German, a threshold around 0.65 is common. For low-resource languages like Swahili or Welsh, the same threshold will discard documents that are legitimately in the target language but happen to borrow many English words. CCNet mitigated a related problem by deduplicating at the paragraph level before language identification, which strips repeated English boilerplate (cookie notices, navigation) from pages in other languages. Some pipelines go further and run language identification per paragraph rather than per document, then accept documents where the dominant detected language is the target.
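Per-paragraph identification with a dominant-language rule might be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of any particular pipeline: predict is a hypothetical callable returning a (label, score) pair, toy_predict is a keyword-based stand-in for a real classifier, and weighting paragraphs by character length is one design choice among several.

```python
from collections import Counter


def dominant_language(document, predict, min_score=0.5):
    """Return (label, share) for the language covering the most characters.

    Each non-empty paragraph is classified separately. Paragraphs whose
    score falls below `min_score` count towards the total length but not
    towards any language.
    """
    chars_by_label = Counter()
    total_chars = 0
    for paragraph in document.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        label, score = predict(paragraph)
        total_chars += len(paragraph)
        if score >= min_score:
            chars_by_label[label] += len(paragraph)
    if not chars_by_label:
        return None, 0.0
    label, chars = chars_by_label.most_common(1)[0]
    return label, chars / total_chars


def toy_predict(paragraph):
    """Keyword-based stand-in for a real classifier (illustration only)."""
    words = set(paragraph.lower().split())
    if words & {"the", "and", "is"}:
        return "en", 0.9
    if words & {"na", "ya", "kwa"}:
        return "sw", 0.9
    return "und", 0.2


page = "Accept cookies and continue\nHabari ya asubuhi kwa wote na karibu sana\nMenu"
label, share = dominant_language(page, toy_predict)
```

Here the Swahili paragraph dominates despite the English cookie banner, which is the case a document-level score tends to get wrong.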

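A practical pipeline sketch, assuming model is a loaded fastText language-identification model (for example lid.176.bin) and text holds one document:

labels, probs = model.predict(text.replace("\n", " "), k=1)  # predict() rejects newlines
keep = labels[0] == "__label__en" and probs[0] > 0.65

The k=1 call matters for throughput: requesting the full distribution over all 176 labels adds work for no benefit when you only need the top prediction.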

Multilingual Pipelines and Language Mixture Targets

Not every pretraining run is English-only. Models like mT5 and Bloom targeted many languages simultaneously, which turns language identification from a binary gate into a routing step: each document is labelled and placed into a per-language shard.
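The routing step can be sketched in a few lines. The names below (route_to_shards, toy_predict, the "und" fallback shard) are assumptions for illustration; predict stands for any callable that returns a (label, score) pair, and a real pipeline would write to per-language files rather than in-memory lists.

```python
from collections import defaultdict


def route_to_shards(docs, predict, min_score=0.5, fallback="und"):
    """Group documents into per-language shards keyed by predicted label.

    Documents whose top score is below `min_score` go to the `fallback`
    shard instead of being forced into a language.
    """
    shards = defaultdict(list)
    for doc in docs:
        label, score = predict(doc)
        key = label if score >= min_score else fallback
        shards[key].append(doc)
    return dict(shards)


def toy_predict(text):
    """Keyword-based stand-in for a real classifier (illustration only)."""
    lowered = f" {text.lower()} "
    if " the " in lowered:
        return "en", 0.95
    if " le " in lowered or " la " in lowered:
        return "fr", 0.90
    return "en", 0.30


docs = ["The cat sat on the mat", "Le chat est sur la table", "12345 67890"]
shards = route_to_shards(docs, toy_predict)
```

Keeping a fallback shard rather than discarding low-confidence documents outright makes it possible to audit what the classifier is unsure about.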

Once shards exist, a second problem appears: how much of each language to include in the training mixture. Naively mixing proportional to corpus size would give English perhaps 50-60% of all tokens and Swahili a fraction of a percent. But for multilingual models, some form of upsampling of low-resource languages is standard. The mT5 paper used a temperature-based sampling scheme where sampling probability for language \(l\) is:

\[p_l \propto \left(\frac{n_l}{N}\right)^{1/T}\]

where \(n_l\) is the number of tokens in language \(l\), \(N\) is the total token count across all languages, and \(T > 1\) is a temperature parameter. The mT5 paper writes the exponent as \(\alpha = 1/T\) and used \(\alpha = 0.3\), that is \(T \approx 3.3\). Higher \(T\) flattens the distribution, boosting low-resource languages. The cost is that high-resource languages are undersampled relative to their natural occurrence.
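A small worked example makes the flattening effect concrete. The sketch below implements the formula as written, with made-up token counts; the function name and the counts are assumptions for illustration.

```python
def temperature_mixture(token_counts, temperature):
    """Return sampling probabilities p_l proportional to (n_l / N) ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(token_counts.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature)
        for lang, n in token_counts.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Made-up token counts (in arbitrary units) for three languages.
counts = {"en": 900, "de": 90, "sw": 10}
natural = temperature_mixture(counts, 1.0)    # T = 1 reproduces corpus proportions
flattened = temperature_mixture(counts, 2.0)  # T = 2 boosts the smaller languages
```

At T = 1 Swahili gets 1% of samples; at T = 2 it gets roughly 7%, while English falls from 90% to roughly 70%.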

This interacts directly with language identification quality: if your classifier is less accurate for a low-resource language, you will either undersample it (if you are too conservative on confidence) or contaminate it with mislabelled documents from a similar high-resource language (if you are too permissive).

When It Falls Down

Script-sharing languages. Serbian (sr) and Croatian (hr) are mutually intelligible. Serbian is written in both Cyrillic and Latin script, while Croatian uses Latin only, so Latin-script Serbian is very hard to tell apart from Croatian. Bosnian (bs) and Montenegrin (cnr) overlap even further. No character n-gram model reliably separates them; even human annotators disagree on some documents. Any pipeline that reports distinct corpora for these languages should be treated with scepticism.

Code-heavy documents. A page that is 60% Python code and 40% English prose will often be misclassified as some other language because the classifier sees byte sequences unlike anything in its training distribution. The CCNet pipeline deliberately excluded documents below a short-text length threshold partly for this reason.

Transliteration. Japanese written in Romaji, Arabic written in informal Latin script (Arabizi), or Hindi written in Latin letters will fool any script-based component of a classifier and often fool n-gram models trained on the standard script.

Domain shift at evaluation. Classifiers trained on Wikipedia, Tatoeba, and news corpora can misclassify casual forum text, tweets, or historically spelt English (common in digitised books). If your pipeline ingests Project Gutenberg or historical archives, expect non-trivial mislabelling.

Score gaming. Some web content is deliberately multilingual to attract search traffic: a page might embed a list of English keywords inside otherwise Chinese text. The document-level score will be confused. Paragraph-level detection partially mitigates this but adds pipeline complexity.
