Toxicity Filtering of Pretraining Data

About 15% of a naively filtered CommonCrawl snapshot contains text a reasonable person would flag as sexually explicit, violent, or hateful. Train an LLM on that fraction unaltered and you hand the model a free lesson in generating the very content you later try to suppress with fine-tuning and RLHF. Worse, the contamination is not uniformly distributed: toxic content clusters around certain topics, certain communities, and certain kinds of web pages, so leaving it in introduces biases that skew downstream behaviour in ways that are hard to audit.

Toxicity filtering is the pipeline stage that catches and removes this material before the model ever sees it. It sits downstream of language identification and quality filtering, and upstream of deduplication. Getting it wrong in either direction costs you, whether that is degraded safety or, as Dodge et al. (2021) showed, a corpus from which minority-community voices have been over-removed.

What Counts as "Toxic" in This Context

The practical definition is narrower than the philosophical one. For pretraining data curation, toxic content is typically operationalised across a small number of dimensions:

Category	Examples
Hate speech	Slurs, dehumanising content targeting protected groups
Sexually explicit	Pornography, non-consensual depictions
Violence / gore	Graphic descriptions of physical harm
Harassment	Sustained personal attacks, doxxing
Self-harm / suicide	Instructional content, glorification

Not every dirty word falls into these categories. Profanity in fiction, clinical descriptions of violence, or academic analysis of extremist rhetoric may contain flagged tokens but are not themselves harmful. This is the core tension the filtering stack has to resolve.

Three Filtering Strategies

1. URL and Domain Blocklists

The cheapest approach is never to download certain pages at all. The UT1 blocklist (maintained at the University of Toulouse Capitole) categorises millions of domains into topics including pornography, violence, and hate. FineWeb (Penedo et al., 2024) applies URL-level blocklist filtering as its primary adult-content control.

Blocklists are fast, deterministic, and require no GPU budget. The failure modes are obvious: lists go stale, they offer no coverage of toxic content hosted on benign-category domains (Reddit, Twitter, personal blogs, news comment sections), and they do nothing for text that arrives without a reliable source URL, such as scanned books or PDF collections gathered outside the web crawl.
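
A minimal sketch of domain-level matching, assuming the blocklist has already been loaded into a set of lowercase domain names. The domains and the `is_blocked` helper are hypothetical and only illustrate the idea; this is not how any particular pipeline implements it.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries; a real list such as UT1 holds millions of domains.
BLOCKED_DOMAINS = {"adult-example.test", "hate-example.test"}

def is_blocked(url: str, blocked: set[str] = BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocklisted."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    # For "cdn.adult-example.test", check the full host, then each parent domain.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))

urls = [
    "https://cdn.adult-example.test/page1",     # subdomain of a blocked domain
    "https://news-example.test/comments/123",   # benign domain, content unknown
]
kept = [u for u in urls if not is_blocked(u)]
```

The second URL is kept whatever its comment section contains, which is exactly the coverage gap described above.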

2. Word and N-gram Blocklists

C4, the dataset underlying T5 (Raffel et al., 2020), uses a list of "dirty, naughty, obscene or otherwise bad words" to discard any document containing even one match. The list is published openly on GitHub and is often abbreviated LDNOOBW; its English version contains a few hundred entries including slurs, explicit sexual terms, and a handful of drug-related terms.

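# C4-style blocklist filter (sketch). Assumes "ldnoobw.txt" is a local copy
# of the word list, one entry per line.
import re

with open("ldnoobw.txt", encoding="utf-8") as f:
    BLOCKLIST = {line.strip().lower() for line in f if line.strip()}  # ~400 entries

def keep_document(text: str) -> bool:
    # Extract word tokens so trailing punctuation ("word," / "word.") cannot hide a match.
    # Multi-word entries in the list would need phrase matching, omitted here.
    tokens = re.findall(r"\w+", text.lower())
    return not any(tok in BLOCKLIST for tok in tokens)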

Recall on undisguised explicit content is high: little text that uses the listed vocabulary verbatim survives. Precision is low: Dodge et al. (2021) demonstrated that applying this filter to C4 disproportionately removed documents discussing LGBTQ+ topics and HIV/AIDS, as well as documents written in African American Vernacular English, because the blocklist contains words that appear legitimately in these communities' own discourse. The filter is fast but blunt.

3. Toxicity Classifiers

The third approach trains a model to score documents on toxicity dimensions rather than matching fixed tokens. The Perspective API (developed by Jigsaw, a unit within Google) is the most widely cited example: it returns probability-like scores for attributes including toxicity, severe toxicity, profanity, identity attack, insult, and threat. Several large training pipelines used Perspective scores as a filter signal, dropping documents above a threshold (commonly 0.7 on the toxicity attribute).

The Dolma corpus (Soldaini et al., 2024) applies classifier-based toxicity filtering explicitly, combining rule-based heuristics with classifiers trained on annotated hate speech and explicit content datasets. This gives finer-grained control than a word list: a document can contain slurs in an analytical context and still survive if the classifier scores it below the threshold.

Classifier-based filtering generalises better than blocklists, but introduces three new problems. First, the classifier reflects its training labels, which were produced by human annotators; those annotators bring demographic and cultural perspectives that affect what they labelled toxic. Second, threshold selection is a genuine hyperparameter: at threshold 0.5, a classifier might remove 8% of documents; at 0.8, only 1.5%. On a corpus of one trillion tokens, and assuming removed documents are of typical length, that 6.5-point gap corresponds to roughly 65 billion tokens. Third, classifiers fail on languages and dialects underrepresented in their training data.
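
The threshold trade-off can be made concrete with a small sweep. The sketch below assumes toxicity scores have already been computed for a sample of documents; the scores are invented for illustration and `removal_rate` is a hypothetical helper, not part of any library.

```python
def removal_rate(scores: list[float], threshold: float) -> float:
    """Fraction of documents whose toxicity score is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical classifier scores for ten documents (illustrative only).
scores = [0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.55, 0.72, 0.81, 0.95]

for threshold in (0.5, 0.7, 0.8):
    rate = removal_rate(scores, threshold)
    print(f"threshold={threshold:.1f}  removed={rate:.0%}")
```

Running the same sweep on a labelled sample, rather than raw scores alone, is what lets you see how much of the removed fraction at each threshold was actually toxic.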

A schematic of where toxicity filtering slots into the broader pipeline:

Raw CommonCrawl dump
    -> URL/domain blocklist          (domain-level, pre-download)
    -> Text extraction
    -> Language identification
    -> Quality filters (perplexity, length, repetition)
    -> [Toxicity filtering]
         - URL blocklist (second pass, for missed domains)
         - Word/n-gram blocklist
         - Toxicity classifier scoring
    -> Deduplication
    -> Final corpus
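
The toxicity stage of the schematic can be sketched as a chain of keep/drop predicates, ordered from cheapest to most expensive so that the classifier only sees documents the earlier checks let through. Everything here is a placeholder: the document shape, the stage functions, the blocked domain, the single-word list, and the precomputed "score" field are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator
from urllib.parse import urlsplit

Doc = dict                      # assumed shape: {"url": str, "text": str, "score": float}
Stage = Callable[[Doc], bool]   # returns True to keep the document

def url_ok(doc: Doc) -> bool:
    # Placeholder for the second-pass URL blocklist.
    return urlsplit(doc["url"]).hostname not in {"blocked-example.test"}

def words_ok(doc: Doc) -> bool:
    # Placeholder for the word/n-gram blocklist.
    return not ({"badword"} & set(doc["text"].lower().split()))

def classifier_ok(doc: Doc) -> bool:
    # "score" is assumed to be a precomputed toxicity probability.
    return doc.get("score", 0.0) < 0.7

STAGES: list[tuple[str, Stage]] = [
    ("url", url_ok),
    ("words", words_ok),
    ("classifier", classifier_ok),
]

def toxicity_filter(docs: Iterable[Doc], dropped: dict[str, int]) -> Iterator[Doc]:
    """Yield documents that pass every stage; count drops by the first failing stage."""
    for doc in docs:
        for name, keep in STAGES:
            if not keep(doc):
                dropped[name] = dropped.get(name, 0) + 1
                break
        else:
            yield doc

docs = [
    {"url": "https://blocked-example.test/a", "text": "hello", "score": 0.1},
    {"url": "https://ok-example.test/b", "text": "a badword here", "score": 0.2},
    {"url": "https://ok-example.test/c", "text": "clean text", "score": 0.9},
    {"url": "https://ok-example.test/d", "text": "clean text", "score": 0.1},
]
dropped: dict[str, int] = {}
kept = list(toxicity_filter(docs, dropped))
```

Recording which stage dropped each document makes it possible to audit the stages separately later.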

The Equity Dimension

Blocklist-based filtering has an uncomfortable property: marginalised groups often discuss their own marginalisation using the same vocabulary that appears in hate speech directed at them. Reclaimed slurs, community-specific vernacular, and first-person accounts of discrimination all trip word-list filters.

The Dodge et al. (2021) audit of C4 found that documents mentioning gay, lesbian, or transgender identities were removed by blocklist filtering at markedly higher rates than documents mentioning other identities. Documents written in African American Vernacular English were similarly over-removed. This is not a minor statistical quirk; it systematically underrepresents these communities in the training corpus, which in turn affects how well the trained model handles queries from or about those communities.

Classifier-based filters can replicate and sometimes amplify this disparity if the underlying training data for the classifier is not carefully balanced. Equity-aware evaluation of toxicity filters requires measuring removal rates across community-linked topics, not just averaging across the full corpus.
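
One way to run such an evaluation is to tag a sample of documents with a topic or dialect group, record whether the filter removed each one, and compare per-group removal rates against a baseline. The sketch below assumes the group labels already exist; the group names and counts are invented for illustration and are not taken from any published audit.

```python
from collections import Counter

def removal_rates(records):
    """records: iterable of (group, removed) pairs -> {group: fraction removed}."""
    totals, removed = Counter(), Counter()
    for group, was_removed in records:
        totals[group] += 1
        removed[group] += int(was_removed)
    return {group: removed[group] / totals[group] for group in totals}

# Hypothetical audit sample: 100 documents per group.
records = (
    [("baseline", True)] * 5 + [("baseline", False)] * 95
    + [("lgbtq_topics", True)] * 12 + [("lgbtq_topics", False)] * 88
)

rates = removal_rates(records)
# Ratio of a group's removal rate to the baseline; 1.0 means parity.
disparity = rates["lgbtq_topics"] / rates["baseline"]
print(rates, f"disparity={disparity:.1f}x")
```

A corpus-wide average would hide this kind of gap, since a small group's elevated removal rate barely moves the overall number.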

When It Falls Down

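False negatives at scale. FineWeb's dataset documentation openly acknowledges that a significant number of documents in the final dataset could still be considered toxic. The arithmetic is unforgiving: if roughly 15% of a 15-trillion-token pool is toxic, as the estimate at the start of this article suggests, a filter operating at 99.9% recall still lets through about two billion tokens of harmful content, and one at 99% recall lets through more than twenty billion. Toxicity filtering reduces the problem; it does not solve it.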
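
A back-of-the-envelope calculation shows how the residual scales with recall. The toxic share is an assumption (the 15% estimate from the introduction), and the result is only as good as that input; `residual_toxic_tokens` is a hypothetical helper.

```python
def residual_toxic_tokens(corpus_tokens: float, toxic_fraction: float, recall: float) -> float:
    """Tokens of toxic content that survive a filter with the given recall."""
    return corpus_tokens * toxic_fraction * (1.0 - recall)

CORPUS_TOKENS = 15e12    # 15 trillion tokens, as in the text
TOXIC_FRACTION = 0.15    # assumption: the ~15% estimate from the introduction

for recall in (0.99, 0.999, 0.9999):
    left = residual_toxic_tokens(CORPUS_TOKENS, TOXIC_FRACTION, recall)
    print(f"recall={recall:.2%}  residual ~ {left / 1e9:.2f} billion tokens")
```

Under these assumptions the residual is about 22.5 billion tokens at 99% recall and about 2.25 billion at 99.9%, so each extra "nine" of recall buys one order of magnitude.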

Cross-lingual brittleness. Most classifiers and word lists are English-centric. Toxic content in Hindi, Arabic, or Yoruba often passes undetected, producing multilingual corpora with uneven safety properties across languages.

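Code-switching and obfuscation. Deliberate evasion (leet-speak substitutions such as h4t3, inserted punctuation, or framing of the "hate, but as a joke" kind) fools both word-list and classifier approaches. Simple character substitutions are enough to defeat an exact-match word list outright, and adversarial evaluations have reported that they can lower classifier scores substantially as well. Text that switches languages mid-sentence poses a related problem for English-centric filters.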
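
A common partial mitigation is to normalise text before matching. The sketch below is one illustrative approach, not a description of any deployed filter: the substitution table is a small hypothetical sample, and the normalised string should be used only for matching, since it mangles legitimate digits (a year such as 2024 is rewritten too).

```python
import re
import unicodedata

# Hypothetical, deliberately small substitution table.
LEET_MAP = str.maketrans({
    "4": "a", "@": "a", "3": "e", "1": "i",
    "0": "o", "5": "s", "$": "s", "7": "t",
})

def normalise(text: str) -> str:
    """Return a matching-only view of the text with common obfuscations undone."""
    # NFKC folds full-width and other compatibility characters to plain forms.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(LEET_MAP)
    # Collapse runs of three or more identical characters ("haaaate" -> "hate").
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(normalise("h4t3"))      # hate
print(normalise("Haaaate"))   # hate
```

This catches only the simplest substitutions; it does nothing about sarcasm, irony, or framing, which remain a classifier problem.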

Threshold instability. Classifier thresholds chosen on validation sets during pipeline development may shift in their coverage as CommonCrawl dumps evolve over time: newer dumps may contain a different mix of sources, reflect different community norms, and have different baseline toxicity rates. A threshold tuned on 2020 data may over- or under-filter 2024 data.
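
A simple guard is to re-measure the removal rate on each new dump at the fixed threshold and flag large movements for manual review. The sketch below assumes scores for a sample of each dump are available; the scores, the 0.7 threshold, and the tolerance are all hypothetical.

```python
def removal_rate(scores, threshold):
    """Fraction of documents scoring at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def drifted(ref_scores, new_scores, threshold, tolerance=0.02):
    """True if the removal rate moved by more than `tolerance` (absolute) between dumps."""
    delta = removal_rate(new_scores, threshold) - removal_rate(ref_scores, threshold)
    return abs(delta) > tolerance

# Invented score samples standing in for a 2020 and a 2024 dump.
scores_2020 = [0.1] * 95 + [0.9] * 5     # 5% at or above 0.7
scores_2024 = [0.1] * 90 + [0.9] * 10    # 10% at or above 0.7

needs_review = drifted(scores_2020, scores_2024, threshold=0.7)
```

A change in removal rate does not say whether the filter or the data moved, so a flagged dump still needs a labelled spot check.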

Domain erasure. Aggressive toxicity filtering has been shown to remove legal documents, clinical trials, and academic papers on sensitive topics, because content about violence, sexual abuse, or hate crimes in analytical and professional contexts matches the same surface features as genuinely harmful material.

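Interaction with deduplication. Deduplication runs after toxicity filtering in most pipelines, so the filter makes an independent decision on every copy of a duplicated document. A toxic document that survives filtering may exist in thousands of copies, and any copies the deduplicator then misses (lightly edited reposts, templated spam) concentrate its influence far beyond what a document-level filter intended to allow.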
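
To gauge how much amplification is at stake, you can count copies among the documents that survive filtering. The sketch below uses an exact-match hash after light normalisation, which is an assumption made for brevity; it will not catch near-duplicates, which need fuzzy methods such as MinHash.

```python
import hashlib
from collections import Counter

def fingerprint(text: str) -> str:
    """Exact-match fingerprint after case and whitespace normalisation."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def copy_counts(docs):
    """Map each fingerprint to the number of surviving copies."""
    return Counter(fingerprint(d) for d in docs)

# Hypothetical survivors of the toxicity filter.
survivors = ["Some text.", "some   TEXT.", "Other text."]
counts = copy_counts(survivors)
most_copied = max(counts.values())
```

Documents with a high copy count are the ones where a single false negative from the filter matters most.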