Exact and Fuzzy Deduplication

A single 61-word English sentence appeared more than 60,000 times in the C4 dataset. When you train on that kind of repetition, the model does not generalise - it memorises, and then leaks. Lee et al. (ACL 2022) showed that after deduplication, models emitted memorised text ten times less frequently and reached the same validation accuracy in fewer steps. Deduplication is not a quality nicety; it is one of the highest-leverage preprocessing decisions you make for a pretraining corpus.

What counts as a duplicate

There are two usefully different regimes:

Type	Definition	Typical tool
Exact document	Two documents have identical byte sequences	SHA-256 hash equality
Exact substring	Long substrings (>50 tokens) appear verbatim across many docs	Suffix arrays
Near-duplicate (fuzzy)	Two documents are semantically similar but not byte-identical	MinHash + LSH

Exact duplicates arise from web crawls hitting mirror sites, syndication feeds, and boilerplate scrapes of the same page across multiple snapshots. Near-duplicates arise from slightly edited re-posts, scraped comment threads, and template-generated pages that differ only in dates or product IDs. Both are costly for different reasons: exact duplicates inflate the effective dataset size without adding signal; near-duplicates are harder to detect but contribute to a more insidious memorisation of paraphrased content.

Exact deduplication: hashing and suffix arrays

For document-level exact deduplication the procedure is straightforward:

for doc in corpus:
    h = SHA256(normalise(doc))
    if h in seen_hashes:
        discard doc
    else:
        seen_hashes.add(h)

Normalisation before hashing matters enormously. Lowercasing, Unicode normalisation (NFKC), collapsing whitespace, and stripping HTML noise all reduce false negatives where two semantically identical documents have superficial byte differences. A Bloom filter replaces the hash set when the corpus is large enough that a Python set exceeds RAM: at a 1% false-positive rate, a Bloom filter for 50 billion hashes fits in roughly 72 GiB.

For substring-level exact deduplication the standard approach is a suffix array over the concatenated corpus. You sort all suffixes, then scan for adjacent suffixes of length >= threshold that share the same string. Documents containing flagged substrings are either removed or the repeated spans are masked. This catches the "boilerplate paragraph" problem that document hashing misses - two documents that are mostly unique but share a repeated legal disclaimer or news credit line.

Fuzzy deduplication: MinHash and locality-sensitive hashing

Near-duplicate detection at corpus scale needs a sub-quadratic algorithm. Comparing all pairs of a billion documents is infeasible; the standard solution is MinHash + Locality-Sensitive Hashing (LSH).

Step 1 - Shingling. Convert each document into a set of overlapping n-grams (typically word 5-grams). The Jaccard similarity between two documents' shingle sets is a good proxy for textual overlap.

Step 2 - MinHash signature. For each of k independent hash functions h_1...h_k, record the minimum hash value over all shingles in the document. This produces a signature vector of length k. The key property is:

P[ min_h(A) == min_h(B) ] = |A ∩ B| / |A ∪ B|  =  Jaccard(A, B)

So the fraction of matching elements in two signatures is an unbiased estimator of Jaccard similarity. With k = 128, variance is low enough to work reliably.

Step 3 - LSH banding. Divide the signature into b bands of r rows each (k = b * r). Two documents are declared candidate duplicates if their signatures agree in at least one band (all r rows of that band match). The probability that two documents with Jaccard similarity s are flagged as candidates is:

P(candidate) = 1 - (1 - s^r)^b

Choosing b and r sets the detection threshold curve. With b=20 and r=5 (k=100), the curve has a steep transition around s=0.5: documents with Jaccard >= 0.8 are caught with >99% probability, while documents with Jaccard <= 0.3 are almost never flagged. Candidate pairs are then either discarded as duplicates or passed to an exact Jaccard check for confirmation.

The whole pipeline scales to hundreds of billions of documents because the LSH step requires only linear passes and a hash-table lookup, not pairwise comparison.

Union-Find and connected-component removal

Once you have candidate duplicate pairs, you need a decision rule about what to keep. The standard approach is to build a graph where documents are nodes and candidate pairs are edges, then compute connected components using a Union-Find (disjoint set) structure. From each component you keep one representative (typically the longest, or the earliest crawl date) and discard the rest.

This matters for corpora assembled from multiple snapshots of the same web crawl. A document might not be a direct near-duplicate of any single other document, but it is part of a chain A ~ B ~ C where A and C would not have been flagged pairwise. Connected-component removal handles this.

FineWeb (Penedo et al., 2024), which processes 96 Common Crawl snapshots into 15 trillion tokens, describes MinHash deduplication at this scale as one of the pivotal choices that made the dataset competitive with curated mixes.

When it falls down

Aggressive Jaccard thresholds over-deduplicate. A threshold of 0.5 on word 5-grams will flag two Wikipedia articles covering different aspects of the same topic as duplicates if they share a lot of factual sentences. Multilingual corpora are particularly vulnerable: a French and an English version of the same article might share enough cognates to be flagged by a carelessly chosen threshold.

Short documents break MinHash. The Jaccard estimate is unreliable when the shingle set is tiny. A 20-word sentence has maybe 16 5-grams; the signature variance is too high to be useful. Short texts (social media posts, code snippets) need a different similarity measure or a minimum-length filter applied before deduplication.

Normalisation inconsistency across pipeline stages. If the normalisation applied at hash time differs from the normalisation the tokeniser applies later, you will mis-deduplicate. A common pitfall is deduplicating on byte-level SHA-256 but then normalising again during tokenisation, so "duplicates" you kept end up identical in token space.

Train-test contamination is a separate problem. Deduplicating the training corpus against itself does not prevent test set leakage. Lee et al. found that over 4% of standard validation sets overlap with deduplicated training data. You need a separate decontamination pass that compares training documents against each benchmark you intend to report on.

Suffix array scalability. Suffix array construction over a concatenated multi-terabyte corpus requires careful engineering - the naive O(n log n) sort uses more RAM than you have. The implementation in Lee et al. concatenates documents with sentinel characters and sorts with a cache-efficient algorithm, but it still requires on the order of 5 bytes per character of corpus text in working memory, which is non-trivial at trillion-token scale.

What counts as a duplicate

Exact deduplication: hashing and suffix arrays

Fuzzy deduplication: MinHash and locality-sensitive hashing

Union-Find and connected-component removal

When it falls down

Further reading