MinHash and LSH for Deduplication

CommonCrawl's April 2023 snapshot contains roughly 3.1 billion web pages. A naive pairwise similarity check against even a million-document subset requires trillions of comparisons. That arithmetic is why every serious LLM pretraining pipeline - from RefinedWeb to FineWeb to Dolma - uses MinHash combined with Locality-Sensitive Hashing (LSH) to reduce that quadratic problem to something closer to linear.

The consequences of skipping this step are well-documented. Lee et al. (2021) showed that language models trained on deduplicated data memorise less verbatim content and can reach comparable perplexity with fewer training tokens, because the optimiser is not wasting gradient steps re-learning content it has already seen hundreds of times.

Why exact deduplication is not enough

Exact URL or MD5-hash deduplication removes nothing but byte-for-byte copies. In practice, web data is littered with near-duplicates: the same news story scraped by fifty aggregators, a Wikipedia article mirrored across thousands of educational sites, a Stack Overflow answer republished with the ads stripped. These are textually distinct enough to pass exact deduplication, yet semantically redundant.

The right notion of similarity for this task is the Jaccard coefficient over token n-gram sets:

J(A, B) = |A ∩ B| / |A ∪ B|

Two documents with Jaccard similarity above a chosen threshold (commonly 0.7 to 0.8) are considered near-duplicates, and one of the pair is discarded. Computing this exactly for every pair is O(n²) in the number of documents, which is infeasible at pretraining scale.

MinHash: compressing a document into a fixed-length sketch

MinHash, due to Broder (1997), reduces each document to a short integer vector called a signature while preserving Jaccard similarity in expectation.

The core idea. Represent a document as the set of its k-gram shingles (e.g., character 5-grams or word unigrams/bigrams). Apply h independent hash functions to each shingle, and record the minimum hash value produced by each function across all shingles in the document. The resulting vector of h minimum values is the MinHash signature.

The key property: the probability that two documents agree on the minimum value for any given hash function equals their Jaccard similarity.

P[ min_h(A) = min_h(B) ] = J(A, B)

So the fraction of positions where two signatures agree is an unbiased estimator of Jaccard similarity. With h = 128 or 256 hash functions, the estimation variance is small enough for practical deduplication thresholds.

Critically, signature generation is O(|document| × h) and the resulting signature is a fixed-length vector of h integers regardless of document length. You can represent the entire 3.1-billion-page CommonCrawl as a matrix of fixed-width rows.

A minimal Python sketch using the datasketch library:

```python
from datasketch import MinHash

def make_signature(text: str, num_perm: int = 128) -> MinHash:
m = MinHash(num_perm=num_perm)
for shingle in get_shingles(text, k=5): # 5-char shingles
m.update(shingle.encode("utf-8"))
return m

Why exact deduplication is not enough

MinHash: compressing a document into a fixed-length sketch

Keep reading with Pro.