MinHash and LSH for Deduplication
MinHash estimates the Jaccard similarity between two documents from a fixed-size signature per document, and LSH bucketing turns that estimate into a sub-linear nearest-neighbour search, making corpus-scale near-deduplication tractable.
Common Crawl's April 2023 snapshot contains roughly 3.1 billion web pages. A naive all-pairs similarity check over even a million-document subset requires n(n-1)/2, or about 5 × 10^11, comparisons (roughly half a trillion), and the count grows quadratically with corpus size. That arithmetic is why every serious LLM pretraining pipeline, from RefinedWeb to FineWeb to Dolma, uses MinHash combined with Locality-Sensitive Hashing (LSH) to reduce the quadratic problem to something closer to linear.
The consequences of skipping this step are well-documented. Lee et al. (2021) showed that language models trained on deduplicated data memorise less verbatim content and can reach comparable perplexity with fewer training tokens, because the optimiser is not wasting gradient steps re-learning content it has already seen hundreds of times.
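To make the MinHash-plus-LSH combination concrete, here is a minimal sketch in plain Python. It is illustrative only: the function names, the 5-word shingles, the 128-value signature and the 32 × 4 banding are assumptions chosen for the example, not the settings of RefinedWeb, FineWeb, Dolma or any other pipeline named above.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128             # signature length (assumed value)
BANDS = 32                 # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS   # signature rows per band
PRIME = (1 << 61) - 1      # Mersenne prime for the hash family

# One (a, b) pair per simulated permutation: h -> (a * h + b) mod PRIME.
_rng = random.Random(0)
COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
          for _ in range(NUM_PERM)]


def shingles(text, k=5):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k])
            for i in range(max(1, len(words) - k + 1))}


def base_hash(shingle):
    """Map a shingle to a 64-bit integer."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set):
    """Keep the minimum hash value under each simulated permutation."""
    hashes = [base_hash(s) for s in shingle_set]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in COEFFS]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Bucket documents band by band; only bucket-mates become candidates."""
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the old river bank today",
    "b": "the quick brown fox jumps over the lazy dog near the old river bank yesterday",
    "c": "quarterly revenue grew across every market segment despite rising logistics costs",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
print(candidate_pairs(sigs))
print(estimated_jaccard(sigs["a"], sigs["b"]))
```

Only documents that share at least one band bucket are ever compared, which is where the saving over all-pairs comparison comes from. Under this banding scheme, the probability that two documents with Jaccard similarity s become candidates is 1 - (1 - s^ROWS)^BANDS, so the choice of bands and rows sets the effective similarity threshold.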
Why exact deduplication is not enough