Document-Level vs Token-Level Deduplication

Document-level deduplication removes whole near-duplicate pages using hashing, while token-level deduplication removes repeated spans within and across documents using suffix arrays, each with distinct cost-quality trade-offs in web-scale corpus construction.

intermediate · 7 min read

A single 61-word English sentence appeared over 60,000 times in the C4 dataset. When a model trains on that sentence tens of thousands of times, it tends to memorise it verbatim rather than learn anything new from it. That is the problem deduplication solves, and the choice of which level to deduplicate at determines what kinds of redundancy you actually remove.

The two dominant approaches sit at opposite ends of a granularity spectrum. Document-level deduplication treats each document as an atomic unit and asks "have I seen something like this before?" Token-level deduplication operates on sequences of tokens inside documents and asks "have I seen this exact span before?" Both matter, and most production pipelines run some combination of the two.

Document-Level Deduplication: Fuzzy Hashing at Scale

The standard document-level tool is MinHash with Locality-Sensitive Hashing (LSH). The algorithm represents each document as a set of n-gram shingles (character 5-grams in the sketch below; many pipelines, FineWeb included, shingle on word 5-grams instead), computes a MinHash signature by taking the minimum hash value under each of many independent hash functions, and splits the signature into bands. Documents whose signatures agree on every hash in at least one band land in the same bucket and become candidate near-duplicates. The number of bands and the number of hashes per band together set the effective similarity threshold. From each cluster of near-duplicates, one copy is kept and the rest are discarded.

# Conceptual MinHash pipeline for a single document
shingles  = {text[i:i+5] for i in range(len(text)-4)}  # character 5-grams
signature = [min(h(s) for s in shingles) for h in hash_functions]
# Bucket by bands of the signature - documents in the same bucket are candidates
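
The conceptual snippet above leaves the hash family and the banding step undefined. The following is a minimal runnable sketch of both, not a production implementation. It assumes character 5-gram shingles, a seeded BLAKE2b hash standing in for the family of independent hash functions, and the 14 x 8 band layout discussed in the next paragraph. The names `shingles`, `seeded_hash`, `minhash_signature`, and `candidate_pairs` are illustrative, not taken from any library.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_BANDS = 14       # bands per signature (assumed layout)
ROWS_PER_BAND = 8    # hashes per band; 14 * 8 = 112 hash functions


def shingles(text, n=5):
    """Return the set of character n-grams in text."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(seed, shingle):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """Minimum hash value of the shingle set under each hash function."""
    grams = shingles(text)
    return [
        min(seeded_hash(seed, g) for g in grams)
        for seed in range(NUM_BANDS * ROWS_PER_BAND)
    ]


def candidate_pairs(docs):
    """docs maps doc_id -> text. Returns pairs that agree on at least one band."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for band in range(NUM_BANDS):
            rows = tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND])
            buckets[(band, rows)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Candidate pairs are only candidates: a real pipeline would usually confirm them (for example by comparing full signatures) and then cluster them before choosing which copy to keep. Enumerating all pairs in a bucket is also quadratic in bucket size, which is acceptable for a toy example but not for large duplicate clusters.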

The FineWeb dataset (Penedo et al., 2024) applied MinHash with 112 hash functions split into 14 bands of 8, targeting documents that are at least 75% similar. One critical finding: global deduplication across all 96 Common Crawl snapshots hurt downstream model quality. It removed more than 90% of the data in the oldest snapshots, and the small fraction that survived there was, in the authors' ablations, lower quality than what had been removed. Deduplicating each snapshot independently retained about 20 trillion tokens and matched or beat the globally deduplicated baselines. The takeaway is that large clusters of thousands of near-identical documents are the real threat; removing small clusters of 2-3 copies adds little and can discard genuinely useful recurrences of quality content.
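
To see how a band layout translates into a similarity threshold, the standard LSH banding formula helps: if two documents have Jaccard similarity s, each MinHash value matches with probability s, a band of r hashes matches with probability s^r, and at least one of b bands matches with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula for the 14 x 8 layout; the function name is illustrative.

```python
def candidate_probability(similarity, bands=14, rows=8):
    """Probability that two documents with the given Jaccard similarity
    agree on at least one band of their MinHash signatures."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


for s in (0.5, 0.7, 0.75, 0.8, 0.9):
    print(f"similarity {s:.2f} -> candidate probability {candidate_probability(s):.3f}")
```

At a similarity of 0.75 the probability is about 0.77, so the threshold is soft: some pairs above it are missed and some below it are caught. More rows per band sharpen the curve and push the threshold up; more bands push it down.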

Exact document hashing (MD5 or SHA-256 of the full text) is cheaper but rigid: it catches only bitwise-identical duplicates. Boilerplate text with minor variations, articles scraped with different header/footer templates, and forum posts with differing signatures all evade it.
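
A minimal sketch of exact document hashing, assuming the corpus fits in a Python iterable and that the first copy encountered is the one to keep. The function name `exact_dedup` is illustrative.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each distinct SHA-256 digest of its text."""
    seen = set()
    kept = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

The rigidity is easy to demonstrate: `exact_dedup(["a b", "a b", "a  b"])` drops the second string but keeps the third, because a single extra space changes the digest.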

| Method | Granularity | Similarity | Cost | What it misses |
|---|---|---|---|---|
| Exact hash | Document | 100% | Very low | Near-duplicates with minor edits |
| MinHash + LSH | Document | Configurable (~70-90%) | Moderate | Intra-document repeated spans |
| Suffix array | Token span | Exact substring | High (Rust/C++) | Fuzzy near-duplicates |

Token-Level Deduplication: Suffix Arrays and Repeated Spans

Token-level (or sequence-level) deduplication finds repeated contiguous spans across the entire corpus, not just repeated whole documents. The canonical implementation from Lee et al. (2022), which they call ExactSubstr, builds a suffix array over the tokenised corpus: an array of every suffix start position, sorted lexicographically by the suffix that begins there. Because sorting places suffixes with a shared prefix next to each other, a single linear scan over adjacent entries finds every suffix that shares a long prefix with its neighbour, which means that exact substring occurs at least twice in the corpus.

# Conceptual suffix array scan
corpus_bytes = concatenate(all_documents, separator_token)
SA = build_suffix_array(corpus_bytes)          # O(n log n) or O(n)
for i in range(len(SA) - 1):
    lcp = longest_common_prefix(SA[i], SA[i+1])
    if lcp >= THRESHOLD:                        # e.g. 50 tokens
        mark_span(SA[i], lcp)                   # one copy kept, rest dropped

This catches redundancy that document-level methods miss entirely. A boilerplate legal disclaimer repeated verbatim in thousands of otherwise-unique documents will be deduplicated: each host document survives, but all except one occurrence of the repeated span is cut out. Lee et al. found that models trained on deduplicated data emitted memorised training text roughly 10x less often than models trained on the original data.
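
The scan can be tried on toy data with a deliberately naive suffix array. This is a sketch under stated assumptions, not the Lee et al. implementation: token ids are assumed to be non-negative integers, each document is terminated by a unique negative separator so matches cannot cross document boundaries, and the suffix array is built by plain sorting, which is far too slow for real corpora. All function names are illustrative.

```python
def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix they begin.
    Only suitable for toy inputs."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(tokens, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while (i + n < len(tokens) and j + n < len(tokens)
           and tokens[i + n] == tokens[j + n]):
        n += 1
    return n


def repeated_spans(documents, threshold):
    """Return sorted (start, length) spans, in the concatenated token stream,
    whose first `length` tokens also occur at another position."""
    tokens = []
    for doc_index, doc in enumerate(documents):
        tokens.extend(doc)
        tokens.append(-(doc_index + 1))  # unique separator per document
    sa = build_suffix_array(tokens)
    spans = set()
    for a, b in zip(sa, sa[1:]):
        lcp = common_prefix_len(tokens, a, b)
        if lcp >= threshold:
            spans.add((a, lcp))
            spans.add((b, lcp))
    return sorted(spans)
```

For example, `repeated_spans([[1, 2, 3, 4, 9], [7, 1, 2, 3, 4]], threshold=4)` reports the span `1 2 3 4` at both of its positions. The function only reports occurrences; deciding which one to keep, and merging the overlapping spans that a long repeat produces, is left to the caller.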

The cost is high. A suffix array is several times larger than the corpus it indexes, so building one over hundreds of billions of tokens needs hundreds of gigabytes to terabytes of memory or disk-backed storage, plus significant wall-clock time. The Google Research implementation (available at github.com/google-research/deduplicate-text-datasets) is written in Rust for this reason, and even then it is run once, as a batch job over a corpus that has already been deduplicated at document level, not as a streaming filter.

Why Both Levels Matter in Practice

Document-level MinHash is cheap enough to run as an early pipeline stage across petabyte-scale crawls. It eliminates the grossest redundancy: scraped duplicates, mirror sites, RSS feed re-posts. Token-level suffix deduplication is a later, more expensive pass that catches the structural repetitions document-level hashing never sees - standard disclaimers, templated content blocks, footer boilerplate, and copyright notices that appear across millions of different source documents.

The D4 paper (Tirumala et al., 2023) demonstrated that layering intelligent data selection on top of MinHash deduplication yielded 20% training efficiency gains at 6.7B parameter scale. The key point is that deduplication and data selection are complementary: deduplication is a necessary baseline, not a substitute for quality filtering.

A rough operational sequence in a production pipeline:

  1. URL normalisation and exact-URL deduplication (nearly free).
  2. Per-snapshot MinHash deduplication (catches same-content multi-URL copies).
  3. Suffix array deduplication on the remainder (catches repeated spans across the corpus).
  4. Quality filtering, language identification, and removal of unwanted content (after deduplication, so filters operate on the cleaner set).
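
Step 1 of the sequence above can be sketched with the standard library. Which normalisations are safe depends on the sites being crawled, so every rule here is an assumption made for illustration: lowercasing scheme and host, dropping default ports, fragments, and trailing slashes, sorting query parameters, and removing parameters that start with `utm_`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumed list of tracking-parameter prefixes


def normalise_url(url):
    """Map URLs that differ only in case, default port, fragment, trailing
    slash, query order, or tracking parameters to one canonical string."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ))
    return urlunsplit((scheme, host, path, query, ""))  # empty fragment


def dedup_by_url(records):
    """records is an iterable of (url, text); keep the first text per canonical URL."""
    seen = set()
    kept = []
    for url, text in records:
        key = normalise_url(url)
        if key not in seen:
            seen.add(key)
            kept.append((url, text))
    return kept
```

This only removes re-crawls of the same address. The same content served from a different URL passes straight through, which is what the MinHash stage in step 2 is for.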

When It Falls Down

Multilingual corpora. MinHash on n-grams cannot detect the same content in translation. A Wikipedia article and its Spanish equivalent share almost no n-grams, so neither hashing nor substring matching flags them, yet they carry substantially overlapping knowledge. Neither document-level nor token-level deduplication addresses semantic redundancy.

Legitimate repetition. Some text genuinely benefits from repeated exposure: mathematical derivations, code syntax, genre-specific conventions. Aggressive deduplication can remove exactly the signal the model needs to learn consistent formatting. FineWeb's finding that per-snapshot (not global) deduplication performed better may be partly explained by this: global deduplication was removing valid repetitions of quality text, not just noise.

Suffix array memory pressure. For a 1 trillion token corpus stored as 2-byte tokens, the token stream alone is 2 TB. A suffix array that indexes every byte offset with an 8-byte entry then needs roughly 2 × 10^12 × 8 bytes = 16 TB, on top of the corpus itself. Practical implementations shard the corpus and deduplicate within and across shards in multiple passes; depending on the implementation, this can miss duplicates whose copies fall in different shards.

Temporal bias. When deduplicating chronological crawls, keeping "one copy" means keeping whichever copy the pipeline encounters first, which depends on processing order. If that order favours older copies and those are lower quality (draft text, scraped before editorial revision), the retained copy may be worse than what was removed.

Decontamination is a separate problem. Deduplication removes redundancy in the training set. It does not remove benchmark evaluation examples that accidentally appear in the corpus. That requires explicit decontamination (exact match or near-match against held-out test sets) as a separate pipeline stage.
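
A minimal sketch of the exact-match variant, assuming whitespace tokenisation, lowercasing, and a word n-gram overlap test. The window size `n=13` is an illustrative default, not a recommendation from the sources above, and the function names are hypothetical.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text, using whitespace tokenisation."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_examples, n=13):
    """Collect every n-gram that appears in any held-out benchmark example."""
    index = set()
    for example in benchmark_examples:
        index |= word_ngrams(example, n)
    return index


def is_contaminated(document, benchmark_index, n=13):
    """True if the document shares at least one n-gram with the benchmark index."""
    return not word_ngrams(document, n).isdisjoint(benchmark_index)
```

Benchmark examples shorter than `n` words produce no n-grams and are never matched, so short test items need a smaller window or a different check. Paraphrased or reformatted test items also slip past exact n-gram matching, which is why near-match methods exist.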

Further Reading

  • Lee et al. (2022), "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499) - the foundational study on suffix-array and MinHash deduplication effects on memorisation and validation loss.
  • Penedo et al. (2024), "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale" (arXiv:2406.17557) - detailed ablations of per-snapshot vs global MinHash deduplication at 15+ trillion token scale.
  • Tirumala et al. (2023), "D4: Improving LLM Pretraining via Document De-Duplication and Diversification" (arXiv:2308.12284) - shows deduplication combined with embedding-based selection outperforms naive filtering.
  • Google Research, deduplicate-text-datasets (github.com/google-research/deduplicate-text-datasets) - open Rust implementation of suffix-array exact deduplication used in the Lee et al. experiments.