Deduplication and Memorisation

A 61-word English sentence appeared over 60,000 times in C4, the dataset underlying many T5-family models. Every one of those repetitions taught the model that this specific string was worth committing to memory. The consequence: a model trained on the corpus is far more likely to reproduce the sentence verbatim when prompted, regardless of whether the repetition carries any semantic signal. This is the deduplication-memorisation nexus in miniature.

What memorisation actually means in this context

A language model "memorises" a training string if, given a short prefix of that string as a prompt, the model completes it exactly. Carlini et al. (2021) formalised this as k-eidetic memorisation: a string is k-eidetically memorised if it can be extracted from the model and appears in at most k training documents. Memorisation at low k (even k=1) is troubling because it shows the model has stored a specific document rather than learned a general pattern.
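The definition can be made concrete with a small check. This is a minimal sketch, not the procedure from Carlini et al.: it works on characters rather than tokens, tests a single fixed prefix length, and uses a hypothetical `toy_generate` lookup in place of a real model. All function names here are illustrative assumptions.

```python
from typing import Callable, Sequence


def count_occurrences(corpus: Sequence[str], s: str) -> int:
    """Number of training documents that contain s verbatim."""
    return sum(1 for doc in corpus if s in doc)


def is_k_eidetic(
    generate: Callable[[str, int], str],
    corpus: Sequence[str],
    s: str,
    prefix_len: int,
    k: int,
) -> bool:
    """True if the model completes s exactly from its prefix
    and s occurs in at least 1 and at most k training documents."""
    prefix, suffix = s[:prefix_len], s[prefix_len:]
    reproduced = generate(prefix, len(suffix)) == suffix
    return reproduced and 1 <= count_occurrences(corpus, s) <= k


# Hypothetical stand-in for a real model: a lookup that has "memorised" one string.
MEMORISED = "the quick brown fox jumps over the lazy dog"


def toy_generate(prefix: str, n_chars: int) -> str:
    if MEMORISED.startswith(prefix):
        return MEMORISED[len(prefix):len(prefix) + n_chars]
    return ""


corpus = [MEMORISED, "an unrelated document"]
print(is_k_eidetic(toy_generate, corpus, MEMORISED, prefix_len=10, k=1))  # True
```

With a real model, `generate` would wrap greedy decoding, and the corpus count would come from an index rather than a linear scan.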

Three factors drive how much a string gets memorised:

Factor	Effect on memorisation
Number of duplicates	Log-linear: each doubling of the repetition count adds a roughly constant increment to the extraction rate
Model size	Larger models memorise more absolute text, not less
Context length at inference	Longer prompts surface more memorised completions

The direction of the model-size finding is counterintuitive. A GPT-3-scale model does not "spread" memorisation thinly across more parameters; it has enough capacity to memorise a larger fraction of the training corpus outright. Carlini et al. (2022) measured a log-linear relationship between model size and memorisation across the GPT-Neo model family, from 125M to 6B parameters, and found no sign of saturation.
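To spell out what "log-linear" implies, here is a schematic form of the relationship. The symbols are assumptions introduced for illustration: p(n) is the fraction of strings with n duplicates that can be extracted, and alpha and beta are hypothetical fitted constants, not values reported in the paper. The form can only hold over the range where p(n) stays between 0 and 1.

```latex
p(n) \approx \alpha + \beta \log_2 n
\quad\Longrightarrow\quad
p(2n) - p(n) \approx \beta
```

Each doubling of n adds the same increment beta. The same shape applies with model size or prompt length in place of n.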

Why duplicates accumulate

Web-scraped corpora inherit duplication from the web itself. Several sources compound this:

Boilerplate propagation. Legal footers, navigation menus, and cookie-consent banners are copied to millions of pages verbatim.
Mirror sites and content syndication. A single article from a major publication often exists on dozens of aggregators.
Near-identical templates. E-commerce product descriptions, job postings, and real-estate listings share 90%+ of their tokens across thousands of documents.
Crawl overlap across time. Common Crawl takes monthly snapshots; the same page crawled in January and again in March yields two near-identical documents that survive independent quality filters.

When a multi-epoch training strategy is used (or when documents are repeated across mixture components), the effective count can rise further. A document that appears three times in the corpus and is trained on for two epochs has a repetition count of six from the model's perspective.
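The arithmetic above can be written as a short helper. This is a sketch under the simplifying assumption that every epoch visits every copy exactly once; the function name is illustrative.

```python
from collections import Counter


def effective_repetitions(corpus: list[str], epochs: int) -> dict[str, int]:
    """Times the model sees each distinct document: copies in corpus x epochs."""
    copies = Counter(corpus)
    return {doc: n * epochs for doc, n in copies.items()}


corpus = ["doc A", "doc B", "doc A", "doc A"]
print(effective_repetitions(corpus, epochs=2))  # {'doc A': 6, 'doc B': 2}
```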

How deduplication breaks the memorisation feedback loop

Lee et al. (2022) demonstrated the empirical link directly. After exact and near-deduplication of the training corpora:

Models emitted memorised text ten times less frequently on extraction-attack probes.
Perplexity on held-out data improved, confirming that duplicates were consuming capacity without adding generalisation signal.
Training required fewer steps to reach equivalent accuracy, because every gradient update carried novel signal.

The mechanism is straightforward. A string seen N times contributes N gradient updates per epoch, each pushing the model toward its exact reproduction. Remove N-1 copies, and the model sees one example, whose loss competes equally with the millions of other unique training examples. The memorisation pressure on that string drops sharply, although not all the way to zero.

There is a subtle secondary effect worth noting: duplicate removal also reduces train-test contamination. If a benchmark question appears verbatim in the training corpus, a model can achieve high benchmark accuracy by retrieval rather than reasoning. Deduplicating the pretraining corpus against benchmark validation sets (decontamination) collapses this shortcut. Lee et al. found that train-test overlap affected over 4% of the validation sets of the standard datasets they examined.
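One common way to approximate decontamination is an n-gram overlap check. The sketch below is an assumption-laden illustration, not the method used by Lee et al.: it splits on whitespace, lowercases, and flags any benchmark item that shares at least one n-gram with the corpus. The function names and the default n=8 are hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(benchmark_items: list[str], corpus_docs: list[str], n: int = 8) -> list[str]:
    """Benchmark items that share at least one n-gram with any corpus document."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]


corpus_docs = ["trivia page: what is the capital of france? paris of course"]
benchmark = [
    "What is the capital of France?",
    "Name the largest planet in the solar system",
]
print(contaminated(benchmark, corpus_docs, n=4))
# ['What is the capital of France?']
```

Items shorter than n words are never flagged, so n has to be chosen with the benchmark's typical item length in mind.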

Deduplication methods and their coverage

Three families of method are in common use, with different precision-recall trade-offs:

Exact substring matching via suffix arrays. Constructs a suffix array over the entire tokenised corpus, then scans for any substring of at least L tokens shared between two documents. Highly precise; computationally expensive at web scale. Introduced as ExactSubstr by Lee et al. (2022), who used a threshold of 50 tokens.
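The idea can be shown on a toy corpus. This sketch builds a naive suffix array by sorting suffixes directly, which is far too slow for real data and is only meant to show the structure; production tools use linear-time construction. It compares adjacent suffixes only, which is enough to flag each duplicated span at least once but does not enumerate every document pair. The function name and the per-document sentinel tokens are assumptions made for the example.

```python
def shared_spans(docs: list[str], min_len: int) -> dict[tuple[int, int], str]:
    """Longest shared token span (>= min_len tokens) found per document pair."""
    tokens: list[str] = []
    owner: list[int] = []  # owner[i] = index of the document that token i came from
    for doc_id, doc in enumerate(docs):
        for tok in doc.split():
            tokens.append(tok)
            owner.append(doc_id)
        # Unique sentinel so a match can never run across a document boundary.
        tokens.append(f"\x00{doc_id}")
        owner.append(doc_id)

    # Naive suffix array: start positions sorted by the suffix they begin.
    suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    best: dict[tuple[int, int], tuple[str, ...]] = {}
    for a, b in zip(suffix_array, suffix_array[1:]):
        if owner[a] == owner[b]:
            continue  # only cross-document repeats are of interest here
        k = 0  # length of the common prefix of the two suffixes
        while (a + k < len(tokens) and b + k < len(tokens)
               and tokens[a + k] == tokens[b + k]):
            k += 1
        pair = (min(owner[a], owner[b]), max(owner[a], owner[b]))
        if k >= min_len and k > len(best.get(pair, ())):
            best[pair] = tuple(tokens[a:a + k])
    return {pair: " ".join(span) for pair, span in best.items()}


docs = [
    "alpha beta the shared legal footer text gamma",
    "delta the shared legal footer text epsilon zeta",
    "nothing in common here",
]
print(shared_spans(docs, min_len=5))
# {(0, 1): 'the shared legal footer text'}
```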

MinHash + LSH (fuzzy/document-level). Represents each document as a set of n-grams, hashes them with multiple hash functions, and identifies documents whose Jaccard similarity exceeds a threshold (typically 0.8). Scales to hundreds of billions of tokens in hours on a CPU cluster. Produces false positives (legitimately similar documents removed) but covers near-duplicates that exact methods miss.
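A compact, standard-library-only sketch of the MinHash + LSH idea follows. It is an illustration under stated assumptions, not a production pipeline: salted BLAKE2b hashes stand in for a hash family, shingles are word 3-grams, and the 128-hash, 32-band layout is one plausible configuration for a threshold near 0.8. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of a document (at least one shingle, even for short text)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def _hash(shingle: str, seed: int) -> int:
    # One salted 64-bit hash per seed stands in for a family of hash functions.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set[str], num_perm: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash over the document's shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def near_duplicate_pairs(docs: list[str], threshold: float = 0.8,
                         num_perm: int = 128, bands: int = 32):
    """Pairs (i, j, estimated Jaccard) whose estimate meets the threshold."""
    rows = num_perm // bands
    sigs = [minhash_signature(shingles(d), num_perm) for d in docs]

    # LSH banding: documents that agree on every row of some band become candidates.
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(ids, 2))

    # Verify candidates: fraction of matching signature slots estimates Jaccard.
    result = []
    for i, j in sorted(candidates):
        estimate = sum(x == y for x, y in zip(sigs[i], sigs[j])) / num_perm
        if estimate >= threshold:
            result.append((i, j, estimate))
    return result


docs = [
    "the cat sat on the mat and looked out of the window at the rain",
    "the cat sat on the mat and looked out of the window at the rain today",
    "a completely different text about training language models on web data",
]
print(near_duplicate_pairs(docs))
```

The first two documents have a true Jaccard similarity of 13/14 on word 3-grams, so they should normally be reported, but the estimate is probabilistic and depends on the hash family. Banding avoids comparing every pair of signatures, which is what makes the approach viable at corpus scale.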

Semantic deduplication. Embeds documents with a pretrained encoder, clusters by cosine similarity, and retains one representative per cluster. D4 (Tirumala et al., 2023) showed that combining MinHash deduplication with this diversification step produced a 20% training efficiency gain and up to 2 points of downstream accuracy improvement at 6.7B scale.
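The core of the idea, keeping one representative among documents whose embeddings are nearly parallel, can be sketched as follows. This is a deliberate simplification and not the D4 procedure: a greedy single pass replaces clustering, the vectors are toy values rather than encoder outputs, and the 0.95 threshold is an arbitrary illustrative choice.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Indices of documents kept: a document is dropped if it is at least
    `threshold`-similar to a document that was already kept."""
    kept: list[int] = []
    for idx, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy 2-d "embeddings"; a real pipeline would obtain these from a pretrained encoder.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(semantic_dedup(embeddings))  # [0, 2]
```

The greedy pass is quadratic in the number of kept documents, which is one reason real systems cluster first and compare only within clusters.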

In practice, most large-scale pipelines apply MinHash deduplication first (cheap, high recall for near-duplicates), then optionally exact suffix-array deduplication for a second pass targeted at verbatim substrings.

When it falls down

Cross-document substring memorisation survives document-level dedup. MinHash operates at the document level; it removes documents that are globally similar. But a paragraph that is copied across thousands of documents with surrounding variation will not trigger the similarity threshold. Each document looks unique at the Jaccard level while sharing a memorisable substring. Only token- or suffix-level methods catch this.
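A small worked example shows how far below the threshold such documents can fall. The documents here are synthetic and the helper names are illustrative: each document embeds the same 10-word paragraph inside 40 words of unique filler.

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


shared = "this exact paragraph is copied verbatim across many different pages"


def make_doc(tag: str) -> str:
    """20 unique filler words, the shared paragraph, then 20 more unique words."""
    before = " ".join(f"{tag}{i}" for i in range(20))
    after = " ".join(f"{tag}{i}" for i in range(20, 40))
    return f"{before} {shared} {after}"


doc_a, doc_b = make_doc("a"), make_doc("b")
similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(round(similarity, 3))  # 0.091
print(shared in doc_a and shared in doc_b)  # True
```

Each document has 48 shingles, of which only the 8 lying wholly inside the shared paragraph match, giving 8/88. A 0.8 threshold would never pair these documents even though both contain the same memorisable span.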

Aggressive deduplication degrades low-resource content. If a language has few native-language documents, near-duplicates may be the only training signal available. Deduplication at a global similarity threshold designed for English will disproportionately remove content from morphologically rich or low-resource languages.

MinHash thresholds are not universal. A Jaccard threshold of 0.8 that works well for news articles may incorrectly deduplicate structured formats like code (where small edits are semantically significant) or legal templates (where near-identical structure is expected but content differs). Threshold selection needs per-domain calibration.

Deduplication does not eliminate privacy risk entirely. A document containing PII that appears only once survives deduplication intact. Memorisation of singletons still occurs at sufficient model scale and with long enough prompts (Carlini et al., 2021 demonstrated singleton extraction from GPT-2). Deduplication reduces risk; it does not substitute for PII scrubbing.

Order of operations matters. Deduplicating before quality filtering is cheaper (fewer documents reach the slower classifier pass), but duplicates can reappear if separately processed subsets, such as different crawl snapshots or mixture components, each retain a copy of the same document and are merged afterwards. Running deduplication last, over the merged corpus, is slower but gives a stronger guarantee.
