PII Detection and Removal
Scrubbing personally identifiable information from web-scale corpora before LLM pretraining reduces memorisation risk and legal exposure, but every detection method trades recall against corpus damage.
Carlini et al. (2021) queried GPT-2 and recovered verbatim phone numbers, names, and email addresses from its training data, including sequences that appeared in only a single training document. The model had not been asked for this information; it surfaced because the pretraining objective rewards predicting the next token accurately, and a rare but distinctive sequence fits that objective just as well as a syntactic pattern does. That experiment helped shift PII handling from a compliance checkbox to a first-class data-pipeline concern for labs building foundation models.
Why the risk is structural, not incidental
A Common Crawl snapshot or a GitHub dump contains personally identifiable information not because scrapers are careless but because people publish it: resumes on personal sites, email addresses in forum threads, phone numbers in classifieds, social-security fragments in leaked documents that found their way into archive.org. The volume is large enough that even aggressive deduplication can still leave millions of PII sequences in a 3-trillion-token corpus.
The danger compounds with scale. Carlini et al. (2022) showed that memorisation scales roughly as a log-linear function of model size: larger models memorise more sequences, and sequences that appear multiple times are memorised at disproportionately higher rates. This means the standard argument "the model won't remember a single occurrence" fails for large enough models, and "deduplicate first" only partially fixes the problem because some PII-bearing documents are genuinely unique.
The legal dimension is distinct from the safety one. GDPR Article 17 grants data subjects a right to erasure; if a model has memorised their data, "erasure" becomes a model-retraining exercise, not a database DELETE. Children's personal data is subject to stricter rules in several jurisdictions (COPPA in the United States and Article 8 of the GDPR are two examples). These constraints motivate scrubbing before training rather than trying to patch the model post hoc.
Detection strategies: from patterns to models
PII detection in corpora is a named-entity recognition (NER) problem with unusual requirements. A production web-scale scrubber must process hundreds of billions of tokens at low cost; it cannot afford the latency of a large transformer on every document.
Regex and heuristic rules remain the first line of defence. Most pipelines maintain hand-curated patterns for well-formatted PII:
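# Example pattern family (simplified)
EMAIL = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"
# (?<!\w) rather than a leading \b, so that an opening "+1" or "(" is part of the match
US_PHONE = r"(?<!\w)(?:\+1[\s\-]?)?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{4}\b"
SSN = r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"
# Matches the dotted-quad shape only; octets above 255 are not rejected
IP_V4 = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"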
These patterns are cheap (a single linear pass over the text), easy to audit, and handle the bulk of structured PII. They have two limitations. Precision suffers because the SSN pattern above also matches part numbers, order IDs, and any other digits grouped 3-2-4. Recall drops to near zero for PII that has no syntactic structure: "call Bob at the office on Fifth" contains nothing a pattern can anchor on, yet it names a person and points to a location.
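Both limitations show up on a single sentence. The following is a minimal sketch that reuses the simplified EMAIL and SSN expressions from above; the `find_pii` helper and the sample text are illustrative assumptions, not part of any particular pipeline.

```python
import re

# Simplified patterns from the text; production variants are stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"),
    "SSN": re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    hits.sort()
    return [(label, span) for _, label, span in hits]


sample = (
    "Mail jo@example.org about part 123-45-6789, "
    "or call Bob at the office on Fifth."
)
print(find_pii(sample))
# [('EMAIL', 'jo@example.org'), ('SSN', '123-45-6789')]
```

The part number is reported as an SSN (a false positive), while "Bob" and "the office on Fifth" are not reported at all (a miss).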
Lightweight NER models cover unstructured PII. A distilled BERT-class sequence labeller (e.g., DistilBERT, with 6 layers and ~66M parameters) trained on annotated news and web text can tag PERSON, LOCATION, ORG, and similar entity types at a cost well above regex but well below a full-size LLM; actual throughput depends on hardware, sequence length, and quantisation, so it should be measured rather than assumed. Not every pipeline pays that cost: the AllenAI Dolma toolkit relies on regular expressions to find email addresses, phone numbers, and IP addresses across its 3-trillion-token corpus, with Soldaini et al. (2024) noting that model-based detection was too costly and error-prone at that scale.
Blocklist matching handles a specific sub-problem: known-bad identifiers. If a scraper has ingested a breach dump and the SHA-256 of each record can be indexed, any document containing those exact strings can be dropped. Blocklists are fast and have zero false-positive rate for exact matches, but they only cover known leaks and miss paraphrased or partial disclosures.
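A minimal sketch of hash-based blocklist matching, assuming the blocklist covers email addresses and that identifiers are normalised (trimmed and lower-cased) before hashing. The `BLOCKLIST` contents, the normalisation rule, and the `blocklist_hit` name are hypothetical.

```python
import hashlib
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")


def sha256_hex(value: str) -> str:
    """Hash a normalised identifier so the blocklist never stores raw PII."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical blocklist: hashes of identifiers known to come from a leak.
BLOCKLIST = {sha256_hex("leaked.user@example.com")}


def blocklist_hit(doc: str) -> bool:
    """True if any email-shaped candidate in the document is on the blocklist."""
    return any(sha256_hex(m.group()) in BLOCKLIST for m in EMAIL.finditer(doc))


print(blocklist_hit("Reach me at Leaked.User@example.com today"))  # True
print(blocklist_hit("Reach me at someone.else@example.com"))       # False
```

The lookup only fires on strings that a candidate extractor (here the EMAIL pattern) surfaces in the first place, which is one concrete reason paraphrased or partial disclosures slip through.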
Large-model verification is used selectively at quality-control time rather than in the hot path. A pipeline might sample 0.1% of documents, run GPT-class inference to flag residual PII, and use the results to calibrate the earlier fast stages. This is the "offline auditor" pattern: expensive but accurate, used to measure recall of the cheap filter rather than to perform the scrubbing itself.
The typical pipeline combines these in a cascade:
for doc in corpus:
    if blocklist_hit(doc):            # check the raw text; O(1) hash lookup per candidate
        drop(doc)
        continue                      # a dropped document must not be emitted
    doc = apply_regex_redaction(doc)  # fast, structured PII only
    doc = apply_ner_redaction(doc)    # moderate cost, broader coverage
    emit(doc)
Redaction can mean full token deletion, replacement with a typed sentinel ([EMAIL], [PHONE]), or replacement with a synthetic value drawn from a distribution that preserves local grammar. Deletion is simplest; sentinel replacement is preferred because it preserves document structure for downstream quality filters; synthetic replacement is occasionally used to maintain readability at the cost of introducing fabricated data into the training set (a different kind of risk).
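The difference between deletion and sentinel replacement is easy to see on one sentence. This is a minimal sketch using the simplified EMAIL and SSN patterns; the `redact` function and its `mode` argument are illustrative names, and synthetic replacement is left out because it needs a value generator.

```python
import re

PATTERNS = [
    ("EMAIL", re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")),
]


def redact(text: str, mode: str = "sentinel") -> str:
    """Replace each match with a typed sentinel, or delete it outright."""
    if mode not in ("sentinel", "delete"):
        raise ValueError(f"unknown mode: {mode}")
    for label, pattern in PATTERNS:
        replacement = f"[{label}]" if mode == "sentinel" else ""
        text = pattern.sub(replacement, text)
    return text


sentence = "Write to jo@example.org today."
print(redact(sentence, "sentinel"))  # Write to [EMAIL] today.
print(redact(sentence, "delete"))    # Write to  today.
```

Deletion leaves a gap in the sentence, whereas the sentinel keeps a placeholder that downstream filters and the model can treat as a distinct token type.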
Interaction with deduplication
Deduplication and PII removal are often described as independent pipeline stages, but their ordering matters. If deduplication runs first, PII that appeared in many copies of the same forum post survives in only one or a few copies. That lowers its repetition count, which is what drives memorisation, and leaves the scrubber less text to process, but whatever the scrubber then misses still reaches training. If PII scrubbing runs first, copies that were redacted inconsistently (a name caught in one copy and missed in another) may no longer match each other, which lowers recall for MinHash- or SimHash-based deduplication. Some large-scale pipelines, FineWeb among them, apply PII scrubbing after deduplication, relying on deduplication to cut the repetition of whatever PII the scrubber later misses.
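MinHash estimates the Jaccard similarity between shingle sets, so the effect of scrubbing on near-duplicate matching can be illustrated with exact Jaccard on word 3-grams. The sketch below assumes a hypothetical scrubber that caught a name in one copy of a post and missed it in the other; the strings and the shingle size are illustrative.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping k-word shingles."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Exact Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


original = "contact Jordan at the main office for details"
copy_scrubbed = "contact [NAME] at the main office for details"  # name caught
copy_missed = original                                            # name missed

print(jaccard(original, original))          # 1.0
print(jaccard(copy_scrubbed, copy_missed))  # 0.5
```

On a document this short a single inconsistent redaction halves the similarity; on longer documents the drop is smaller, but it can still push a pair below a deduplication threshold.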
When it falls down
High-recall scrubbing damages fluency. An aggressive NER model with a low confidence threshold redacts tokens that are names only in some contexts, and it redacts public figures along with private individuals ("Cook announced..." loses the CEO reference; "May decided..." loses a prime minister; "Jordan played..." loses a basketball player). Downstream, the model learns corrupted n-gram statistics around sentinel tokens, and performance on tasks involving named entities degrades. The Lukas et al. (2023) study found that even sentence-level differential privacy, which is stronger than redaction, still leaks roughly 3% of PII sequences, suggesting that marginal increases in scrubbing aggressiveness yield diminishing privacy returns while compounding fluency costs.
Regex patterns are locale-dependent. A US-phone-number pattern misses Indian, German, and Chinese formats. A pipeline tuned on English-language Common Crawl and then applied to a multilingual corpus will under-scrub systematically in non-English documents. Building locale-aware pattern sets is an ongoing maintenance burden.
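One way to organise locale-aware patterns is a registry keyed by locale tag. The sketch below is an illustration under stated assumptions: the `en-IN` pattern assumes 10-digit mobile numbers starting with 6 to 9, optionally prefixed with +91 and written either contiguously or in 5-5 grouping, and the registry layout and `find_phones` helper are hypothetical.

```python
import re

# Hypothetical registry; each entry is a simplified, unvalidated pattern.
PHONE_PATTERNS = {
    "en-US": re.compile(
        r"(?<!\w)(?:\+1[\s\-]?)?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{4}\b"
    ),
    "en-IN": re.compile(
        r"(?<!\d)(?:\+91[\s\-]?)?[6-9]\d{4}[\s\-]?\d{5}(?!\d)"
    ),
}


def find_phones(text: str, locales: list[str]) -> list[str]:
    """Collect phone-shaped matches for every requested locale."""
    return [
        match.group()
        for locale in locales
        for match in PHONE_PATTERNS[locale].finditer(text)
    ]


text = "Call +91 98765 43210 after six."
print(find_phones(text, ["en-US"]))           # []
print(find_phones(text, ["en-US", "en-IN"]))  # ['+91 98765 43210']
```

The US-only configuration silently returns nothing for the Indian number, which is the under-scrubbing failure described above.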
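PII in code is structurally different. Strings that match an email pattern in source code are often author headers, test fixtures, or SSH remotes such as git@github.com, and some are not addresses at all: the Python expression W@x.reshape(-1, 1) contains a match for the EMAIL pattern above. A scrubber that redacts every EMAIL match in code will break valid expressions, test harnesses, and documentation strings. Code-aware scrubbing requires at minimum a step that detects source code and its programming language before any patterns are applied.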
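For files already identified as Python, one possible approach is to restrict redaction to string literals and comments using the standard `tokenize` module. This is a sketch rather than a complete solution: the `redact_python_source` name is hypothetical, multi-line strings are skipped, and f-string handling depends on the Python version.

```python
import io
import re
import tokenize

EMAIL = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")


def redact_python_source(source: str) -> str:
    """Redact email matches only inside single-line strings and comments."""
    lines = source.splitlines(keepends=True)
    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        single_line = tok.start[0] == tok.end[0]
        if tok.type in (tokenize.STRING, tokenize.COMMENT) and single_line:
            spans.append((tok.start[0] - 1, tok.start[1], tok.end[1]))
    # Apply edits right to left so earlier column offsets stay valid.
    for row, col_start, col_end in sorted(spans, reverse=True):
        line = lines[row]
        scrubbed = EMAIL.sub("[EMAIL]", line[col_start:col_end])
        lines[row] = line[:col_start] + scrubbed + line[col_end:]
    return "".join(lines)


source = (
    'AUTHOR = "jo@example.org"  # ping jo@example.org\n'
    "y = W@x.reshape(-1, 1)\n"
)
print(redact_python_source(source))
print(EMAIL.sub("[EMAIL]", source))  # naive version also rewrites the matmul
```

The token-aware version leaves the matrix-multiplication expression untouched, while the naive substitution replaces `W@x.reshape` with a sentinel and breaks the line.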
Indirect identification survives scrubbing. Removing a name and phone number from a document does not remove quasi-identifiers: Sweeney (2000) estimated that five-digit ZIP code, gender, and full date of birth together uniquely identify 87% of the US population. A scrubber that targets only direct PII categories leaves such quasi-identifier combinations intact. Addressing this requires either much heavier redaction (which destroys utility) or a training regime that applies differential privacy at the gradient step, as DP-SGD does.
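The re-identification risk from quasi-identifiers can be estimated by counting how many records are unique on a given combination of fields. The sketch below uses four made-up records; the field names and the `unique_fraction` helper are illustrative assumptions.

```python
from collections import Counter

# Made-up records with no direct identifiers left in them.
records = [
    {"zip": "02139", "birth_date": "1984-03-02", "sex": "F"},
    {"zip": "02139", "birth_date": "1984-03-02", "sex": "F"},
    {"zip": "02139", "birth_date": "1991-07-15", "sex": "M"},
    {"zip": "94110", "birth_date": "1975-11-30", "sex": "F"},
]


def unique_fraction(rows: list[dict], keys: tuple[str, ...]) -> float:
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


print(unique_fraction(records, ("zip", "birth_date", "sex")))  # 0.5
print(unique_fraction(records, ("zip",)))                      # 0.25
```

Each field on its own singles out few records, but the combination singles out more, which is why removing direct identifiers alone does not prevent re-identification.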
Evaluation is hard. There is no single ground-truth PII corpus for web-scale text. Recall is measured by injecting synthetic PII into documents and checking whether the scrubber catches it; but synthetic PII may not reflect the distribution of naturally occurring PII in crawl data. Precision is measured by human review of redacted spans, which is expensive and cannot scale to trillions of tokens.
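The injection-based recall measurement described above can be sketched as follows. The scrubber here is a stand-in that only applies the simplified EMAIL pattern, and the documents, canary strings, and `injection_recall` helper are all hypothetical.

```python
import random
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")


def scrub(text: str) -> str:
    """Stand-in scrubber: regex email redaction only."""
    return EMAIL.sub("[EMAIL]", text)


def injection_recall(docs: list[str], canaries: list[str], rng: random.Random) -> float:
    """Insert one canary per document and report the fraction that is removed."""
    caught = 0
    for doc, canary in zip(docs, canaries):
        words = doc.split()
        words.insert(rng.randrange(len(words) + 1), canary)
        if canary not in scrub(" ".join(words)):
            caught += 1
    return caught / len(canaries)


docs = [
    "the meeting moved to thursday",
    "please review the attached draft",
    "lunch is at noon",
]
canaries = [
    "amy@example.com",
    "bob@example.org",
    "carol at example dot net",  # obfuscated form the pattern cannot see
]
print(injection_recall(docs, canaries, random.Random(0)))
```

Two of the three canaries are caught, and the obfuscated address is missed. The estimate is only as good as the canary set, which is the distribution-mismatch problem noted above.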
Further reading
- Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security. https://arxiv.org/abs/2012.07805
- Carlini, N. et al. (2022). "Quantifying Memorization Across Neural Language Models." ICLR 2023. https://arxiv.org/abs/2202.07646
- Lukas, N. et al. (2023). "Analyzing Leakage of Personally Identifiable Information in Language Models." IEEE S&P. https://arxiv.org/abs/2302.00539
- Soldaini, L. et al. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." https://arxiv.org/abs/2402.00159