Data Provenance and Licensing

When researchers audited over 1,800 text datasets used to train and fine-tune language models, they found licence-omission rates above 70% and error rates above 50% on the most popular dataset-hosting platforms. The models people deploy today were almost certainly trained on data whose legal status nobody formally verified. That is not a niche compliance concern; it is a structural vulnerability in how the field builds foundation models.

What provenance actually means in practice

Provenance is the documented chain linking a piece of text in your training set back to its original source. A fully specified provenance record answers at least four questions:

Where did this text originate? (URL, publisher, database, synthetic generator)
What transformations were applied? (HTML stripping, language-ID filtering, deduplication, quality scoring)
What was the licence or terms of service at collection time?
Has that licence since changed? (Sites routinely tighten their terms after training runs complete.)

In practice, most web-crawl pipelines do not record all four. Each page in a Common Crawl snapshot is stored as a WARC record, which gives you the URL and a fetch timestamp, but nothing about the site's terms of service or whether the content is copyrighted, syndicated, or machine-generated. When you train on a 15TB filtered Common Crawl slice, you inherit that ambiguity at scale.
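The four questions can be made concrete as a per-document record. The following is a minimal sketch, not a standard schema: the class name, field names, and example values are all assumptions chosen for illustration. It shows how a record built from crawl metadata alone leaves the two licence questions unanswered.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProvenanceRecord:
    """One training document's provenance. All field names are illustrative."""
    source_url: str                                   # where the text originated
    fetched_at: str                                   # fetch timestamp from the crawl record
    transformations: list[str] = field(default_factory=list)  # pipeline steps applied
    licence_at_collection: Optional[str] = None       # licence or ToS when collected
    licence_last_checked: Optional[str] = None        # licence or ToS at the latest re-check

    def missing_answers(self) -> list[str]:
        """Return which of the four provenance questions are still unanswered."""
        missing = []
        if not self.source_url:
            missing.append("origin")
        if not self.transformations:
            missing.append("transformations")
        if self.licence_at_collection is None:
            missing.append("licence_at_collection")
        if self.licence_last_checked is None:
            missing.append("licence_since_changed")
        return missing


# A record built only from crawl metadata: URL and timestamp, nothing about licensing.
crawl_only = ProvenanceRecord(
    source_url="https://example.com/article",
    fetched_at="2023-01-15T08:30:00Z",
    transformations=["html_strip", "langid_en", "minhash_dedup"],
)
print(crawl_only.missing_answers())
```

Keeping the licence fields as explicit `None` values, rather than omitting them, makes the gap visible and countable across a corpus.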

The Data Provenance Initiative (Longpre et al., 2023) introduced a taxonomy of licence categories that is now a useful reference:

Licence class	Example	Commercial use	Derivative works
Public domain / CC0	Project Gutenberg pre-1928	Yes	Yes
Permissive open (CC-BY)	Open-access journals such as PLOS	Yes, with attribution	Yes
Non-commercial (CC-BY-NC)	Many academic corpora	No	Conditional
Share-alike (CC-BY-SA)	Wikipedia, Stack Exchange	Yes	Must re-share under same terms
No licence stated	The majority of the web	Jurisdiction-dependent	Unknown
All rights reserved / ToS-restricted	News sites, Reddit, Twitter/X	Contested	Contested

The last two rows cover most of the text on the internet. This is why legal proceedings against LLM developers have focused on these categories rather than on clearly licensed material.
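One way to act on the table is to encode its commercial-use column as a lookup and triage documents by licence class. This is a sketch under stated assumptions: the class labels, the `triage` function, and the bucket names are hypothetical, and treating a missing label the same as "no licence stated" is a design choice rather than a legal rule.

```python
from enum import Enum


class Commercial(Enum):
    YES = "yes"
    YES_WITH_ATTRIBUTION = "yes, with attribution"
    NO = "no"
    UNRESOLVED = "unresolved"  # jurisdiction-dependent or contested


# Commercial-use column of the table, keyed by hypothetical class labels.
COMMERCIAL_USE = {
    "public_domain": Commercial.YES,
    "cc_by": Commercial.YES_WITH_ATTRIBUTION,
    "cc_by_nc": Commercial.NO,
    "cc_by_sa": Commercial.YES,
    "no_licence": Commercial.UNRESOLVED,
    "tos_restricted": Commercial.UNRESOLVED,
}


def triage(docs: list[dict]) -> dict[str, list[str]]:
    """Bucket document IDs by what the table says about commercial use."""
    buckets = {"usable": [], "excluded": [], "needs_review": []}
    for doc in docs:
        # Unrecognised or missing labels are treated like "no licence stated".
        status = COMMERCIAL_USE.get(doc.get("licence_class"), Commercial.UNRESOLVED)
        if status in (Commercial.YES, Commercial.YES_WITH_ATTRIBUTION):
            buckets["usable"].append(doc["id"])
        elif status is Commercial.NO:
            buckets["excluded"].append(doc["id"])
        else:
            buckets["needs_review"].append(doc["id"])
    return buckets


sample = [
    {"id": "a", "licence_class": "cc_by"},
    {"id": "b", "licence_class": "cc_by_nc"},
    {"id": "c"},                                   # no label recorded at all
    {"id": "d", "licence_class": "tos_restricted"},
]
print(triage(sample))
```

On a real web crawl, the `needs_review` bucket would be expected to dominate, which is the point the last two table rows make.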

Why licences are hard to honour at corpus scale

Consider the practical workflow: you download a 70TB Common Crawl snapshot, run a language-identification model to keep English text, apply a quality filter (e.g., a perplexity threshold against a small reference LM), deduplicate with MinHash, and end up with roughly 3TB. That pipeline runs across dozens of machines over several days.
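The deduplication step can be illustrated with a small pure-Python MinHash. This is a sketch, not a production implementation: the function names, the 5-word shingle size, and the 64 hash functions are assumptions, and real pipelines typically add locality-sensitive hashing so that not every pair of documents has to be compared.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Word k-grams ("shingles") of a lowercased, whitespace-tokenised text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc = "the quick brown fox jumps over the lazy dog near the river bank"
other = "licence metadata is rarely recorded by web crawl pipelines at collection time"
sig_doc = minhash_signature(shingles(doc))
print(estimated_jaccard(sig_doc, minhash_signature(shingles(doc))))    # identical text
print(estimated_jaccard(sig_doc, minhash_signature(shingles(other))))  # unrelated text
```

Note what the signature is computed from: the text only. Nothing in this step reads or preserves licence information, which is why the licence question never arises in the standard toolchain.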

At no point in the standard toolchain does anyone query "what is the licence of this document?" The URL is retained as metadata, but licence information must be fetched separately, is often absent from the page itself, and changes over time. Some datasets attempt to address this:

The Pile (Gao et al., 2021) enumerated 22 sub-sources and stated the licence for each, but the overall licence is effectively the most restrictive sub-licence in the mix.
C4 (the Colossal Clean Crawled Corpus) retains source URLs but provides no licence metadata; Dodge et al. (2021) later documented that it contains content from patents, military sites, and NLP benchmark test sets, none of which were flagged at collection time.
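The "most restrictive sub-licence" rule can be sketched as a maximum over a restrictiveness ranking. The ranking, the sub-source names, and the `effective_licence` function below are assumptions for illustration. A single linear ordering is also a simplification, since share-alike and non-commercial terms constrain different things.

```python
# Hypothetical ranking: a higher number means more restrictive.
RESTRICTIVENESS = {
    "public_domain": 0,
    "cc_by": 1,
    "cc_by_sa": 2,
    "cc_by_nc": 3,
    "unknown": 4,
    "tos_restricted": 5,
}


def effective_licence(sub_sources: dict[str, str]) -> tuple[str, list[str]]:
    """Return the most restrictive licence and the sub-sources that impose it."""
    # Any licence label missing from the ranking is treated as "unknown".
    normalised = {
        name: (lic if lic in RESTRICTIVENESS else "unknown")
        for name, lic in sub_sources.items()
    }
    worst = max(normalised.values(), key=RESTRICTIVENESS.__getitem__)
    culprits = sorted(name for name, lic in normalised.items() if lic == worst)
    return worst, culprits


# A made-up four-source corpus, not the composition of any real dataset.
corpus = {
    "old_books": "public_domain",
    "papers": "cc_by",
    "encyclopedia": "cc_by_sa",
    "forum_dump": "tos_restricted",
}
print(effective_licence(corpus))
```

Returning the sub-sources responsible for the result shows which slice would have to be dropped to relax the overall terms.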

A related problem is licence drift. Reddit's terms of service and Twitter/X's API terms both changed in 2023. A crawl that was permitted under the 2021 terms might violate the current terms if it were repeated today. Models trained on 2021 data carry the 2021 terms, but the legal landscape around those terms keeps evolving.
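Drift can only be detected if something about the terms was stored at collection time. A minimal sketch, assuming the pipeline saved a hash of the terms text when it crawled: the function names and sample strings are hypothetical, and a changed hash only signals that the wording changed, not whether the change matters legally.

```python
import hashlib


def terms_fingerprint(terms_text: str) -> str:
    """Hash the terms text after collapsing whitespace and case differences."""
    normalised = " ".join(terms_text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_drifted(fingerprint_at_collection: str, current_terms_text: str) -> bool:
    """True if the terms text no longer matches what was recorded at crawl time."""
    return terms_fingerprint(current_terms_text) != fingerprint_at_collection


# Made-up terms text for illustration only.
stored = terms_fingerprint("Content may be accessed through the public API.")
reflowed = "Content may be   accessed\nthrough the public API."
tightened = "Content may not be used to train machine learning models."

print(has_drifted(stored, reflowed))   # layout change only
print(has_drifted(stored, tightened))  # wording change
```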

There is also the question of granularity mismatch. A Common Crawl page might be a news article (copyright held by the publisher), a user comment thread (copyright held by individual users, licensed to the platform under ToS), or a scraped copy of a GitHub README (MIT-licensed code but copyright still held by contributors). These three documents require three different legal treatments, and they all live in the same crawl shard.

Decontamination and its relationship to provenance

Provenance tracking becomes operationally necessary during decontamination, the process of removing evaluation benchmarks from training data. If you do not know which documents came from which source, you cannot reliably filter out BenchmarkX when it turns out that a data vendor included it in their corpus.
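The dependence on source labels can be shown in a few lines. In this sketch the `source_id` field, the `drop_source` function, and the sample documents are all hypothetical. Documents without a recorded source cannot be filtered this way, so they are kept but reported separately.

```python
def drop_source(docs: list[dict], source_id: str) -> tuple[list[dict], list[dict], list[dict]]:
    """Split docs into (kept, removed, unattributed) for a given source label."""
    kept, removed, unattributed = [], [], []
    for doc in docs:
        origin = doc.get("source_id")
        if origin == source_id:
            removed.append(doc)
            continue
        if origin is None:
            # No provenance: we cannot tell whether this came from the bad source.
            unattributed.append(doc)
        kept.append(doc)
    return kept, removed, unattributed


docs = [
    {"id": 1, "source_id": "vendor_a"},
    {"id": 2, "source_id": "common_crawl"},
    {"id": 3},                               # source never recorded
    {"id": 4, "source_id": "vendor_a"},
]
kept, removed, unattributed = drop_source(docs, "vendor_a")
print([d["id"] for d in kept], [d["id"] for d in removed], [d["id"] for d in unattributed])
```

The `unattributed` list is the residual risk: those documents stay in the corpus only because nothing is known about them.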

The standard decontamination approach is n-gram overlap: compute character-level or token-level n-grams for each benchmark example, then remove training documents whose overlap exceeds a threshold. GPT-4's technical report and LLaMA's paper both describe variants of this. But the approach requires you to know which benchmarks to decontaminate against, and that list grows over time. A corpus frozen in 2022 cannot be decontaminated against a benchmark released in 2024 without reprocessing from the provenance-annotated raw data.
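A minimal token-level sketch of n-gram overlap filtering follows. It is not the procedure from any specific paper: the function names, the 8-token window, and the rule of removing a document on a single shared n-gram are assumptions, and published variants differ in window size and threshold.

```python
import re


def token_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-token windows of a lowercased text, with punctuation dropped."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(examples: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Union of the n-grams of every benchmark example."""
    index: set[tuple[str, ...]] = set()
    for example in examples:
        index |= token_ngrams(example, n)
    return index


def decontaminate(docs: list[str], index: set[tuple[str, ...]],
                  n: int = 8, min_shared: int = 1) -> tuple[list[str], list[str]]:
    """Split docs into (clean, contaminated) by the number of shared n-grams."""
    clean, contaminated = [], []
    for doc in docs:
        shared = len(token_ngrams(doc, n) & index)
        (contaminated if shared >= min_shared else clean).append(doc)
    return clean, contaminated


benchmark = ["What is the capital of France? The capital of France is Paris."]
training_docs = [
    "Quiz answers: what is the capital of France? The capital of France is Paris, obviously.",
    "The weather in Lyon was mild for most of the week, with light rain on Friday.",
]
clean, contaminated = decontaminate(training_docs, build_benchmark_index(benchmark))
print(len(clean), len(contaminated))
```

The index is built from a fixed list of benchmarks, so rerunning the filter against a newly released benchmark means rebuilding the index and passing over the stored documents again.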

This is the reason serious pretraining pipelines now store the raw, pre-filter snapshots alongside the processed corpus: you need the full provenance chain to reprocess when new contamination is discovered.

Synthetic data and the provenance paradox

Recent pipelines increasingly use LLM-generated synthetic text, which introduces a novel provenance question. If GPT-4 generates 10 billion tokens of instruction-following data, what is the provenance of that data? The model itself was trained on copyrighted material. Several model providers explicitly prohibit using their outputs to train competing models (OpenAI's terms are the most widely cited example). This creates a dependency chain:

Web (mixed licences)
  → GPT-4 pretraining
    → GPT-4 outputs
      → Your fine-tuning data
        → Your model

The original licence terms of the web data flow, somewhat opaquely, through this chain. Whether that chain creates legal exposure is an open question in multiple jurisdictions. What is not open is that the provenance is real, even where it is hard to trace; you cannot simply assert that synthetic data has no provenance because it was generated rather than copied.
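The chain can be represented as a small lineage graph and walked upstream to list every constraint a model inherits. This is a sketch only: the node names mirror the diagram, while the `LINEAGE` and `CONSTRAINTS` tables and the constraint wording are illustrative assumptions, not legal conclusions.

```python
# Each artefact maps to the artefacts it was derived from (its parents).
LINEAGE = {
    "your_model": ["your_finetuning_data"],
    "your_finetuning_data": ["gpt4_outputs"],
    "gpt4_outputs": ["gpt4_pretraining"],
    "gpt4_pretraining": ["web"],
    "web": [],
}

# Constraints attached at the point where they originate (illustrative wording).
CONSTRAINTS = {
    "web": ["mixed licences"],
    "gpt4_outputs": ["provider terms restrict training competing models"],
}


def upstream_constraints(node: str) -> list[tuple[str, str]]:
    """Collect (origin, constraint) pairs from the node and all of its ancestors."""
    found, seen, stack = [], set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.extend((current, c) for c in CONSTRAINTS.get(current, []))
        stack.extend(LINEAGE.get(current, []))
    return found


for origin, constraint in upstream_constraints("your_model"):
    print(f"{origin}: {constraint}")
```

Synthetic data appears here as an ordinary node with parents of its own, which is the sense in which it has provenance.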

When it falls down

Retroactive licence changes. A site that was permissively licensed when crawled can change its terms after the fact. Your training corpus does not update; your legal exposure might.

Attribution at inference time. Some open licences (CC-BY, for example) require attribution when the work is reproduced. An LLM that closely reproduces licensed text at inference time arguably triggers attribution requirements, but the model has no mechanism to produce citations unless explicitly designed to do so.

Jurisdiction fragmentation. Fair use (US) and fair dealing (UK, Canada, Australia) diverge meaningfully for ML training. The EU's CDSM Directive contains two text and data mining (TDM) exceptions: Article 3 covers scientific research by research organisations and cultural heritage institutions, and Article 4 covers TDM more generally but lets rights-holders opt out. A corpus that is legally collectable in the US may be actionable in Germany, and vice versa. A globally deployed model is in effect exposed to whichever jurisdiction has the narrowest exception.

Opacity of sub-licensing. Many training corpora are assembled from other datasets (e.g., RedPajama assembles from C4, GitHub, Wikipedia, ArXiv, Books, StackExchange). Each layer inherits the most restrictive constraint from the previous layer, and those constraints are not always explicitly propagated in the dataset card.

Proxy labels are not provenance. A dataset card saying "sourced from Common Crawl" is not provenance; it is a category label. Real provenance requires per-document WARC identifiers, snapshot dates, and licence-at-collection-time records. Very few publicly released datasets meet this standard.
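A sketch of what a per-document record could look like, built from the header of a WARC record. `WARC-Record-ID`, `WARC-Target-URI`, and `WARC-Date` are standard WARC header fields. The function names, the output keys, the sample header, and the snapshot label are assumptions for illustration. The licence field has to come from somewhere else, because the WARC header does not carry it.

```python
from typing import Optional


def parse_warc_headers(header_block: str) -> dict[str, str]:
    """Parse 'Name: value' lines from a WARC record header into a dict."""
    fields = {}
    for line in header_block.strip().splitlines()[1:]:  # skip the "WARC/1.0" version line
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def document_provenance(header_block: str, snapshot: str,
                        licence_at_collection: Optional[str]) -> dict:
    """Combine WARC identifiers with a snapshot label and a separately sourced licence."""
    fields = parse_warc_headers(header_block)
    record = {
        "warc_record_id": fields.get("WARC-Record-ID"),
        "url": fields.get("WARC-Target-URI"),
        "fetched_at": fields.get("WARC-Date"),
        "snapshot": snapshot,
        "licence_at_collection": licence_at_collection,
    }
    # The record meets the standard described above only if nothing is missing.
    record["complete"] = all(v is not None for v in record.values())
    return record


# Made-up header for illustration; the UUID and URL are placeholders.
header = """WARC/1.0
WARC-Type: response
WARC-Date: 2023-01-15T08:30:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Target-URI: https://example.com/article
"""
print(document_provenance(header, "CC-MAIN-2023-06", None)["complete"])
print(document_provenance(header, "CC-MAIN-2023-06", "CC-BY-4.0")["complete"])
```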
