Distillation and Terms-of-Service Constraints
Using a commercial API to generate training data for a competing model almost universally violates the provider's terms of service, and understanding exactly why - and what compliant alternatives exist - is non-negotiable before building any distillation pipeline.
Stanford released Alpaca in March 2023: a fine-tuned LLaMA model built on 52,000 instruction-response pairs generated by OpenAI's text-davinci-003 for less than $500. From the outset, the project carried a warning that models trained on the data "should not be used outside of research" - not because the fine-tuning failed, but because, among other reasons, the instruction data came from a model whose terms of use prohibit developing competing models. The capability worked. The legal basis for anything beyond research did not.
That tension sits at the heart of this concept. Distillation from a capable teacher is one of the most cost-effective techniques in synthetic data generation, and it is also the one most likely to expose a team to a contract breach, a forced takedown, or a reputational incident. Understanding the constraint is not a legal formality; it shapes what you can ship.
What the terms actually say
Every major commercial LLM provider includes a clause restricting use of model outputs for training competing systems. The precise wording varies, but the substance is consistent.
OpenAI's terms of use prohibit using output from its services to develop models that compete with OpenAI. Anthropic's Acceptable Use Policy prohibits "utilization of inputs and outputs to train an AI model (e.g., 'model scraping' or 'model distillation') without prior authorisation from Anthropic." The Stanford Alpaca authors acknowledged this directly in their release post, noting that the instruction data was derived from text-davinci-003, whose terms of use "prohibit developing models that compete with OpenAI."
The clause is not buried in footnotes. It is a first-level restriction in the same tier as prohibitions on illegal content. Providers have the technical and commercial motivation to enforce it: if every capable API call can be cheaply distilled into a competing open model, the provider's moat disappears.
A rough taxonomy of what these clauses cover:
| Action | Typically prohibited? | Notes |
|---|---|---|
| Fine-tuning on decoded API outputs | Yes | The Alpaca case; most providers explicitly name this |
| Using token log-probabilities as soft targets for KL distillation | Yes | Richer signal where the API exposes log-probabilities; same prohibition applies |
| Using outputs to build eval sets only | Varies | Some providers permit non-training research use; check the specific terms |
| Generating synthetic data from a self-hosted open-weight model | No | Permitted; the restriction is on the API provider's closed model |
| Using distilled data commercially | Yes, if base data was from a closed API | Downstream commercial use inherits the constraint |
The restriction attaches to the data, not to the model weights. If you generate 10 million pairs from GPT-4, train a model on them, and then continue training that model on other clean data, the original distillation-derived model is still in the derivative chain. Laundering through subsequent fine-tuning does not resolve the underlying provenance problem.
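One way to make the derivative chain concrete is to record lineage explicitly and walk it. The sketch below is a minimal illustration, not an established tool: the `Artifact` schema, the `restricted_source` flag, and all artifact names are assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A dataset or model checkpoint in the training lineage (hypothetical schema)."""
    name: str
    restricted_source: bool = False  # True if generated by a closed API
    parents: list[Artifact] = field(default_factory=list)


def restricted_ancestors(artifact: Artifact) -> list[str]:
    """Return the names of all restricted artifacts reachable through the lineage."""
    found, seen, stack = [], set(), [artifact]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.restricted_source:
            found.append(node.name)
        stack.extend(node.parents)
    return sorted(found)


# Hypothetical lineage: model_v2 continues training model_v1 on clean data.
api_pairs = Artifact("api_generated_pairs", restricted_source=True)
clean_data = Artifact("licensed_human_data")
model_v1 = Artifact("model_v1", parents=[api_pairs])
model_v2 = Artifact("model_v2", parents=[model_v1, clean_data])

print(restricted_ancestors(model_v2))
```

Continued training adds a parent to the graph but never removes one, which is why the restricted dataset still shows up for `model_v2`.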
Why open-weight models changed the landscape
Before Meta released LLaMA in February 2023, the only way to access frontier-class language model outputs was through a commercial API. The ToS constraint was therefore also a capability constraint: if you could not use GPT-3/4, you could not distil from a frontier model.
LLaMA and its successors (Mistral, Falcon, Qwen, Gemma, and others) loosened that coupling. These models can be run locally, and their terms of use vary substantially from provider to provider. Some, such as the Apache 2.0 releases of Mistral 7B and Falcon 40B, place no restriction on how outputs are used. Others do restrict it: the LLaMA 2 community licence, for example, prohibits using the model or its outputs to improve any other large language model, excluding LLaMA 2 and its derivatives. Distilling within a model family is therefore typically allowed even where distilling into an unrelated model is not.
This created a two-tier landscape:
- Closed API outputs (GPT-4, Claude, Gemini): Distillation to a competing model is prohibited without explicit authorisation.
- Open-weight model outputs (LLaMA, Mistral, Falcon): Generally permissible for research and, subject to the specific licence, for commercial use.
The practical implication is that teams building on top of open-weight teachers are on much firmer legal ground. Orca (Mukherjee et al., 2023) was trained on GPT-4 explanations, but it was a research project from Microsoft, which holds a commercial partnership with OpenAI that likely covers the data use. Independent teams without such agreements cannot replicate that pipeline for production use.
What distillation without a ToS violation looks like
The constraint does not eliminate distillation as a technique; it shifts the choice of teacher.
Use an open-weight teacher. A Llama 3 70B or Mixtral 8x22B run locally can generate instruction completions, reasoning traces, and chain-of-thought explanations for a smaller student model. The quality gap relative to GPT-4 is real but often acceptable, particularly in narrowly scoped domains where you can compensate with domain-specific seed prompts and careful quality filtering. Check the teacher's licence against the student you intend to train: the Llama 3 licence, for instance, only permits using outputs to improve Llama 3 models and their derivatives.
Seek explicit authorisation. Both OpenAI and Anthropic have enterprise agreements and research partnerships that can include distillation rights. If your use case justifies the negotiation overhead, this is a legitimate path. It is plausibly the route behind Microsoft's Orca and Phi-series models. Budget months, not days.
Generate from the data, not the model. Synthetic data pipelines that use an LLM to augment or transform human-authored source material - rather than generate completions wholesale - occupy a greyer but generally safer area. If a human wrote the original document and the model rewrites it into instruction format, the provenance is less clearly "model output." This is not a complete legal shield, but it substantially reduces exposure.
Distil from your own models. If you have trained or fine-tuned a model on data you own, you can distil from it freely. Progressive distillation - starting with a large fine-tuned model, then cascading down to smaller variants - is standard in production serving and carries no third-party ToS risk.
A minimal compliant pipeline looks like this:
```python
# Illustrative pseudocode: load_model, quality_filter, seed_prompts and
# MyCompetingModel are placeholders, not a real library API.

# Compliant: open-weight teacher, local inference
teacher = load_model("meta-llama/Meta-Llama-3-70B-Instruct", device="local")
student = load_model("meta-llama/Meta-Llama-3-8B", device="local")
training_pairs = []
for prompt in seed_prompts:
    response = teacher.generate(prompt, temperature=0.7)
    if quality_filter(response):
        training_pairs.append((prompt, response))
student.finetune(training_pairs)

# Non-compliant: closed API teacher, commercial product
teacher = openai.ChatCompletion  # OpenAI API (legacy pre-1.0 SDK interface)
student = MyCompetingModel()
training_pairs = []
for prompt in seed_prompts:
    response = teacher.create(model="gpt-4", messages=[...])
    text = response["choices"][0]["message"]["content"]
    training_pairs.append((prompt, text))  # ToS violation
student.finetune(training_pairs)  # pipeline built on prohibited data
```
The difference in code is little more than the choice of teacher. The compliance difference is significant.
Licence stacking and the derivative model problem
A separate but related constraint comes from model licences rather than API terms. Open-weight models often carry restrictive licences that flow through to derivative models.
LLaMA 2's licence requires any company with more than 700 million monthly active users to obtain a separate licence from Meta, a clause aimed squarely at large platforms. LLaMA 3 kept that threshold and added a requirement to display "Built with Meta Llama 3" on products that use it. Falcon 7B and 40B (from TII) are available under the Apache 2.0 licence, making them among the earliest truly permissive models in their class.
When you fine-tune or distil from a model, your derivative model inherits the base model's licence constraints unless the licence explicitly permits sublicensing. This means:
- If your teacher is LLaMA 2, your student model is subject to LLaMA 2's licence, including its commercial restrictions.
- If you distil into a model already derived from LLaMA, the chain is still bound by the original licence.
- Switching to a different architecture for the student does not break the chain if the training data came from a LLaMA-derived teacher.
The practical implication is that licence due diligence applies at every node in the pipeline: the teacher, the student's base weights, and any intermediate checkpoints used to generate data.
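A per-node check can be as simple as listing every model the pipeline touches and flagging any whose licence has not been reviewed. The following is a minimal sketch under stated assumptions: the `Node` structure, the model names, and the contents of `CLEARED_LICENCES` are all hypothetical, and the set stands in for whatever your own legal review has actually approved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    role: str     # "teacher", "student_base", or "checkpoint"
    model: str    # hypothetical model name
    licence: str  # licence identifier as recorded by the team


# Hypothetical policy table: licences this team has reviewed and cleared for
# its intended use. The entries are placeholders, not legal conclusions.
CLEARED_LICENCES = {"apache-2.0", "mit"}


def unreviewed_nodes(pipeline: list[Node]) -> list[Node]:
    """Return every pipeline node whose licence is not on the cleared list."""
    return [n for n in pipeline if n.licence.lower() not in CLEARED_LICENCES]


pipeline = [
    Node("teacher", "example-teacher-70b", "custom-community-licence"),
    Node("student_base", "example-student-7b", "Apache-2.0"),
    Node("checkpoint", "example-teacher-70b-ft", "custom-community-licence"),
]

for node in unreviewed_nodes(pipeline):
    print(f"needs legal review: {node.role} {node.model} ({node.licence})")
```

The check deliberately fails closed: anything not explicitly cleared is flagged, including intermediate checkpoints that are easy to forget.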
When it falls down
Ambiguous research exceptions. Many providers permit use of outputs for "research" without clearly defining where research ends and product development begins. A fine-tuned model published as a research artefact but then used in a commercial product is not obviously in either category. The Stanford Alpaca situation illustrates this: the research intent was genuine, but the model and data were downloaded and used commercially by third parties, which created downstream liability risk for the original authors.
Retroactive enforcement. Terms of service can change. A pipeline built on data generated from a provider that previously allowed model training may find itself out of compliance when the provider updates its terms. OpenAI tightened its usage policies multiple times between 2021 and 2024. Data provenance records matter here: knowing exactly when each batch of data was generated, under which version of the ToS, can determine whether a use was permissible at the time.
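A provenance record of this kind can be sketched as follows. The ToS history, version labels, dates, and provider name are invented for illustration; a real pipeline would populate the history from archived copies of the provider's actual terms.

```python
from __future__ import annotations

import bisect
from dataclasses import dataclass
from datetime import date

# Hypothetical ToS history for one provider: (effective date, version label),
# sorted by date. These values are placeholders, not real policy dates.
TOS_HISTORY = [
    (date(2022, 1, 1), "v1"),
    (date(2023, 3, 1), "v2"),
    (date(2024, 1, 15), "v3"),
]


@dataclass(frozen=True)
class BatchRecord:
    batch_id: str
    provider: str
    generated_on: date
    tos_version: str


def tos_version_on(day: date) -> str:
    """Return the ToS version in force on a given day."""
    effective_dates = [d for d, _ in TOS_HISTORY]
    idx = bisect.bisect_right(effective_dates, day) - 1
    if idx < 0:
        raise ValueError(f"no ToS version on record for {day}")
    return TOS_HISTORY[idx][1]


def record_batch(batch_id: str, provider: str, generated_on: date) -> BatchRecord:
    """Stamp a generated batch with the ToS version that applied at the time."""
    return BatchRecord(batch_id, provider, generated_on, tos_version_on(generated_on))


rec = record_batch("batch-0001", "example-provider", date(2023, 6, 10))
print(rec.tos_version)
```

Stamping the version at generation time, rather than reconstructing it later, is the design choice that matters: the record is only useful if it cannot drift as the terms change.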
Partial contamination. Large synthetic data pipelines often mix sources: some data from open-weight models, some from closed APIs (perhaps via a third-party dataset that itself used an API), and some human-authored. If a dataset used during training turns out to contain API-derived outputs, the contamination problem propagates to every model trained on it. This is the same provenance tracking problem as benchmark contamination, but with legal stakes added.
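If every record carries a source tag, a mixed dataset can at least be split before training. The sketch below assumes a hypothetical tagging vocabulary (`open_weight_model`, `human_authored`, `closed_api`) and treats any record with a missing or unrecognised tag as suspect.

```python
from collections import Counter

# Hypothetical source tags; a real pipeline would define its own vocabulary.
ALLOWED_SOURCES = {"open_weight_model", "human_authored"}


def split_by_provenance(records):
    """Split records into (clean, quarantined) based on their 'source' tag.

    Records with a missing or unrecognised tag are quarantined.
    """
    clean, quarantined = [], []
    for rec in records:
        if rec.get("source") in ALLOWED_SOURCES:
            clean.append(rec)
        else:
            quarantined.append(rec)
    return clean, quarantined


records = [
    {"id": 1, "source": "open_weight_model"},
    {"id": 2, "source": "closed_api"},
    {"id": 3, "source": "human_authored"},
    {"id": 4},  # third-party data, origin unknown
]

clean, quarantined = split_by_provenance(records)
print(Counter(r.get("source", "unknown") for r in quarantined))
```

Quarantining untagged records is the conservative choice here: a third-party dataset with no stated origin is exactly where API-derived outputs tend to enter unnoticed.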
Open-weight does not mean open-data. A model like GPT-J was trained on data whose copyright status is contested. Using that model to generate training data does not resolve the underlying copyright problem in the pretraining corpus - it just moves it one level of indirection away. Copyright law and terms of service are separate instruments, and the former applies regardless of what the latter says.
The "it's just fine-tuning" misconception. Some teams reason that because they are only fine-tuning a student model - not pre-training it - the provenance of the fine-tuning data matters less. This is incorrect. Fine-tuning on prohibited data is still fine-tuning on prohibited data; the training procedure does not change the status of the underlying data.
Further reading
- Taori, R. et al. (2023). "Alpaca: A Strong, Replicable Instruction-Following Model." Stanford CRFM Blog. The original Alpaca release, including the acknowledgement of ToS constraints: https://crfm.stanford.edu/2023/03/13/alpaca.html
- Anthropic (2024). "Anthropic Acceptable Use Policy." The verbatim distillation prohibition clause: https://www.anthropic.com/legal/aup
- Mukherjee, S. et al. (2023). "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." arXiv:2306.02707. An example of GPT-4 distillation conducted by a team whose commercial partnership likely covered the data use: https://arxiv.org/abs/2306.02707
- Gudibande, A. et al. (2023). "The False Promise of Imitating Proprietary LLMs." arXiv:2305.15717. Empirical evidence that stylistic imitation does not transfer genuine reasoning capability, undermining the core motivation for ToS-violating distillation: https://arxiv.org/abs/2305.15717