Stop fine-tuning for facts: the silent productivity tax of mismatched tools
May 29, 2026 · 3 min read
I have lost count of the teams I have watched fine-tune a model on their internal docs, ship it to production, and call me six months later asking why it has started confidently citing documents that no longer exist. The answer, every time, is the same: they used the wrong tool.
What fine-tuning actually does
Fine-tuning updates the weights of a pretrained model on your data. After training, the new behaviour is baked in. Inference is single-shot, fast, and stateless. This is the right call when you need one of three things:
- Output format. You want consistent JSON, a specific tone, a refusal style.
- Behaviour patterns. You want the model to extract certain fields from invoices, classify support tickets in your taxonomy, or follow your domain-specific style guide.
- Latency. You cannot afford the round-trip cost of retrieval.
Notice what is not on that list: facts.
Why facts are different
Facts have three properties that fine-tuning handles badly:
- They go stale.
- They have provenance.
- They are too numerous to embed gracefully in weights.
When you fine-tune a model on your knowledge base, you are encoding a snapshot. Your support docs get updated next month and your model still confidently quotes the old version. Worse, when the model hallucinates an answer that sounds like your docs but is not in them, you have no way to trace where it came from. The provenance is gone, dissolved into the weights.
Compounding the problem, fine-tuning encodes facts inefficiently. A 70B model at 16-bit precision uses ~140GB of weights to store its full pretraining knowledge. A 10MB knowledge base added on top amounts to less than a hundredth of a percent of that size - and it competes with whatever the model already knew. Catastrophic forgetting is real.
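A back-of-the-envelope check of those sizes, as a rough sketch. It assumes 16-bit weights (2 bytes per parameter) and decimal units; both are assumptions for illustration. Raw byte counts are only a crude proxy for how much a model can memorise.

```python
# Back-of-the-envelope check of the sizes quoted above.
# Assumptions: 16-bit weights (2 bytes per parameter), decimal GB/MB.
PARAMS = 70e9          # 70B parameters
BYTES_PER_PARAM = 2    # fp16/bf16 (assumed)
KB_BYTES = 10e6        # 10MB knowledge base

weight_bytes = PARAMS * BYTES_PER_PARAM
share = KB_BYTES / weight_bytes

print(f"weights: {weight_bytes / 1e9:.0f} GB")                 # weights: 140 GB
print(f"knowledge base as share of weights: {share:.4%}")      # 0.0071%
```

The knowledge base is tiny next to the weights, which is the point: the fine-tuning run has to push those facts in through gradient updates that touch parameters already in use.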
What retrieval does instead
Retrieval-augmented generation (RAG) keeps the model frozen. At inference time, you embed the user query, find the relevant chunks of your knowledge base, and inject them into the prompt as context. The model uses the context to answer.
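The three steps above (embed the query, find relevant chunks, inject them into the prompt) can be sketched in a few lines. This is a toy illustration, not a production pipeline: the `embed` function is a bag-of-words stand-in for a real embedding model, and the chunk IDs, texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Toy knowledge base. IDs and text are made up for illustration.
CHUNKS = {
    "refunds.md#1": "Refunds are issued within 14 days of a return request.",
    "shipping.md#2": "Standard shipping takes 3 to 5 business days.",
    "billing.md#4": "Invoices are sent on the first day of each month.",
}

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector.
    A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(CHUNKS.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, k: int = 2) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in retrieve(query, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long do refunds take?")
print(prompt)
```

For this query the refunds chunk ranks first because it is the only one sharing a token with the question. The resulting prompt is what gets sent to the frozen model; no weights change.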
The properties this gives you are exactly the ones fine-tuning lacks:
- Always fresh. Update the index, no retraining.
- Provenance preserved. Every answer can be traced back to the specific chunks it was given, which you can cite.
- Scalable. A million-document corpus fits in your vector database, not your weights.
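A minimal sketch of the first two properties, assuming a hypothetical in-memory `Index` class standing in for a vector database (keyword matching stands in for similarity search). Updating a document is an upsert, and every hit carries its source along with it.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str    # where the text came from, kept for citation
    version: int   # bumped on every update
    text: str

class Index:
    """Hypothetical in-memory index keyed by source; stands in for a vector database."""

    def __init__(self) -> None:
        self._chunks: dict[str, Chunk] = {}

    def upsert(self, source: str, text: str) -> Chunk:
        """Insert a chunk, or replace the existing one for this source."""
        prev = self._chunks.get(source)
        chunk = Chunk(source, prev.version + 1 if prev else 1, text)
        self._chunks[source] = chunk
        return chunk

    def search(self, keyword: str) -> list[Chunk]:
        # Keyword match stands in for vector similarity search.
        return [c for c in self._chunks.values() if keyword.lower() in c.text.lower()]

index = Index()
index.upsert("refunds.md", "Refunds are issued within 30 days.")
index.upsert("refunds.md", "Refunds are issued within 14 days.")  # doc edited; no retraining

for hit in index.search("refunds"):
    print(f"{hit.text} (source: {hit.source}, v{hit.version})")
```

After the second upsert, the old 30-day text is gone from the index, so it cannot be retrieved, and the surviving hit names the file it came from.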
The hybrid that actually wins
The mature production pattern is to use both, but for different jobs:
- Fine-tune the model for output shape, refusal behaviour, and persona. The behaviour that should be stable across all your data goes in the weights.
- Retrieve the facts. The content that changes goes in the index.
Tracking which is which is a discipline. Every quarter, audit your fine-tuning data and ask: did anything in here change? If yes, it belongs in retrieval, not in the next training run.
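One way to make that audit mechanical, sketched under assumptions: it supposes you recorded a content hash for each document at training time (the `trained_on` mapping below is hypothetical, as are the file names). Anything that has changed or disappeared since then is a candidate for moving to retrieval.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash used to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit(trained_on: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare fingerprints recorded at training time with the docs as they are today.

    trained_on:   document name -> fingerprint recorded when the model was fine-tuned
    current_docs: document name -> current text
    """
    changed = sorted(
        name for name, old_hash in trained_on.items()
        if name in current_docs and fingerprint(current_docs[name]) != old_hash
    )
    removed = sorted(name for name in trained_on if name not in current_docs)
    return {"changed": changed, "removed": removed}

# Illustrative data only.
trained_on = {
    "style_guide.md": fingerprint("Use sentence case in headings."),
    "pricing.md": fingerprint("Pro plan: $20 per month."),
    "old_faq.md": fingerprint("We ship to the EU only."),
}
current_docs = {
    "style_guide.md": "Use sentence case in headings.",
    "pricing.md": "Pro plan: $25 per month.",
}

report = audit(trained_on, current_docs)
print(report)  # {'changed': ['pricing.md'], 'removed': ['old_faq.md']}
```

In this example the style guide stayed stable, so it is a reasonable fit for the weights; the pricing page and the deleted FAQ are not.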
The decision tree
When a team brings me a new use case, I ask three questions:
- Does the content change month to month? If yes, retrieval.
- Do you need to cite sources? If yes, retrieval.
- Is the behaviour you want stable across your whole corpus? If yes, fine-tune.
The cases where pure fine-tuning wins are narrower than people think. Classification tasks with stable taxonomies. Style transfer. Refusal training. That is mostly it. Everything else is either retrieval or hybrid.
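The three questions can be written down as a small function. This is one possible reading, not a definitive rule: the function name and return labels are illustrative, and the fall-through case (all three answers are no) is an assumption the questions do not cover.

```python
def choose_approach(content_changes: bool, needs_citations: bool, stable_behaviour: bool) -> str:
    """Map the three questions to a recommendation."""
    needs_retrieval = content_changes or needs_citations
    if needs_retrieval and stable_behaviour:
        return "hybrid"
    if needs_retrieval:
        return "retrieval"
    if stable_behaviour:
        return "fine-tune"
    # Assumption: nothing above applies, so start by prompting the base model as-is.
    return "prompting only"

# A support bot over docs that change monthly, with a fixed house style:
print(choose_approach(content_changes=True, needs_citations=True, stable_behaviour=True))    # hybrid
# A ticket classifier with a stable taxonomy:
print(choose_approach(content_changes=False, needs_citations=False, stable_behaviour=True))  # fine-tune
```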
Why this matters now
Frontier models keep getting more capable, and the cost per token keeps falling. The leverage of getting your architecture right - of using fine-tuning for behaviour and retrieval for facts - is compounding. Teams that nail this in 2026 will ship features in days that took quarters in 2024. Teams that keep fine-tuning on facts will keep wondering why their model lies.
The most expensive mistake in modern LLM engineering is not picking the wrong model. It is picking the wrong tool for the job and then spending a year debugging the consequences.