Architectures & Scaling
Pretraining vs Instruction Tuning
How a base model becomes a chatbot. Three stages — pretraining, supervised fine-tuning, RLHF — each doing something specific.
intermediate · 9 min read
Three stages, three objectives
A modern chat model like Claude or GPT-4 didn't come from one training run. It went through three distinct phases:
1. Pretraining
The model is shown trillions of tokens of raw text and trained to predict the next token. This is unsupervised — no labels, just the corpus. The resulting base model can complete text but has no notion of "helpful assistant". Ask it a question and it might continue with more questions, because that's what its training data did.
2. Supervised Fine-Tuning (SFT)
Humans write thousands of high-quality (instruction, response) pairs. The model is trained to imitate them. After this it behaves like an assistant — answers questions, follows instructions — but its outputs may be confidently wrong or unsafe.
3. Reinforcement Learning from Human Feedback (RLHF)
Humans rank pairs of responses by preference. A reward model is trained on those rankings. The base model is then fine-tuned (typically with PPO or DPO) to maximise the reward model's score. This is what makes responses feel helpful — concise, accurate, refusing harmful requests.
Why this matters for users
- System prompts are how the SFT/RLHF phase taught the model to take role-conditioning seriously. Use them.
- Refusals come from RLHF — the model was rewarded for refusing certain content.
- "Hallucinations" are partly a pretraining artefact (the model was rewarded for plausible-sounding text) that RLHF only partly corrects.
DPO is the modern default
Direct Preference Optimization (Rafailov et al., 2023) skips the reward-model + PPO step by directly fitting preferences. It is simpler, more stable, and now the default at most labs.