Pretraining vs Instruction Tuning

Three stages, three objectives

A modern chat model like Claude or GPT-4 didn't come from one training run. It went through three distinct phases:

1. Pretraining

The model is shown trillions of tokens of raw text and trained to predict the next token. This is self-supervised: there are no human-written labels, because the next token in the corpus serves as the target. The resulting base model can complete text but has no notion of "helpful assistant". Ask it a question and it might continue with more questions, because that is a plausible continuation in its training data.
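To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. The function names, the three-token vocabulary, and the uniform "model" are all illustrative assumptions; a real setup would compute the same quantity over batches with a tensor library.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the scores over the vocabulary after
    reading token_ids[: t + 1]. The target for position t is
    token_ids[t + 1], so the targets are the input shifted by one.
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy data: a vocabulary of 3 tokens and a stand-in "model"
# that assigns equal scores to every token at every position.
token_ids = [0, 2, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0] for _ in token_ids[:-1]]
loss = next_token_loss(uniform_logits, token_ids)
```

With uniform scores the loss equals the log of the vocabulary size, which is the baseline a model has to beat. Training lowers this number by shifting probability towards the tokens that actually follow.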

2. Supervised Fine-Tuning (SFT)

Humans write thousands of high-quality (instruction, response) pairs. The model is trained to imitate them. After this it behaves like an assistant — answers questions, follows instructions — but its outputs may be confidently wrong or unsafe.
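The SFT objective is the same next-token loss, applied to curated pairs. One common convention, assumed here and not stated above, is to count the loss only on response tokens so the model learns to answer the instruction but not to reproduce it. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
import math

def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs[i] is the model's log-probability of the i-th target
    token; is_response[i] is True when that token belongs to the
    human-written response and False when it belongs to the instruction.
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("example contains no response tokens")
    return sum(picked) / len(picked)

# Hypothetical example: 3 instruction tokens followed by 2 response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.3),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = sft_loss(log_probs, mask)
```

The instruction tokens still shape the prediction, since the model conditions on them, but under this convention they contribute nothing to the loss.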

3. Reinforcement Learning from Human Feedback (RLHF)

Humans rank pairs of responses by preference. A reward model is trained on those rankings. The SFT model (not the raw base model) is then fine-tuned with a reinforcement learning algorithm, typically PPO, to maximise the reward model's score. This is a large part of what makes responses feel helpful: concise, more careful about accuracy, and refusing harmful requests.
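The reward model step can be sketched with a pairwise loss. The version below assumes the widely used Bradley-Terry formulation, in which the loss is the negative log-sigmoid of the score gap between the preferred and the rejected response. The function name and the scores are illustrative.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) stably.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    order wrong.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores for one ranked pair of responses.
agrees_with_human = pairwise_reward_loss(2.0, 0.0)
disagrees_with_human = pairwise_reward_loss(0.0, 2.0)
undecided = pairwise_reward_loss(1.0, 1.0)
```

Only the difference between the two scores matters, which is why rankings are enough and annotators never need to assign absolute scores.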

Why this matters for users

System prompts work because SFT and RLHF trained the model to take role-conditioning seriously. Use them.
Refusals come largely from RLHF, and often from SFT examples as well: the model was rewarded for declining certain requests.
"Hallucinations" are partly a pretraining artefact (the objective favours plausible-sounding text, not verified facts) that RLHF only partly corrects.

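DPO: a simpler alternative to PPO

Direct Preference Optimization (Rafailov et al., 2023) skips the separate reward model and the PPO step by fitting the policy directly to preference pairs with a classification-style loss. It is simpler to implement and generally more stable to train, and it has become a common choice for preference tuning, though not a universal one.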
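For a single preference pair, the DPO loss from the paper can be sketched as follows. The inputs are summed log-probabilities of each full response under the model being trained (the policy) and under a frozen reference copy, usually the SFT model. The function name, the default beta of 0.1, and the numbers are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each *_lp argument is the total log-probability of a response.
    beta controls how strongly the policy is pushed away from the
    reference; 0.1 is only an illustrative default.
    """
    # How much more likely each response became relative to the reference.
    chosen_shift = policy_chosen_lp - ref_chosen_lp
    rejected_shift = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), branched to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one preference pair.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
prefers_chosen = dpo_loss(-10.0, -17.0, -12.0, -15.0)
prefers_rejected = dpo_loss(-14.0, -13.0, -12.0, -15.0)
```

The loss has the same shape as a pairwise reward-model loss, with the policy's log-probability shifts standing in for reward scores. That substitution is what removes the need for a separately trained reward model.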