The Post-Training Pipeline

GPT-3 answered questions. InstructGPT, fine-tuned from the same GPT-3 base models, was preferred by human raters in the large majority of comparisons, and even the 1.3B-parameter InstructGPT, with roughly 100x fewer parameters, was preferred over the 175B GPT-3. The difference was not architecture or scale: it was post-training. That single result, published by Ouyang et al. in 2022, helped make the post-training pipeline one of the most commercially consequential innovations in modern NLP.

This concept maps each stage of that pipeline: what it does, why that ordering matters, and where it quietly fails.

Stage 1: Supervised Fine-Tuning (SFT)

A pretrained base model is a distribution over plausible next tokens. Left to itself, it will complete "What is the capital of France?" with "What is the capital of Germany? What is the capital of Spain? ..." because that is the structure of web documents. SFT breaks this pattern.

In SFT you collect a dataset of (instruction, high-quality response) pairs, written or curated by human annotators, and run ordinary cross-entropy (next-token prediction) training on that data. The model learns to map an instruction-style prompt to an assistant-style response rather than a document continuation.
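As a minimal sketch of the objective just described, the snippet below averages the cross-entropy over response tokens only, given the probability the model assigned to each reference token. Skipping the prompt tokens is a common convention but an assumption here, since the text does not specify it; the function name `sft_loss` and the toy numbers are illustrative.

```python
import math

def sft_loss(token_probs, is_response):
    """Mean negative log-likelihood over response tokens.

    token_probs: probability the model gave to each reference token.
    is_response: True where the token belongs to the response
                 (prompt tokens are skipped; an assumed convention).
    """
    nll = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not nll:
        raise ValueError("no response tokens to train on")
    return sum(nll) / len(nll)

# Toy example: two prompt tokens followed by three response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

Only the last three probabilities contribute, so raising the model's confidence on the prompt tokens would not change this loss.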

Key properties of the SFT dataset:

Property	Typical target
Volume	10k - 100k examples
Annotation source	Expert human labellers, or strong model + human review
Format	Instruction + response, multi-turn dialogues
Quality bar	Correctness, coherence, appropriate length

SFT alone produces a model that sounds like an assistant. It will be helpful on common queries, but it will also confidently confabulate, refuse inconsistently, and favour verbose responses over accurate ones, because those behaviours were present in at least some training demonstrations. This motivates the next two stages.

Stage 2: Reward Modelling

You cannot directly tell a model "be more helpful and less harmful" via gradient descent. You need a learned scalar proxy for that preference, one that can score any completion automatically. The reward model (RM) is that proxy.

Training procedure:

Take a prompt. Generate several completions from the SFT model.
Ask human raters to rank those completions (preferred over less-preferred).
Convert rankings into pairwise comparisons: (prompt, chosen, rejected).
Fine-tune a copy of the language model, replacing the final layer with a scalar head, to predict r(prompt, completion).
Optimise with a Bradley-Terry pairwise loss:

\[\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]\]

where y_w is the preferred completion and y_l is the less preferred one.
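The data preparation and loss above can be sketched in a few lines. This is an illustrative sketch only: `ranking_to_pairs` and `bradley_terry_loss` are hypothetical helper names, the rewards are passed in as plain floats rather than produced by a model, and the scores are made up.

```python
import math
from itertools import combinations

def ranking_to_pairs(prompt, ranked_completions):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) triples."""
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_completions, 2)]

def bradley_terry_loss(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) reward pairs."""
    def neg_log_sigmoid(z):
        # Numerically stable form of log(1 + exp(-z)).
        return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    losses = [neg_log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)

prompt = "What is the capital of France?"
pairs = ranking_to_pairs(prompt, ["A", "B", "C"])

# Hypothetical RM scores for the three completions.
scores = {"A": 2.0, "B": 0.5, "C": -1.0}
loss = bradley_terry_loss([(scores[w], scores[l]) for _, w, l in pairs])
```

A ranking of K completions yields K(K-1)/2 comparisons, and the loss falls as the RM scores each chosen completion further above its rejected counterpart.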

The RM is intentionally a separate model from the policy. Sharing weights introduces gradient interference; using separate weights lets you freeze the RM after training and treat it as an oracle during the RL stage.

The reward model is only as reliable as the human annotations it was trained on. Inter-annotator agreement on subjective qualities such as "helpfulness" is often below 75%, meaning the RM is learning a noisy signal from the start.
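One simple way to quantify that noise is the raw pairwise agreement rate: the fraction of annotator pairs that picked the same completion on the same comparison. The sketch below assumes each comparison was labelled by several annotators; the function name and the labels are hypothetical, and the text does not say which agreement statistic its figure refers to.

```python
from itertools import combinations

def pairwise_agreement(labels_per_item):
    """Fraction of annotator pairs that picked the same completion.

    labels_per_item: one list per comparison, holding each annotator's
    choice (for example "a" or "b").
    """
    agree = total = 0
    for labels in labels_per_item:
        for first, second in combinations(labels, 2):
            agree += first == second
            total += 1
    if total == 0:
        raise ValueError("need at least two labels on some item")
    return agree / total

# Hypothetical labels from three annotators on four comparisons.
labels = [["a", "a", "a"], ["a", "b", "a"], ["b", "b", "a"], ["b", "b", "b"]]
rate = pairwise_agreement(labels)
```

Raw agreement does not correct for chance, so a chance-corrected statistic may be preferable when the label distribution is skewed.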

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

With an RM in hand, you now optimise the SFT model (the policy) to maximise the RM's score. This is an RL problem: the policy produces a completion (action), the RM scores it (reward), and PPO (Proximal Policy Optimisation) updates the policy weights.

The objective used in practice adds a KL-divergence penalty:

\[\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\left[r_\phi(x, y) - \beta \, \text{KL}\!\left[\pi_\theta(y|x) \,\|\, \pi_\text{SFT}(y|x)\right]\right]\]

The KL term keeps the RLHF-tuned policy close to the SFT checkpoint. Without it, the policy finds "reward hacks": degenerate outputs, often padded or repetitive, that score well on the RM but are useless to actual users. The coefficient beta controls this trade-off and is often set between 0.01 and 0.1.
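For a single sampled completion, the KL-penalised objective above can be sketched as the RM score minus beta times the summed log-probability ratio between policy and reference. This uses a single-sample estimate of the KL term, which is an assumption of the sketch; implementations differ on details such as whether the penalty is applied per token. The function name and numbers are illustrative.

```python
def kl_penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """RM score minus beta times a sampled KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled completion under the current policy and the frozen SFT model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Toy numbers: the policy has drifted towards this completion.
shaped = kl_penalised_reward(
    rm_score=1.0,
    policy_logprobs=[-0.5, -0.5, -1.0],
    ref_logprobs=[-1.0, -1.5, -2.0],
    beta=0.1,
)
```

When the policy and the reference assign identical log-probabilities the penalty vanishes and the shaped reward equals the raw RM score.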

RLHF is operationally complex: you need four models loaded simultaneously (policy, reference SFT model, reward model, and a value model for PPO). Memory footprint is substantial, training is sensitive to hyperparameters, and reward hacking is a persistent concern.

Stage 4: DPO and the Preference-Optimisation Family

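The mathematical insight behind DPO (Rafailov et al., 2023) is that the optimal policy for the KL-penalised RLHF objective can be expressed in closed form given the reward function. Inverting that expression writes the reward in terms of the policy and the reference model, up to a prompt-dependent term that cancels in pairwise comparisons. Substituting it into the Bradley-Terry preference loss yields a loss defined directly over preference pairs, with no separate RM and no RL loop: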

\[\mathcal{L}_{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]\]

In plain terms: the loss shrinks as the policy raises the probability of the winning completion, relative to the reference model, by more than it raises the probability of the losing one, and the policy is penalised when the reverse holds. The reference model (frozen SFT checkpoint) plays the same KL-anchoring role as in RLHF.
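The DPO loss above can be evaluated for one preference pair from four numbers: the summed log-probability of each completion under the policy and under the reference. The sketch below is a minimal illustration under that assumption; `dpo_loss`, the default beta, and the log-probabilities are hypothetical.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (winner log-ratio - loser log-ratio)) for one pair.

    Each argument is the summed log-probability of a whole completion
    given the prompt, under the policy or the frozen reference model.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At initialisation the policy equals the reference, so the margin is zero.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Later the policy has raised the winner and lowered the loser
# relative to the reference, so the loss is smaller.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Because only log-ratios against the reference enter the loss, a policy that raises both completions by the same amount gains nothing.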

Practical comparison:

Criterion	RLHF (PPO)	DPO
Models in memory during training	4	2
Requires online sampling	Yes	No (offline data)
Stability	Moderate	High
Expressiveness	High	Moderate
Reward hacking risk	High	Lower

DPO and its variants (IPO, KTO, ORPO, SimPO) are now common in open-source post-training recipes. Many publicly released instruction-tuned models from 2024 onwards use some form of offline preference optimisation rather than full PPO.

Merging: A Post-Hoc Shortcut

After full RLHF or DPO, practitioners often merge specialised checkpoints rather than training a single model on all objectives. Task arithmetic (Ilharco et al., 2022) showed that the delta weights between a fine-tuned model and its base can be treated as vectors: you can add multiple such vectors to produce a model that inherits capabilities from each.

A common merging recipe for chat models:

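# Each *_weights is a dict mapping parameter name -> tensor (a state dict).
base_weights = pretrained_checkpoint
# Task vectors: per-parameter deltas from the shared base.
sft_delta = {k: sft_weights[k] - base_weights[k] for k in base_weights}
dpo_delta = {k: dpo_weights[k] - base_weights[k] for k in base_weights}

# alpha and beta are scalar merge coefficients (this beta is unrelated to the KL coefficient).
merged = {k: base_weights[k] + alpha * sft_delta[k] + beta * dpo_delta[k]
          for k in base_weights}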

SLERP (spherical linear interpolation in weight space) and DARE-TIES are refinements that handle conflicting parameter updates more gracefully. Merging is fast (no GPU training), but the resulting model is untested by construction and can combine failure modes from each source checkpoint as readily as it combines strengths.
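To make the SLERP idea concrete, here is a minimal sketch that interpolates between two flattened weight vectors along the arc between them rather than along a straight line. It assumes non-zero vectors that are not anti-parallel, and it treats a whole tensor as one vector; how real merging tools slice and normalise weights is not covered by the text and is not modelled here.

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two flattened weight vectors."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a straight average, the midpoint here keeps unit norm, which is the property that motivates SLERP over linear interpolation.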

When It Falls Down

Reward hacking in RLHF. The RM is a finite-capacity model trained on finite data. Maximising it with RL pushes the policy toward completions outside the distribution the RM was trained to evaluate. The policy exploits these blind spots, for example by producing padded, verbose outputs that fool the RM while providing little real value. The KL penalty slows this but does not stop it.
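One cheap diagnostic for the verbosity failure described above is to check how strongly RM scores track completion length on a sample of policy outputs. The sketch below computes a plain Pearson correlation; the helper name and the data are hypothetical, and a high correlation is only a warning sign, not proof of reward hacking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: completion lengths in tokens and their RM scores.
lengths = [50, 120, 200, 400]
rm_scores = [0.1, 0.4, 0.9, 1.6]
length_bias = pearson(lengths, rm_scores)
```

Tracking this number across RL checkpoints is one way to notice the policy drifting toward longer outputs for the sake of reward.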

Distribution shift in SFT data. If annotators write responses in a particular style or length range, the SFT model inherits those biases. Brevity, list formatting, and over-hedging ("As an AI language model...") are all SFT artefacts, not model beliefs.

DPO's implicit reward is brittle. DPO does not learn an explicit reward model, so there is no standalone reward model to sanity-check on held-out comparisons before the policy is optimised. Pathological preference data (noisy labels, mislabelled comparisons) quietly degrades the policy in ways that are hard to diagnose.

Annotation quality degrades at scale. As annotation volume grows, average annotator quality tends to drop and coverage of difficult edge cases stays thin. The RM can then be confidently wrong about rare but important cases.

Merging is unprincipled. There is no guarantee that two delta vectors are orthogonal. Merge coefficients are typically chosen by grid search on benchmarks, not by any theoretical criterion, and generalisation off-benchmark is untested.
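Orthogonality of two delta vectors can at least be measured before merging. The sketch below computes the cosine similarity between two flattened task vectors: values near zero suggest little direct overlap, while strongly negative values suggest the updates pull the same parameters in opposite directions. The function name and the toy deltas are illustrative, and this check is a heuristic rather than a guarantee about merged behaviour.

```python
import math

def cosine_similarity(delta_a, delta_b):
    """Cosine of the angle between two flattened task vectors."""
    dot = sum(a * b for a, b in zip(delta_a, delta_b))
    norm_a = math.sqrt(sum(a * a for a in delta_a))
    norm_b = math.sqrt(sum(b * b for b in delta_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero task vector has no direction")
    return dot / (norm_a * norm_b)

# Toy deltas: the first two coordinates pull in opposite directions.
sft_delta = [0.2, -0.1, 0.0, 0.3]
dpo_delta = [-0.2, 0.1, 0.4, 0.0]
similarity = cosine_similarity(sft_delta, dpo_delta)
```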

The SFT ceiling. All downstream stages are bounded by SFT data quality. If the SFT demonstrations contain factual errors, the RM will be trained partly on errors as ground-truth positives, and no amount of RL will fully correct this.