The KL-Regularised RL Objective
The KL-regularised RL objective balances reward maximisation against a penalty that keeps the policy close to a reference model, preventing reward hacking while allowing genuine improvement.
A 1.3 billion-parameter InstructGPT model was preferred by human raters over the 175 billion-parameter GPT-3, a model more than 100 times larger, despite receiving no additional capability training. The difference is not scale; it is the objective the model was optimised against. That objective contains a term most practitioners gloss over: the KL divergence penalty. Understanding why it is there, what it does at every training step, and precisely where it breaks down separates practitioners who can tune RLHF pipelines from those who cargo-cult hyperparameters.
The Bare Reward Problem
Reinforcement learning requires a scalar reward signal. In language model post-training the reward typically comes from a learned reward model (RM) trained on human preference comparisons. Naively, the RL objective is:
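In one common notation (an assumption here, not necessarily the symbols used elsewhere in this piece), the bare objective is the expected reward-model score of responses sampled from the policy. The symbols are illustrative: $\pi_\theta$ is the policy being trained, $\mathcal{D}$ is the prompt distribution, and $r_\phi$ is the learned reward model.

```latex
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
```

Nothing in this expression ties $\pi_\theta$ to the model it started from, so the policy is free to drift toward any output the reward model happens to score highly.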