PPO for Language Models
Proximal Policy Optimisation clips the probability ratio between the updated policy and the previous one to prevent destructively large policy updates, making it the workhorse algorithm for RLHF fine-tuning of large language models.
advanced · 8 min read
GPT-3 could write fluent prose about almost anything, yet early user studies found it frequently produced responses that were unhelpful, dishonest, or subtly toxic. The gap between "predicts plausible tokens" and "behaves as intended" is not closed by more pretraining data. It is closed by optimisation against a signal of human preference, and the algorithm that made this practical at scale is Proximal Policy Optimisation (PPO).
What PPO actually optimises
Standard policy gradient methods compute the gradient of expected reward with respect to policy parameters. The update rule is:
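For reference, the usual textbook form of this update can be written as follows. This is a sketch under assumed notation rather than the article's own formula: \theta_k for the policy parameters at iteration k, \alpha for the step size, \pi_\theta for the policy, \tau for a sampled trajectory, and \hat{A}_t for an advantage estimate (the raw return can be substituted in the simplest REINFORCE variant).

```latex
% Gradient ascent on the expected-reward objective J(\theta)
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k)

% Policy gradient estimator (score-function form)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]
```

Nothing in this form limits how far a single step moves the policy away from the one that generated the samples, which is the problem that PPO's clipping of the probability ratio is meant to address.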