PPO for Language Models

Proximal Policy Optimisation clips the policy update ratio to prevent destructive gradient steps, making it the workhorse algorithm for RLHF fine-tuning of large language models.

GPT-3 could write fluent prose about almost anything, yet early user studies found it frequently produced responses that were unhelpful, dishonest, or subtly toxic. The gap between "predicts plausible tokens" and "behaves as intended" is not closed by more pretraining data. It is closed by optimisation against a signal of human preference, and the algorithm that made this practical at scale is Proximal Policy Optimisation (PPO).

What PPO actually optimises

Standard policy gradient methods compute the gradient of expected reward with respect to policy parameters. The update rule is:
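
As a sketch of the update this sentence refers to, the textbook form of the vanilla policy gradient step can be written as below, followed by the clipped surrogate that PPO substitutes for it. The notation is an assumption for illustration and is not taken from the text: \theta for the policy parameters, \alpha for the learning rate, \hat{A}_t for an advantage estimate, and \epsilon for the clip range.

```latex
% Vanilla policy gradient: gradient ascent on expected reward J(\theta)
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k),
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \right]

% PPO: probability ratio between the new and old policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% PPO: clipped surrogate objective, maximised instead of J(\theta)
L^{\text{CLIP}}(\theta)
  = \mathbb{E}_t
    \left[ \min\!\left( r_t(\theta) \, \hat{A}_t,\;
      \operatorname{clip}\!\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]
```

In the language-model setting, a_t would typically be the next token and s_t the prompt plus the tokens generated so far. The min with the clipped term removes any incentive to push the ratio r_t(\theta) outside [1 - \epsilon, 1 + \epsilon], which is how the clipping keeps each update close to the policy that generated the samples.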
