Proximal Policy Optimisation
PPO stabilises policy gradient training by clipping the probability ratio between old and new policies, preventing destructively large updates without the computational overhead of second-order methods.
Before PPO, the practical choice for continuous-control RL was brutal: either accept the instability of vanilla policy gradients, or pay the cost of Trust Region Policy Optimisation (TRPO), which solves a KL-constrained optimisation problem at every update using conjugate gradients and a line search, a procedure costly and awkward enough that it was rarely practical for large neural networks. Schulman et al. published PPO in 2017 as a direct response to that tradeoff. It has since become a default choice for policy optimisation: OpenAI used it for the RLHF stage behind InstructGPT and ChatGPT, it is a standard baseline in simulated locomotion research, and it sits behind a large fraction of the systems described as "fine-tuned with RL" in the years since.
The Core Tension: Stability vs. Simplicity
Vanilla policy gradient methods update parameters by following the gradient of the expected return:
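In the standard formulation, that gradient and the resulting parameter update can be written as follows. The notation is an assumption, not taken from the article: $\pi_\theta$ is the policy, $\hat{A}_t$ an advantage estimate at step $t$, and $\alpha$ the learning rate.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
    \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

Nothing in this update limits how far a single step can move the policy, which is the source of the instability that PPO's ratio clipping is meant to contain.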