Proximal Policy Optimisation

PPO stabilises policy gradient training by clipping the probability ratio between old and new policies, preventing destructively large updates without the computational overhead of second-order methods.

advanced · 8 min read

Before PPO, the practical choice for continuous-control RL was brutal: either accept the instability of vanilla policy gradients, or pay the cost of Trust Region Policy Optimisation (TRPO), whose KL-constrained update relies on conjugate-gradient iterations and a line search, making it expensive and awkward to scale to large neural networks. Schulman et al. introduced PPO in 2017 as a direct response to that tradeoff. It has since become a standard choice for RL fine-tuning: it was used in the RLHF stage behind ChatGPT, it appears widely in locomotion research, and it underlies a large fraction of what has been labelled "fine-tuned with RL" since then.

The Core Tension: Stability vs. Simplicity

Vanilla policy gradient methods update parameters by following the gradient of the expected return:
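
One common way to write this update is sketched below. The notation is an assumption, since none of it is defined in the text above: a policy π_θ with parameters θ, trajectories τ of states s_t and actions a_t, an advantage estimate Â_t, and a step size α.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
    \right],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

Nothing in this update limits how far a single step can move the policy. That is the instability PPO's clipped probability ratio is meant to contain.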
