Proximal Policy Optimisation

Before PPO, the practical choice for continuous-control RL was brutal: either accept the instability of vanilla policy gradients, or pay the cost of Trust Region Policy Optimisation (TRPO), which required conjugate gradients and a constrained optimisation step so expensive it was effectively inaccessible for large neural networks. Schulman et al. shipped PPO in 2017 as a direct response to that tradeoff. It is now the default training algorithm behind ChatGPT's RLHF stage, most of Google DeepMind's locomotion work, and a large fraction of everything labelled "fine-tuned with RL" in the past five years.

The Core Tension: Stability vs. Simplicity

Vanilla policy gradient methods update parameters by following the gradient of the expected return:

∇J(θ) = E[ ∇ log π_θ(a|s) · A(s, a) ]

where A(s, a) is the advantage - how much better action a is relative to the baseline. The problem is step size. A large learning rate can catapult the policy into a region where the old samples no longer represent the distribution well, and since the old samples are exactly what you are training on, the gradient signal becomes wrong in a self-reinforcing way. Recovery is slow or impossible.

TRPO solved this by enforcing an explicit KL constraint:

maximise  E[ (π_θ / π_θ_old) · A ]
subject to  KL( π_θ_old || π_θ ) ≤ δ

Correct, but the constrained step involves computing the Fisher information matrix and running an inner conjugate-gradient loop, which is incompatible with most deep learning toolchains and requires careful implementation.

PPO's insight: you do not need the exact constrained solution. You just need to prevent the ratio from straying too far. Clipping does that cheaply.

The Clipped Surrogate Objective

Define the probability ratio:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)

The unclipped surrogate objective is E[ r_t(θ) · A_t ]. PPO-Clip modifies it:

L_CLIP(θ) = E[ min( r_t(θ) · A_t,  clip(r_t(θ), 1-ε, 1+ε) · A_t ) ]

The min is the operative word. Consider two cases:

Advantage sign	What clipping does
A_t > 0 (good action)	r_t is allowed to rise at most to 1+ε before the objective stops increasing; no incentive to push the ratio further
A_t < 0 (bad action)	r_t is allowed to fall at most to 1-ε before the objective stops decreasing; prevents aggressively supressing the action beyond the trust region

The result is a pessimistic lower bound on the unclipped objective. Gradient updates can only improve performance within the clip boundary; any attempted step outside it yields zero gradient from the clipped term. Typical ε is 0.1 to 0.2.

There is also a KL-penalty variant (PPO-KL), which adds a penalty β · KL(π_θ_old, π_θ) to the loss and adaptively adjusts β depending on whether the measured KL divergence exceeds a target. In practice, PPO-Clip is more common because it does not require tuning the penalty coefficient.

The Core Tension: Stability vs. Simplicity

The Clipped Surrogate Objective

Keep reading with Pro.