Trust-Region Policy Optimisation

Policy gradient methods have a structural problem that kills training runs without warning: a single bad step can collapse a policy from expert-level behaviour into random flailing, and gradient ascent gives you no mechanism to detect or prevent this. In 2015, Schulman, Levine, Moritz, Jordan, and Abbeel published Trust Region Policy Optimisation (TRPO), a method that surrounds each update with a hard geometric constraint, turning an unbounded hill-climbing problem into a principled constrained optimisation.

The core instability vanilla policy gradients cannot fix

Standard REINFORCE and its variants maximise the expected return by computing:

∇J(θ) = E[∇ log π_θ(a|s) · A(s, a)]

where A(s, a) is the advantage estimate. The gradient points uphill in parameter space, but "uphill in parameter space" does not map cleanly onto "better policy". A large step in θ-space can move the policy distribution dramatically, invalidating the advantage estimates that were computed under the old policy. The result is an off-distribution update that often over-corrects, destabilising training.

The natural gradient partially addresses this by pre-multiplying by the inverse Fisher information matrix, which rescales the step into policy-distribution space rather than parameter space. TRPO formalises this intuition as an explicit constraint.
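To make the mismatch between parameter space and distribution space concrete, here is a minimal sketch, assuming a three-action softmax policy whose parameters are simply its logits. The helper names (`softmax`, `kl`) and the specific logit values are illustrative choices, not taken from the TRPO paper. The same step in θ is applied at two different starting points, and the resulting KL divergence between old and new policy is compared.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q) for two strictly positive categorical distributions.
    return float(np.sum(p * (np.log(p) - np.log(q))))

# One fixed step in parameter space, applied at two different policies.
step = np.array([1.0, -1.0, 0.0])

theta_flat = np.array([0.0, 0.0, 0.0])    # uniform policy
theta_peaked = np.array([8.0, 0.0, 0.0])  # nearly deterministic policy

kl_flat = kl(softmax(theta_flat), softmax(theta_flat + step))
kl_peaked = kl(softmax(theta_peaked), softmax(theta_peaked + step))

print(f"KL after step from uniform policy: {kl_flat:.6f}")
print(f"KL after step from peaked policy:  {kl_peaked:.6f}")
```

The two steps have identical Euclidean length, yet the policy moves far more in the first case than in the second. A fixed learning rate in θ-space therefore cannot control how much the policy itself changes, which is the gap the Fisher rescaling is meant to close.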

The trust-region objective

TRPO replaces the unconstrained gradient step with a constrained optimisation:

maximise over θ:   L(θ_old, θ) = E_s,a ~ π_old [ (π_θ(a|s) / π_old(a|s)) · A_old(s, a) ]

subject to:        E_s [ KL( π_old(·|s) || π_θ(·|s) ) ] ≤ δ

Term	Role
Importance ratio π_θ / π_old	Re-weights old samples to evaluate new policy
A_old	Advantage under old policy; keeps evaluation valid
KL constraint (≤ δ)	Hard ceiling on how far the distribution can shift

The quantity L is the "surrogate objective": it is a first-order approximation to the true objective improvement, valid as long as the new policy stays close to the old one. The KL constraint operationalises "close" in distribution space rather than parameter space, so the bound on policy degradation is meaningful regardless of the parameterisation.
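The two quantities in the constrained problem can be estimated from a batch of samples. The following is a minimal sketch, assuming discrete actions and policies represented as explicit probability tables; the function name `surrogate_and_kl`, the toy numbers, and the value of δ are all hypothetical and chosen only for illustration.

```python
import numpy as np

def surrogate_and_kl(probs_old, probs_new, actions, advantages):
    """Sample estimates of the surrogate objective L and the mean KL.

    probs_old, probs_new: (batch, n_actions) action probabilities per state
    actions:              (batch,) actions that were sampled from the old policy
    advantages:           (batch,) advantage estimates under the old policy
    """
    idx = np.arange(len(actions))
    # Importance ratio pi_theta(a|s) / pi_old(a|s) for the sampled actions.
    ratio = probs_new[idx, actions] / probs_old[idx, actions]
    surrogate = float(np.mean(ratio * advantages))
    # KL(pi_old(.|s) || pi_theta(.|s)) per state, then averaged over states.
    kl_per_state = np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)
    return surrogate, float(np.mean(kl_per_state))

# Toy batch: three states, two actions (illustrative values only).
probs_old = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])
probs_new = np.array([[0.6, 0.4], [0.75, 0.25], [0.3, 0.7]])
actions = np.array([0, 1, 1])
advantages = np.array([1.0, -0.5, 0.2])

surrogate, mean_kl = surrogate_and_kl(probs_old, probs_new, actions, advantages)
delta = 0.01  # assumed trust-region radius
within_trust_region = mean_kl <= delta
print(surrogate, mean_kl, within_trust_region)
```

When the new policy equals the old one, every ratio is 1, the surrogate reduces to the mean advantage, and the KL is exactly zero, so the old policy is always a feasible point of the constrained problem.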

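The theoretical backbone is the policy improvement bound of Kakade and Langford, which the TRPO paper extends to general stochastic policies: the true expected return of a new policy is bounded below by the surrogate objective minus a penalty proportional to the KL divergence between the old and new policies. Keeping the KL within δ therefore provides an approximate monotonic improvement guarantee. It is only approximate because the advantages are estimated from noisy samples and because the practical algorithm constrains the average KL over states rather than the maximum KL that appears in the bound.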
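For reference, the bound can be written out as follows. This is the form stated as Theorem 1 in the TRPO paper, in that paper's notation rather than this article's: η is the expected discounted return (J above), γ is the discount factor, and L_π is the paper's surrogate, which includes the baseline return η(π) and uses unnormalised discounted state-visitation frequencies ρ_π, so it differs from the L above by an additive constant and a scale factor.

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad
C = \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}},
\qquad
\epsilon = \max_{s,a} \lvert A_{\pi}(s, a) \rvert

L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) \, A_{\pi}(s, a),
\qquad
D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) = \max_{s} D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)
```

Since the right-hand side equals η(π) when the new policy is the old one, any new policy that increases the right-hand side is guaranteed not to decrease the true return. The penalty coefficient C is typically far too conservative to use directly, which is one motivation for replacing the penalty with a fixed KL budget δ.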

Solving the constrained problem efficiently

The naive approach, forming and inverting the full Fisher information matrix, needs O(n^2) memory to store the matrix and O(n^3) time to invert it, where n is the number of parameters. For a neural network with millions of weights this is impractical. TRPO sidesteps this with two tricks.

Conjugate gradient. Instead of inverting the Fisher matrix F, TRPO solves the linear system F·x = g (where g is the policy gradient) iteratively using conjugate gradient, which needs only Fisher-vector products Fv and never the matrix itself. Because F is the Hessian of the mean KL divergence evaluated at θ_old, Fv is a Hessian-vector product: it can be computed with two backward passes, at a cost of roughly a small constant multiple of a single gradient computation.
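The conjugate gradient solve can be sketched as below. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `conjugate_gradient`, the iteration count, the tolerance, and the value of δ are hypothetical, and a small explicit matrix stands in for F so the result can be checked. In a real implementation `fvp` would be computed by automatic differentiation without ever forming F. The final rescaling uses the quadratic approximation KL ≈ ½ sᵀFs to pick the step length that meets the budget δ.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=50, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, which equals g at x = 0
    p = r.copy()          # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:  # squared residual norm is small enough
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Stand-in for the Fisher matrix: a small symmetric positive-definite matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T / 5 + np.eye(5)
g = rng.normal(size=5)    # stand-in for the policy gradient

def fvp(v):
    # Here F is explicit; in practice this would be an autodiff product.
    return F @ v

x = conjugate_gradient(fvp, g)

# Scale the direction so the quadratic KL estimate 0.5 * s^T F s equals delta.
delta = 0.01              # assumed trust-region radius
step = np.sqrt(2 * delta / (x @ fvp(x))) * x
print(np.allclose(F @ x, g, atol=1e-4), 0.5 * step @ fvp(step))
```

Only `fvp` touches F, so memory use stays linear in the number of parameters, and a small fixed number of iterations is usually enough for an adequate search direction.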
