← Concept library

Foundations

Trust-Region Policy Optimisation

TRPO is a policy-gradient algorithm that enforces a KL-divergence constraint on each update, guaranteeing monotonic policy improvement and preventing the catastrophic performance collapses that plague vanilla gradient ascent.

advanced · 5 min read · Premium

Policy gradient methods have a structural problem that kills training runs without warning: a single bad step can collapse a policy from expert-level behaviour into random flailing, and gradient ascent gives you no mechanism to detect or prevent this. In 2015, Schulman, Levine, Moritz, Jordan, and Abbeel published Trust Region Policy Optimisation (TRPO), a method that surrounds each update with a hard geometric constraint, turning an unbounded hill-climbing problem into a principled constrained optimisation.

The core instability vanilla policy gradients cannot fix

Standard REINFORCE and its variants maximise the expected return by computing:

Keep reading with Pro.

You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.

Sign in to save and react.
Share Copied