Foundations
Trust-Region Policy Optimisation
TRPO is a policy-gradient algorithm that enforces a KL-divergence constraint on each update, guaranteeing monotonic policy improvement and preventing the catastrophic performance collapses that plague vanilla gradient ascent.
advanced · 5 min read · Premium
Policy gradient methods have a structural problem that kills training runs without warning: a single bad step can collapse a policy from expert-level behaviour into random flailing, and gradient ascent gives you no mechanism to detect or prevent this. In 2015, Schulman, Levine, Moritz, Jordan, and Abbeel published Trust Region Policy Optimisation (TRPO), a method that surrounds each update with a hard geometric constraint, turning an unbounded hill-climbing problem into a principled constrained optimisation.
The core instability vanilla policy gradients cannot fix
Standard REINFORCE and its variants maximise the expected return by computing:
Keep reading with Pro.
You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.