Foundations

Generalised Advantage Estimation

GAE introduces a single hyperparameter lambda that smoothly interpolates between high-bias/low-variance TD(0) and low-bias/high-variance Monte Carlo advantage estimates, making policy gradient training substantially more stable.

advanced · 8 min read

Policy gradient methods have a well-known variance problem: the simplest unbiased advantage estimate uses the full Monte Carlo return of each trajectory, and the resulting signal is often so noisy that training becomes unstable or diverges outright. The standard fix, subtracting a value-function baseline, helps but does not eliminate the problem. Generalised Advantage Estimation (GAE), introduced by Schulman, Moritz, Levine, Jordan, and Abbeel in 2015, attacks the remaining variance systematically by exponentially down-weighting temporal-difference (TD) residuals from further into the future, trading a small amount of bias for a large reduction in variance.
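As a rough illustration of the exponential down-weighting, the sketch below computes GAE advantages for a single trajectory using the backward recursion A_t = delta_t + gamma * lam * A_{t+1}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the one-step TD residual. The function name `gae_advantages`, its argument layout, and the default values of `gamma` and `lam` are assumptions made for this example, not part of any particular library. It also assumes one unbroken trajectory, with `values` holding one extra entry for the state after the final reward (0.0 if that state is terminal).

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of GAE for one trajectory (hypothetical helper, not a library API).

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T + 1 value estimates V(s_0) .. V(s_T);
             the last entry is the bootstrap value (0.0 if s_T is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0  # holds A_{t+1} while walking backwards through time
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Residuals l steps ahead end up weighted by (gamma * lam) ** l.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    values = [0.5, 0.4, 0.3, 0.0]  # final 0.0: the episode is assumed to end here
    print(gae_advantages(rewards, values, gamma=0.9, lam=0.0))  # one-step TD residuals
    print(gae_advantages(rewards, values, gamma=0.9, lam=1.0))  # discounted return minus V(s_t)
```

With `lam=0.0` every advantage collapses to its own TD residual (the TD(0) end of the spectrum), and with `lam=1.0` it becomes the discounted return minus the value baseline (the Monte Carlo end). Intermediate values blend the two.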

Many widely used policy-optimisation algorithms, PPO among them, rely on GAE in their standard implementations. Understanding it is therefore not optional for anyone serious about reinforcement learning.

The advantage function and why estimating it is hard
