Generalised Advantage Estimation

Policy gradient methods suffer from a well-known variance problem: the simplest unbiased advantage estimate requires rolling out entire trajectories, and the resulting signal is often so noisy that training becomes slow or unstable. The standard fix, subtracting a value-function baseline, helps but does not eliminate the problem. Generalised Advantage Estimation (GAE), introduced by Schulman, Moritz, Levine, Jordan, and Abbeel in 2015, attacks the remaining variance systematically by exponentially down-weighting temporal-difference residuals from further into the future, trading a small amount of bias for a large reduction in variance.

Many widely used policy-optimisation algorithms, including most implementations of PPO, use GAE. Understanding it is therefore close to essential for anyone serious about reinforcement learning.

The advantage function and why estimating it is hard

The advantage of taking action \(a\) in state \(s\) under policy \(\pi\) is:

\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

It measures how much better (or worse) action \(a\) is compared with the average action the policy would take. A perfect advantage signal would tell the optimiser exactly which decisions to reinforce. In practice, we never have the true \(Q^\pi\) or \(V^\pi\); we must estimate them from sampled rollouts.
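As a small worked example of the definition, the sketch below computes advantages for a single state with two actions. The numbers in `q_values` and `policy` are made up for illustration and do not come from any real environment.

```python
# Toy numbers, chosen only for illustration (not from any real environment).
q_values = {"left": 1.0, "right": 3.0}   # hypothetical Q^pi(s, a)
policy = {"left": 0.25, "right": 0.75}   # hypothetical pi(a | s)

# V^pi(s) is the policy-weighted average of Q^pi(s, a).
v = sum(policy[a] * q_values[a] for a in q_values)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
advantages = {a: q_values[a] - v for a in q_values}

# The policy-weighted mean advantage is zero by construction.
mean_advantage = sum(policy[a] * advantages[a] for a in advantages)

print(v, advantages, mean_advantage)
```

Here "right" has a positive advantage and "left" a negative one, so a policy gradient step would shift probability towards "right".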

Two extreme strategies exist:

Estimator	Bias	Variance	Requires
Monte Carlo return minus baseline \(\hat{A}^{MC}\)	Zero, whatever the quality of \(V\)	High (noise from every later step accumulates)	Full episode
One-step TD residual \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)	Nonzero unless \(V\) is exact (value-function errors propagate)	Low	Single step + \(V\)

Neither extreme is satisfactory on its own. Monte Carlo estimates are unbiased but too noisy for reliable gradient computation. TD estimates are smooth but inherit whatever errors live in the value function.
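The variance gap between the two extremes can be seen in a deliberately artificial simulation. The sketch below assumes a toy process in which every reward is independent standard Gaussian noise, so the true value of every state is 0 and \(V = 0\) is an exact baseline; the constants `GAMMA`, `HORIZON`, and `NUM_ROLLOUTS` are arbitrary choices for the illustration.

```python
import random
import statistics

random.seed(0)
GAMMA = 0.99
HORIZON = 50
NUM_ROLLOUTS = 2000

def sample_rewards():
    # Toy assumption: each reward is independent N(0, 1) noise,
    # so every state has true value 0 and V = 0 is exact.
    return [random.gauss(0.0, 1.0) for _ in range(HORIZON)]

mc_estimates, td_estimates = [], []
for _ in range(NUM_ROLLOUTS):
    rewards = sample_rewards()
    # Monte Carlo advantage at t = 0: discounted return minus V(s_0) = 0.
    mc_estimates.append(sum(GAMMA ** l * r for l, r in enumerate(rewards)))
    # One-step TD residual at t = 0: r_0 + gamma * V(s_1) - V(s_0) = r_0.
    td_estimates.append(rewards[0])

mc_variance = statistics.pvariance(mc_estimates)
td_variance = statistics.pvariance(td_estimates)
print(f"MC variance: {mc_variance:.2f}, TD variance: {td_variance:.2f}")
```

In this setup both estimators are unbiased because \(V\) is exact, and the Monte Carlo variance is analytically \(\sum_{l=0}^{49} \gamma^{2l} \approx 32\) times the TD variance. With a learned, imperfect \(V\) the TD residual would keep its low variance but pick up bias.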

The GAE formula

GAE defines the advantage estimate at time \(t\) as a geometrically weighted sum of one-step TD residuals:

\[\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]

where \(\delta_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})\) is the TD residual at step \(t+l\).
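
One way to see where this formula comes from is to start with the \(k\)-step advantage estimators, written here as \(\hat{A}_t^{(k)}\) (a notation introduced for this sketch). Each one sums \(k\) discounted residuals, and the intermediate value terms telescope:

\[\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})\]

For \(\lambda < 1\), taking an exponentially weighted average of these estimators and swapping the order of summation recovers the GAE sum:

\[(1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)} = (1 - \lambda) \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} \sum_{k=l+1}^{\infty} \lambda^{k-1} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\]

The last step uses \(\sum_{k=l+1}^{\infty} \lambda^{k-1} = \lambda^l / (1 - \lambda)\).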

The parameter \(\lambda \in [0, 1]\) is the central control knob:

\(\lambda = 0\): collapses to the single-step TD residual \(\delta_t\). Maximum bias when \(V\) is inaccurate, minimum variance.
\(\lambda = 1\): collapses to the discounted Monte Carlo return minus the baseline \(V(s_t)\). No bias from errors in \(V\), maximum variance.
\(\lambda \in (0, 1)\): a smooth interpolation. In practice, values of 0.95 to 0.99 work well across a wide range of continuous-control tasks.

The discount \(\gamma\) plays its usual role of reducing the effective horizon; \(\lambda\) is an additional, independent control over the bias-variance balance.
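The two limiting cases can be checked numerically on a short episode. The sketch below evaluates the GAE sum directly, truncated at the end of a three-step episode; the helper name `gae_at_start` and the reward and value numbers are hypothetical, chosen only for the check.

```python
def gae_at_start(rewards, values, gamma, lam):
    """Truncated GAE sum at t = 0; values has one more entry than rewards."""
    deltas = [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]
    return sum((gamma * lam) ** l * d for l, d in enumerate(deltas))

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # hypothetical estimates; last entry is the terminal state
gamma = 0.9

# The two extremes, computed independently of the GAE sum.
td_residual = rewards[0] + gamma * values[1] - values[0]
mc_advantage = sum(gamma ** l * r for l, r in enumerate(rewards)) - values[0]

lam0 = gae_at_start(rewards, values, gamma, 0.0)  # should match td_residual
lam1 = gae_at_start(rewards, values, gamma, 1.0)  # should match mc_advantage
mid = gae_at_start(rewards, values, gamma, 0.95)  # an intermediate setting

print(lam0, td_residual)
print(lam1, mc_advantage)
print(mid)
```

With \(\lambda = 1\) the intermediate value estimates cancel, which is why errors in \(V\) beyond the baseline term \(V(s_t)\) no longer affect the result.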

Recursive computation

The infinite sum looks expensive but collapses to a one-pass backward scan through the trajectory. Starting from the end of the collected rollout (length \(T\)):

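# r[0..T-1]: rewards; V[0..T]: value estimates, where V[T] is the bootstrap value (0 if the episode terminated)
delta = [r[t] + gamma * V[t+1] - V[t] for t in range(T)]
A = [0.0] * T
A[T-1] = delta[T-1]
for t in reversed(range(T-1)):
    A[t] = delta[t] + gamma * lam * A[t+1]   # lam is the GAE lambda (lambda is a reserved word in Python)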
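
A rollout collected in practice may contain episode boundaries or be cut off mid-episode. The following is a minimal sketch of the same backward scan with those cases handled, assuming a hypothetical `dones` flag per step and a `last_value` estimate for bootstrapping; the function name and argument layout are illustrative and not taken from any particular library.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Backward-scan GAE over one rollout of length T.

    rewards, values, dones: length-T lists; dones[t] is True if the episode
    ended after step t. last_value: estimate of V for the state after the
    final step, used to bootstrap a rollout that was cut off mid-episode.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(T)):
        # At an episode boundary, neither the next value nor the next
        # advantage belongs to this episode, so both are masked out.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        advantages[t] = delta + gamma * lam * not_done * next_advantage
        next_value = values[t]
        next_advantage = advantages[t]
    return advantages

# Usage on a three-step episode that terminates at the last step.
adv = compute_gae(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 0.4, 0.3],
    dones=[False, False, True],
    last_value=0.0,
    gamma=0.9,
    lam=1.0,
)
print(adv)
```

Implementations commonly also form value-function targets as `advantages[t] + values[t]`; whether that suits a given setup is a design choice this sketch leaves open.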
