Baselines and Variance Reduction

The naive REINFORCE estimator works, but it requires thousands of trajectories to get a reliable gradient signal. A single rollout in a sparse-reward maze might return all zeros for a long time, then suddenly a large positive return; the resulting gradient updates thrash wildly between near-zero and an explosive spike. The fix is not to collect more data first, it is to subtract a clever reference value that leaves the expectation intact but collapses the variance. That reference value is called a baseline.

The Policy Gradient Problem With Variance

The policy gradient theorem gives us a gradient estimator of the form:

∇_θ J(π_θ) ≈ (1/N) Σ_τ Σ_t ∇_θ log π_θ(aₜ|sₜ) · Gₜ

where Gₜ = Σ_{t'≥t} γ^{t'-t} r_{t'} is the discounted return from step t. The estimator is unbiased; its expectation over trajectories equals the true gradient. The problem is that Gₜ has enormous variance. Returns depend on every random action taken after step t, every stochastic transition, and every noise source in the environment. In practice the variance scales roughly with the square of the episode length and the range of rewards, making naive Monte Carlo policy gradients nearly unusable on long-horizon tasks.

Reducing variance is therefore not optional; it determines whether gradient-based policy optimisation converges at all within a reasonable compute budget.

The Baseline: Bias-Free Variance Reduction

Any function b(sₜ) that depends only on the current state (not on the action taken) can be subtracted from Gₜ without changing the gradient's expectation. The formal justification rests on the Expected Grad-Log-Prob (EGLP) lemma:

E_{aₜ ~ πθ}[∇_θ log π_θ(aₜ|sₜ) · b(sₜ)] = 0

Because b(sₜ) does not depend on aₜ, it factors out of the expectation over actions, and the remaining term integrates to zero by the definition of a normalised probability distribution. The gradient estimator therefore becomes:

∇_θ J(π_θ) ≈ (1/N) Σ_τ Σ_t ∇_θ log π_θ(aₜ|sₜ) · (Gₜ - b(sₜ))

The key intuition: (Gₜ - b(sₜ)) measures whether the return was better or worse than expected under the current state. Actions that led to above-baseline returns get reinforced; actions that led to below-baseline returns get suppressed. The gradient signal now carries relative information rather than raw magnitude, so the variance shrinks substantially.

The optimal variance-minimising baseline is the expected return from that state under the current policy, b*(sₜ) = V^π(sₜ). When b(sₜ) = V^π(sₜ), the subtracted quantity is exactly the advantage function:

Aᵖ(sₜ, aₜ) = Q^π(sₜ, aₜ) - V^π(sₜ)

This is the quantity used in actor-critic methods.

Practical Baselines and the Actor-Critic Family

Baseline choice	How it is computed	Bias	Variance reduction
Constant (mean return)	Average over a batch of episodes	None	Mild
Value function V^π(s)	Separate neural network, trained with TD or MC	None (if exact)	Strong
Generalised Advantage Estimate (GAE-λ)	Exponential weighting of TD residuals	Small (λ < 1)	Very strong

In practice V^π(s) is approximated by a separate value-head or network, trained concurrently with the policy. This introduces a subtle complication: the baseline is now an estimate, not the true V^π, so it carries approximation error. That error does not affect the gradient's expectation (the zero-expectation argument still holds for any fixed b(s)), but the baseline's quality directly governs how much variance is removed.

The Generalised Advantage Estimator (GAE, Schulman et al. 2016) generalises this by mixing single-step TD errors with longer multi-step returns:

Â_t^GAE(γ,λ) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
where δₜ = rₜ + γ V(sₜ₊₁) - V(sₜ)  [TD residual]

Setting λ = 1 recovers Monte Carlo advantage (low bias, high variance); setting λ = 0 gives the one-step TD advantage (higher bias, lower variance). The λ knob lets practitioners interpolate between the two extremes. PPO, TRPO, and most production actor-critic implementations default to λ ∈ [0.9, 0.99].

Implementing a Value-Function Baseline

A minimal PyTorch sketch shows the two-loss structure:

# One training step (simplified; omits batching and GAE computation)
logits = policy_net(obs)           # shape: [T, |A|]
values = value_net(obs).squeeze()  # shape: [T]

# Advantage: returns - baseline (no gradient through advantage for policy loss)
advantages = returns - values.detach()

# Policy loss: maximise log-prob * advantage
log_probs = F.log_softmax(logits, dim=-1)
selected  = log_probs[range(T), actions]
policy_loss = -(selected * advantages).mean()

# Value loss: fit value net to observed returns
value_loss = F.mse_loss(values, returns)

loss = policy_loss + 0.5 * value_loss
loss.backward()

The 0.5 coefficient on value_loss is a common hyperparameter, reflecting the fact that the two losses operate at different scales. The detach() on values when computing advantages ensures the policy gradient does not flow through the baseline network.

When It Falls Down

Bootstrapping bias compounds. When the value network is poorly initialised, the baseline is systematically wrong. Early in training it can increase effective variance or point gradients in the wrong direction. This bootstrapping bias is worst in sparse-reward or long-horizon settings where the value estimates take many updates to converge.

Shared network representations cause interference. Using a single network with a shared trunk for both policy logits and value estimates (common for parameter efficiency) means value-loss gradients can corrupt policy features. Gradient clipping, separate learning rates, and careful loss weighting all help, but there is no universal solution.

The baseline assumption breaks with non-stationary rewards. The zero-expectation argument holds for b(s) that are functions of state only. If the baseline depends on the action (e.g., a naive action-conditioned baseline), the bias-free guarantee is lost. Some methods (e.g., optimal action-dependent baselines) recover this carefully, but they require substantially more bookkeeping.

GAE-lambda requires accurate value estimates. GAE's variance reduction only materialises if the value function is reasonably accurate. On tasks where the value landscape changes sharply between policy updates (e.g., highly non-stationary opponent play in multi-agent settings), the GAE estimate degrades and may be worse than a simpler Monte Carlo return.

Normalising advantages can hurt on multimodal reward distributions. Dividing advantage estimates by their batch standard deviation (a common implementation detail in PPO) stabilises learning on unimodal reward distributions. On tasks where a few trajectories yield extremely large positive returns and most yield zero, normalisation can wash out the signal from those rare successes.

The Policy Gradient Problem With Variance

The Baseline: Bias-Free Variance Reduction

Practical Baselines and the Actor-Critic Family

Implementing a Value-Function Baseline

When It Falls Down

Further Reading