Policy Gradients and REINFORCE

Value-based methods like Q-learning work well when you can enumerate or discretise the action space, but break down the moment you want a robot arm to choose a continuous joint torque or a language model to assign probability over thousands of tokens. Policy gradient methods were built for exactly this situation: instead of learning which action is best, you learn a parameterised probability distribution over actions and nudge its parameters in the direction that increases expected reward.

The core idea: differentiating through expected return

Let the policy be a parameterised distribution \(\pi_\theta(a \mid s)\). The objective is to maximise the expected return over trajectories \(\tau = (s_0, a_0, r_0, s_1, \ldots)\):

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]\]

To do gradient ascent on \(J\), you need \(\nabla_\theta J(\theta)\). The problem is that the expectation itself depends on \(\theta\) (through the sampling distribution), so you cannot simply push the gradient inside a fixed expectation.

The policy gradient theorem resolves this. Using the log-derivative (or score function) trick:

\[\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]\]

The identity \(\nabla_\theta \log p = \nabla_\theta p / p\) lets you convert a gradient of an expectation into an expectation of a gradient, which you can then estimate with Monte Carlo samples. The policy does not need to be differentiable with respect to outcomes, only with respect to its own parameters.

Why this matters: the estimator is unbiased. You can collect a rollout, compute the return, multiply by the sum of log-probability gradients, and take a gradient step. No model of the environment required. No Bellman backup required.

REINFORCE: the simplest instantiation

Ronald Williams formalised this into the REINFORCE algorithm in 1992. The update rule for a single trajectory is:

\[\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\]

where \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\) is the discounted return from step \(t\) onwards (reward-to-go). Using reward-to-go rather than the full trajectory return reduces variance without introducing bias, because future actions cannot causally affect past rewards.

A minimal PyTorch implementation looks like this:

# Collect one episode
log_probs, rewards = [], []
state = env.reset()
done = False
while not done:
    dist = Categorical(policy_net(state))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))
    state, reward, done, _ = env.step(action.item())
    rewards.append(reward)

# Compute discounted returns (reward-to-go)
G, returns = 0.0, []
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)

# Gradient ascent step
loss = -torch.stack(log_probs) * returns   # negative: we minimise
loss.sum().backward()
optimiser.step()

dist.log_prob(action) is the score function term. Multiplying it by the return and negating converts maximisation of \(J\) into the minimisation that optimisers expect. PyTorch's distributions module handles the \(\log \pi\) computation for any distribution type (Categorical for discrete, Normal for continuous).

Baselines and the variance problem

REINFORCE's estimator is unbiased, but its variance is typically enormous. A trajectory that happens to score a large return pushes up the probabilities of every action in it, including irrelevant or mildly harmful ones. Across sparse-reward environments, learning can take millions of episodes because the signal is buried in noise.

The standard fix is to subtract a baseline \(b(s_t)\) from the return:

\[\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t))\]

Subtracting a baseline that depends only on the state leaves the estimator unbiased (the baseline's expected gradient contribution is zero by the EGLP lemma), but reduces variance by centring the signal around what the agent already expects. The optimal baseline in the mean-squared-error sense is the value function \(V^\pi(s_t)\). When you train a separate network to approximate this and use it as a baseline, you have an actor-critic architecture.

Variant	Baseline	Bias	Variance
Vanilla REINFORCE	None	0	Very high
REINFORCE with mean baseline	Episode mean return	0	Moderate
Actor-critic (one-step)	\(V(s_t)\)	Small (bootstrap bias)	Low
GAE (Schulman et al. 2016)	\(V^\lambda\)-weighted	Tunable	Very low

From REINFORCE to modern policy optimisation

REINFORCE demonstrated that direct policy search is viable, but two practical problems remained: sample efficiency and stability.

Sample efficiency: each collected trajectory is used for a single gradient update, then discarded, because the estimator is only valid under the current policy (on-policy). This makes REINFORCE expensive: you sample, update, discard, repeat.

Update instability: a poorly scaled gradient step can collapse the policy to a degenerate distribution, and there is no natural recovery mechanism.

TRPO (2015) addressed stability by constraining updates to a trust region defined by the KL divergence between old and new policy. PPO (Schulman et al. 2017, arXiv:1707.06347) simplified TRPO by replacing the hard constraint with a clipped surrogate objective that is cheap to compute:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t\right)\right]\]

where \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)\) is the importance ratio and \(A_t\) is the estimated advantage. The clip operation stops any single update from moving the policy too far, while importance sampling allows re-use of old data for a few gradient steps.

PPO is the practical workhorse that REINFORCE's theory enabled: the RLHF fine-tuning step in GPT-4, Claude, and Gemini all rely on a PPO-like loop, treating human preference scores as the reward signal.

When it falls down

Sparse rewards. If reward appears only at episode termination (e.g., winning a game), REINFORCE receives no gradient signal until an episode completes. Random exploration rarely stumbles onto rewarding trajectories, and without reward, the gradient is zero everywhere. Reward shaping, curiosity-driven exploration, or hierarchical methods are needed to proceed.

Continuous, high-dimensional action spaces. Variance scales with trajectory length times action dimensionality. A 100-step episode with a 20-dimensional continuous action space produces a gradient estimate with noise that swamps the signal unless heavy baselines (GAE, value networks) or normalisation are applied.

Sensitivity to hyperparameters. Learning rate, discount factor \(\gamma\), and baseline architecture all interact. Too large a step in a flat reward landscape can land the policy in a collapse region; too small a step and learning stalls. PPO's clip parameter \(\varepsilon\) provides a softer guard rail, but still requires tuning.

Non-stationarity in actor-critic. When the baseline (value network) is trained simultaneously with the policy, the target for the value function is moving. This can cause both networks to chase each other's updates in a feedback loop, particularly without careful learning-rate scheduling.

Credit assignment over long horizons. Reward-to-go partially addresses this, but attributing credit across hundreds of timesteps remains hard. A reward received at \(t=500\) provides an extremely weak gradient signal for actions at \(t=0\) under any finite discount factor.

The core idea: differentiating through expected return

REINFORCE: the simplest instantiation

Baselines and the variance problem

From REINFORCE to modern policy optimisation

When it falls down

Further reading