Actor-Critic Methods

Pure policy gradient methods have a dirty secret: they work, but they are slow and wildly noisy. REINFORCE, the canonical policy gradient algorithm, updates the policy by scaling the gradient of each action's log-probability by the full episodic return. That return carries massive variance because a single lucky (or unlucky) trajectory can swing the gradient estimate far from its expected value. This is the problem actor-critic methods were built to solve.

The Bias-Variance Dilemma in Policy Gradients

A policy gradient estimator has the general form:

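∇J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_t ∇_θ log π_θ(a_t^i | s_t^i) · Ψ_t^i

where N is the number of sampled trajectories and the inner sum runs over the time steps of trajectory i. The choice of Ψ_t is everything: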

Choice of Ψ_t                       Bias      Variance    Notes
--------------------------------    -------   ---------   -----------------------
Total return G_t                    None      Very high   REINFORCE
Advantage A(s,a) = Q(s,a) - V(s)    None      Lower       Requires value estimate
TD residual r + γV(s') - V(s)       Some      Low         One-step actor-critic
GAE (λ-weighted)                    Tunable   Tunable     Practical workhorse

The advantage function A(s, a) tells you how much better action a is relative to the average action from state s. It is a signed signal: positive means this action was better than expected, negative means it was worse. Using A instead of raw returns reduces variance substantially, because the baseline V(s) absorbs the part of the return that is predictable from the state alone.
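The effect of the baseline can be checked numerically. The sketch below is an illustration under assumed numbers, not part of any specific algorithm: a single-state problem with two equally likely actions whose mean rewards are 10 and 11 (hypothetical values), and a helper named grad_samples (also hypothetical) that collects per-sample gradient estimates for the first action's logit. Subtracting the state value 10.5 leaves the mean of the estimate essentially unchanged while shrinking its variance.

```python
import random
import statistics

def grad_samples(baseline, n=20000, seed=0):
    """Per-sample policy gradient estimates for logit 0 of a 2-action softmax policy."""
    rng = random.Random(seed)
    p0 = 0.5                      # probability of action 0 (both logits are zero)
    mean_reward = (10.0, 11.0)    # assumed mean reward of each action
    samples = []
    for _ in range(n):
        a = 0 if rng.random() < p0 else 1
        reward = mean_reward[a] + rng.gauss(0.0, 1.0)
        score = (1.0 if a == 0 else 0.0) - p0   # d/d(logit_0) of log pi(a)
        samples.append(score * (reward - baseline))
    return samples

raw = grad_samples(baseline=0.0)        # Psi = raw return
centred = grad_samples(baseline=10.5)   # Psi = return minus V(s)

print("mean:    ", statistics.mean(raw), statistics.mean(centred))
print("variance:", statistics.variance(raw), statistics.variance(centred))
```

Both runs share a seed, so they see the same actions and reward noise; only the baseline differs.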

The critic's job is to estimate V(s) (or Q(s, a), depending on architecture). The actor uses that estimate to compute advantages and update the policy.

The Two-Network Architecture

An actor-critic system trains two function approximators simultaneously:

Actor:   π_θ(a | s)        -- outputs action probabilities or distribution
Critic:  V_φ(s)  or Q_φ(s, a)  -- outputs scalar value estimate

Critic update. At each time step (or mini-batch), compute the TD target and regress the critic:

δ_t  = r_t + γ · V_φ(s_{t+1}) - V_φ(s_t)   # TD error = one-step advantage estimate
L_critic = δ_t²                              # target r_t + γ · V_φ(s_{t+1}) is treated as a constant

Actor update. Use the TD error (or a more elaborate advantage estimate) to scale the policy gradient:

L_actor = -log π_θ(a_t | s_t) · δ_t.detach()

The .detach() is load-bearing: you do not want gradients flowing from the advantage estimate back through the critic during the actor update. The critic is a fixed estimator for this step, not a co-optimised participant.
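To make the two updates concrete without any autodiff library, here is a minimal tabular sketch of one-step actor-critic. Everything about the setup is an assumption made for illustration: a four-state chain where moving right from state 2 reaches the goal and pays reward 1, hand-picked step sizes, and the helper names softmax, step and train. Because the gradients are written out by hand, the TD error enters the actor update as a plain number, which is the tabular equivalent of detaching it.

```python
import math
import random

N_STATES, GOAL = 4, 3                       # states 0..3; reaching state 3 ends the episode
GAMMA, ALPHA_PI, ALPHA_V = 0.9, 0.5, 0.1    # assumed discount and step sizes

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Action 0 moves left (clamped at state 0), action 1 moves right."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def train(episodes=3000, max_steps=100, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: per-state action logits
    value = [0.0] * N_STATES                        # critic: V(s) table
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1
            s2, r, done = step(s, a)
            # TD error; a terminal next state contributes no bootstrap value.
            delta = r + (0.0 if done else GAMMA * value[s2]) - value[s]
            # Critic: semi-gradient step on delta^2 with the target held fixed.
            value[s] += ALPHA_V * delta
            # Actor: gradient of log softmax is one_hot(a) - probs, scaled by delta.
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += ALPHA_PI * delta * grad_log
            if done:
                break
            s = s2
    return theta, value

theta, value = train()
right_probs = [softmax(theta[s])[1] for s in range(GOAL)]
print("P(right | s):", right_probs)
print("V(s):", value)
```

After training, the probability of moving right should dominate in every non-terminal state, and the value estimates should increase toward the goal.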

In practice, when sharing a backbone (common in Atari-style networks), the two losses are combined:

L_total = L_actor + c_v · L_critic - c_e · H[π_θ]

where H[π_θ] is an entropy bonus that discourages premature policy collapse and c_v, c_e are tuning coefficients (typically 0.5 and 0.01 respectively in A3C/A2C).
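A minimal PyTorch sketch of the combined loss for a shared backbone follows. It is an illustration under assumptions rather than a reference implementation: the class name ActorCritic, the function name combined_loss, the layer sizes and the random batch are all hypothetical, and the policy is assumed to be categorical. The TD target is computed under torch.no_grad() so the critic regresses toward a fixed target, and the TD error is detached before it scales the log-probabilities.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class ActorCritic(nn.Module):
    """Hypothetical shared-backbone network with a policy head and a value head."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.backbone(obs)
        dist = Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value


def combined_loss(model, obs, actions, rewards, next_obs, dones,
                  gamma=0.99, c_v=0.5, c_e=0.01):
    dist, values = model(obs)
    with torch.no_grad():  # the TD target is a constant for this step
        _, next_values = model(next_obs)
        target = rewards + gamma * (1.0 - dones) * next_values
    delta = target - values                      # TD error, one per transition
    # Actor term: delta is detached so no policy gradient reaches the critic head.
    actor_loss = -(dist.log_prob(actions) * delta.detach()).mean()
    critic_loss = delta.pow(2).mean()
    entropy = dist.entropy().mean()
    return actor_loss + c_v * critic_loss - c_e * entropy


# Usage on a random batch (placeholder data, not a real environment).
torch.manual_seed(0)
model = ActorCritic(obs_dim=4, n_actions=2)
obs, next_obs = torch.randn(8, 4), torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
rewards = torch.randn(8)
dones = torch.zeros(8)
loss = combined_loss(model, obs, actions, rewards, next_obs, dones)
loss.backward()
```

With a shared backbone, the backbone parameters receive gradients from all three terms, which is why the relative sizes of c_v and c_e matter.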

Advantage Estimation in Practice: GAE

One-step TD errors are low-variance but biased because a bootstrapped value estimate carries whatever error the critic currently has. The other extreme, Monte Carlo returns, are unbiased but high-variance. Generalised Advantage Estimation (GAE, Schulman et al. 2016) interpolates with an exponential decay parameter λ:

A^GAE_t = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}

At λ=0, GAE collapses to the one-step TD error. At λ=1, it collapses to Monte Carlo returns minus the baseline. In practice, λ ∈ [0.9, 0.97] and γ ∈ [0.99, 0.999] are the usual operating range for continuous control tasks. PPO, which builds directly on actor-critic principles, inherits GAE as its default advantage estimator.
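On a finite rollout the infinite sum is usually computed with a backward recursion, A_t = δ_t + γλ · A_{t+1}. The sketch below assumes a single rollout stored in plain Python lists, a bootstrap value last_value for the state after the final step, and per-step done flags that cut the recursion at episode boundaries; the function name compute_gae and its signature are hypothetical.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE over one rollout.

    rewards[t], values[t] and dones[t] describe step t; last_value is the
    critic's estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, True]

td_only = compute_gae(rewards, values, 0.0, dones, gamma=0.9, lam=0.0)
monte_carlo = compute_gae(rewards, values, 0.0, dones, gamma=0.9, lam=1.0)
print(td_only)       # one-step TD errors
print(monte_carlo)   # discounted returns minus V(s_t)
```

The two calls reproduce the limiting cases described above: with lam=0.0 each entry is the one-step TD error, and with lam=1.0 each entry is the discounted return from that step minus the value estimate.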

Asynchronous and Parallel Variants

The A3C paper (Mnih et al., ICML 2016) scaled actor-critic training by running multiple independent workers, each with its own environment copy, asynchronously sending gradient updates to a shared model. Two practical effects:

Decorrelated experience: workers explore different parts of the state space simultaneously, reducing the correlation problem that plagues single-environment on-policy methods.
No replay buffer required: the gradient diversity from parallel workers serves the same stabilising role that experience replay serves in DQN.

A2C (the synchronous variant) waits for all workers to finish a rollout before updating, then averages gradients. This is easier to reason about and often matches A3C's performance on GPU hardware where synchronous parallelism is cheaper.

The modern dominant form is PPO, which adds a clipped surrogate objective on top of the actor-critic framework to prevent destructively large policy updates:

L_CLIP = E_t [ min( r_t(θ) · A_t,  clip(r_t(θ), 1-ε, 1+ε) · A_t ) ]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio. The clip removes the incentive to push r_t(θ) outside [1-ε, 1+ε], which approximates a trust region without the second-order overhead of TRPO.
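The clipped objective is easy to evaluate by hand on a few samples. This is a small pure-Python sketch, assuming log-probabilities and advantages are given as lists; the function name ppo_clip_objective and the default ε = 0.2 are choices made for this illustration.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate over a batch (a quantity to maximise)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)               # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)        # pessimistic of the two
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the min keeps the unclipped, worse value.
uncapped_penalty = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
# Ratio 0.5 with a negative advantage: capped at (1 - eps) * A.
capped_penalty = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
print(capped_gain, uncapped_penalty, capped_penalty)
```

The asymmetry is deliberate: the objective stops rewarding moves that have already gone far enough in a helpful direction, but it never hides the cost of a move that made things worse.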

When It Falls Down

Critic lag corrupts the actor. The actor is only as good as the critic's advantage estimates. Early in training, the critic is poorly calibrated. Gradient updates based on a poor critic can point the actor in consistently wrong directions, creating a feedback loop: a bad actor generates uninformative data, which keeps the critic poor. In practice this typically manifests as instability early in training, often within the first few hundred thousand environment steps, depending on the task.

Shared backbone instability. Sharing layers between actor and critic saves parameters but creates conflicting gradient signals. The value regression loss pushes representations toward predicting scalar returns; the policy loss pushes toward action discrimination. With poorly tuned c_v, the critic loss can dominate and destroy the policy representation (or vice versa). Separate networks avoid this at the cost of more parameters and more compute per update.

On-policy sample inefficiency. Actor-critic methods in their standard form are on-policy: once you update the policy, the rollouts used to compute that update become stale and must be discarded. This makes them expensive in environments where each environment step is costly (real-world robotics, slow simulators). Off-policy variants like Soft Actor-Critic (SAC) use a replay buffer to reuse experience. SAC sidesteps importance weights by training a Q-function critic with off-policy Bellman backups, at the cost of extra machinery such as twin critics and target networks; methods that reuse policy-gradient samples directly (ACER, IMPALA) instead need careful importance-weight correction.

Entropy collapse and premature convergence. Without an entropy bonus, the actor tends to become deterministic prematurely, converging to a locally good but globally suboptimal policy. The entropy coefficient needs tuning per environment; too large and the policy never commits to good actions, too small and it collapses early.

Continuous action spaces and Gaussian policies. When the action space is continuous, the actor typically outputs mean and variance of a Gaussian. If the variance collapses to near zero before a good mean is found, the policy is stuck. Clipping the minimum standard deviation is a common patch, not a principled fix.
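The standard-deviation floor mentioned above can be sketched for a one-dimensional Gaussian policy. The function name gaussian_log_prob, the log-std parameterisation and the floor value 0.05 are assumptions for illustration; real implementations differ in where and how they clamp.

```python
import math

def gaussian_log_prob(action, mean, log_std, min_std=0.05):
    """Log-density of a 1-D Gaussian policy with a floor on the standard deviation."""
    std = max(math.exp(log_std), min_std)   # the patch: never let std fall below min_std
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


# A collapsed log_std of -20 would give std around 2e-9; the floor keeps it at 0.05.
floored = gaussian_log_prob(action=0.0, mean=0.0, log_std=-20.0)
# With log_std = 0 (std = 1) the floor is inactive.
unit = gaussian_log_prob(action=0.0, mean=0.0, log_std=0.0)
print(floored, unit)
```

A hard floor like this keeps log-probabilities finite, but once the floor is active the log-std parameter stops receiving gradient, which is one reason the text calls it a patch rather than a fix.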
