Policy Methods vs Value Methods

Two researchers sit down to solve the same Atari game. One trains a network to output action probabilities and nudges them toward higher returns. The other trains a network to estimate the value of every state-action pair and picks greedily from those estimates. Both agents can eventually master the game, but they will behave differently as the data distribution shifts, fail in different ways, and scale differently with action-space size. The choice between these two philosophies is not cosmetic; it affects architecture, hyperparameter sensitivity, convergence guarantees, and engineering complexity.

What Each Camp Is Actually Doing

Value-based methods parameterise a value function - either a state value V(s) or a state-action value Q(s, a) - and derive a policy implicitly by acting greedily (or epsilon-greedily) with respect to those estimates. The canonical representative is DQN (Mnih et al., 2013), which fits a neural network to the Bellman target:

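y_t = r_t + γ · max_{a'} Q(s_{t+1}, a'; θ⁻)

where γ is the discount factor and θ⁻ denotes the parameters of a periodically updated target network.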

The policy is never explicitly represented: it is simply argmax_a Q(s, a). The network is trained by minimising a regression loss, not by directly optimising return.
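
As a minimal sketch of the one-step Bellman target and the implicit greedy policy, the snippet below uses plain Python lists in place of a network's outputs. The function names (`greedy_action`, `bellman_target`, `squared_td_loss`) and the discount values are illustrative assumptions, not DQN's actual implementation.

```python
from typing import Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Implicit policy: index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def bellman_target(reward: float, next_q_values: Sequence[float],
                   gamma: float = 0.99, done: bool = False) -> float:
    """One-step target r + gamma * max_a' Q(s', a'); no bootstrapping on terminal steps."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_sa: float, target: float) -> float:
    """Regression loss between the current estimate Q(s, a) and the (fixed) target."""
    return 0.5 * (target - q_sa) ** 2


# Hypothetical Q-values for the next state, standing in for a network output.
q_next = [1.0, 3.0, 2.0]
target = bellman_target(reward=1.0, next_q_values=q_next, gamma=0.9)
loss = squared_td_loss(q_sa=2.0, target=target)
print(greedy_action(q_next), target, loss)
```

Note that the loss only measures how well Q(s, a) matches the target; return never appears in it directly.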

Policy-based methods maintain an explicit parameterisation π_θ(a | s) and optimise the expected return J(θ) directly:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) · G_t ]

where G_t is a return estimate. REINFORCE (Williams, 1992) is the simplest form. PPO (Schulman et al., 2017) is the most widely deployed variant today, using a clipped surrogate objective to keep gradient updates stable.
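
To make the score-function update concrete, here is a hedged sketch of REINFORCE on a toy two-armed bandit with a softmax policy. The environment (`mean_reward`), the learning rate, and all function names are assumptions chosen for illustration; each episode is a single step, so G_t is just the sampled reward.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits: List[float], action: int) -> List[float]:
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def reinforce_bandit(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> List[float]:
    """Run REINFORCE on a hypothetical bandit where arm 1 pays more on average."""
    rng = random.Random(seed)
    mean_reward = [0.2, 0.8]  # assumed toy environment
    logits = [0.0, 0.0]       # policy parameters theta
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices([0, 1], weights=probs)[0]
        g = mean_reward[action] + rng.gauss(0.0, 0.1)  # noisy return G_t
        grad = grad_log_softmax(logits, action)
        # Gradient ascent: theta <- theta + lr * G_t * grad log pi(a | s)
        logits = [th + lr * g * d for th, d in zip(logits, grad)]
    return softmax(logits)


final_probs = reinforce_bandit()
print(final_probs)
```

No value function appears anywhere: the policy parameters are pushed directly toward actions that produced higher returns.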

The practical implication of this split is immediate. Value methods produce deterministic (or near-deterministic) policies at test time; policy methods can represent stochastic policies, which are needed in partially observed or mixed-strategy environments.

Action Spaces and Scalability

Value methods depend on computing argmax_a Q(s, a). When actions are discrete and few (say, 18 Atari buttons), this is trivial. When actions are continuous - joint torques on a robot, bid prices in an auction, audio sample amplitudes - the argmax is intractable unless you discretise (losing resolution) or use a separate maximisation step such as the actor network in DDPG (Lillicrap et al., 2015).

Policy methods handle continuous actions naturally: the policy network outputs a mean and variance for a Gaussian, and you sample or take the mode at deployment. There is no maximisation over action space.
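
A one-dimensional Gaussian policy head can be sketched as follows. The `mean` and `std` arguments stand in for network outputs, and the function names are hypothetical; this is an illustration of the idea, not any particular library's API.

```python
import math
import random


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log density of a 1-D Gaussian policy at the given action."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def act(mean: float, std: float, rng: random.Random,
        deterministic: bool = False) -> float:
    """Sample during training; return the mode (the mean) at deployment."""
    return mean if deterministic else rng.gauss(mean, std)


rng = random.Random(0)
train_action = act(mean=0.3, std=0.5, rng=rng)                       # stochastic
deploy_action = act(mean=0.3, std=0.5, rng=rng, deterministic=True)  # mode
print(train_action, deploy_action, gaussian_log_prob(train_action, 0.3, 0.5))
```

The log-probability is what a policy gradient update differentiates; at no point does the code search over the action space.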

Property	Value-based (e.g. DQN)	Policy-based (e.g. PPO)
Action space	Discrete, small	Discrete or continuous
Policy representation	Implicit, deterministic	Explicit, can be stochastic
Update target	Regression on Bellman target	Gradient ascent on return
Sample efficiency	High (off-policy replay)	Moderate (on-policy)
Stability	Sensitive to target staleness	Sensitive to step size
Variance	Low (value estimates)	High (policy gradient)

Off-Policy vs On-Policy and Sample Efficiency

One of the most consequential downstream differences is the on-policy / off-policy distinction.

Value methods can be trained off-policy: you store transitions in a replay buffer and learn from data collected under any past behaviour policy. DQN's experience replay and target network are engineering solutions to the instability this introduces, but the core benefit - data reuse - is real. You can learn from a million stored transitions even if the current policy has drifted far from the collector.
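
The data-reuse idea can be illustrated with a minimal replay buffer. This is a sketch under simple assumptions (integer states and actions, uniform sampling, a fixed capacity); the `ReplayBuffer` class and `Transition` alias are hypothetical names, not DQN's implementation.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]


class ReplayBuffer:
    """Fixed-capacity store of past transitions with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # deque with maxlen silently evicts the oldest transition when full.
        self._data: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        """Sample without replacement, regardless of which policy collected the data."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, 0, 1.0, step + 1, False))
batch = buffer.sample(2)
print(len(buffer), batch)
```

Nothing in `sample` checks which policy produced a transition, which is exactly what makes the learner off-policy.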

Most policy gradient methods are inherently on-policy: the gradient estimator ∇_θ log π_θ(a | s) · G_t is unbiased only when (s, a) was sampled from the current π_θ. Reusing old data requires importance-sampling corrections, whose variance grows rapidly as the policies diverge. PPO takes multiple gradient steps on the same batch by clipping the probability ratio r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t), which limits how far the policy can move, but it still discards the data after a small number of epochs.
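
The per-sample clipped surrogate from the PPO paper, min(r · A, clip(r, 1 - ε, 1 + ε) · A), can be sketched in a few lines. The function name `ppo_clip_term` and the default ε = 0.2 are assumptions for illustration; log-probabilities are passed in rather than computed by a network.

```python
import math


def ppo_clip_term(logp_new: float, logp_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)            # r_t(theta)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(r, 1 - eps, 1 + eps)
    # Taking the min makes the objective a pessimistic bound on the unclipped one.
    return min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clip_term(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage: the penalty is not capped.
uncapped = ppo_clip_term(math.log(1.5), 0.0, advantage=-1.0)
print(capped, uncapped)
```

The asymmetry is the point of the design: the objective stops rewarding moves that push the ratio further in the favourable direction, but it keeps penalising moves that made things worse.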

This makes value methods considerably more sample-efficient in environments where simulator throughput is the bottleneck. On-policy policy methods compensate with updates that are often more stable and with the ability to represent stochastic policies, which matters in sparse-reward or adversarial settings.

Actor-Critic: Marrying Both

Actor-critic architectures (Mnih et al., 2016, A3C; and the broader family including SAC, TD3) hybridise the two philosophies in an attempt to get the best of both. The actor is a policy π_θ that is updated via policy gradients. The critic is a value function V_φ or Q_φ that is updated via temporal-difference learning and provides a low-variance baseline or advantage estimate for the actor update:

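∇_θ J(θ) ≈ E[ ∇_θ log π_θ(a_t | s_t) · A_t ]

with, for example, the one-step advantage estimate A_t = r_t + γ V_φ(s_{t+1}) - V_φ(s_t).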

Without the critic, pure policy gradients suffer from high variance because G_t is a full Monte-Carlo return. Without the actor, pure value methods cannot represent stochastic policies or handle continuous action spaces gracefully. The critic's job is to reduce gradient variance; the actor's job is to represent a flexible, differentiable policy. They depend on each other's accuracy.
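
A single actor-critic update can be sketched for a tabular critic and a softmax actor. This is a minimal illustration that assumes a one-step TD advantage; the function `actor_critic_step`, the learning rates, and the scalar value arguments are all hypothetical stand-ins for network outputs and optimiser steps.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(logits: List[float], v_s: float, v_next: float,
                      action: int, reward: float, gamma: float = 0.99,
                      actor_lr: float = 0.1, critic_lr: float = 0.1,
                      done: bool = False) -> Tuple[List[float], float, float]:
    """Return (new actor logits, new critic value for s, advantage)."""
    # Critic: bootstrapped TD target and the resulting one-step advantage.
    target = reward + (0.0 if done else gamma * v_next)
    advantage = target - v_s
    # Actor: policy gradient step weighted by the advantage instead of G_t.
    probs = softmax(logits)
    new_logits = [
        th + actor_lr * advantage * ((1.0 if a == action else 0.0) - p)
        for a, (th, p) in enumerate(zip(logits, probs))
    ]
    # Critic: move V(s) toward the TD target.
    new_v_s = v_s + critic_lr * advantage
    return new_logits, new_v_s, advantage


new_logits, new_v, adv = actor_critic_step(
    logits=[0.0, 0.0], v_s=0.5, v_next=2.0, action=1, reward=1.0, gamma=0.9)
print(new_logits, new_v, adv)
```

The same advantage drives both updates, which is why an inaccurate critic degrades the actor and vice versa.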

A subtle point: the critic is still a value method under the hood. All the pathologies of bootstrapped value estimation still apply to it. Actor-critic methods do not eliminate the tension between the two camps; they inherit it from both sides simultaneously.

When It Falls Down

Value methods break when the value function is non-smooth or multimodal. If two actions have nearly identical Q-values, small estimation errors can flip the greedy policy completely, causing oscillations that may never settle. The deadly triad (function approximation + bootstrapping + off-policy data) can cause Q-values to diverge rather than converge, a failure mode with no clean fix beyond mitigations and careful tuning (target networks, clipped double Q, prioritised replay).
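
The sensitivity of the greedy policy to small errors is easy to see numerically. The sketch below adds Gaussian noise to a pair of assumed "true" Q-values and counts how often the argmax changes; the values, noise level, and function names are illustrative assumptions.

```python
import random
from typing import List


def greedy_action(q_values: List[float]) -> int:
    return max(range(len(q_values)), key=lambda a: q_values[a])


def flip_rate(true_q: List[float], noise_std: float,
              trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of trials in which noisy estimates pick a non-greedy action."""
    rng = random.Random(seed)
    best = greedy_action(true_q)
    flips = 0
    for _ in range(trials):
        noisy = [q + rng.gauss(0.0, noise_std) for q in true_q]
        flips += greedy_action(noisy) != best
    return flips / trials


# Nearly tied actions: the greedy choice flips in a large share of trials.
close = flip_rate([1.00, 1.01], noise_std=0.05)
# Well-separated actions: the same noise essentially never changes the choice.
separated = flip_rate([1.00, 2.00], noise_std=0.05)
print(close, separated)
```

A stochastic policy would change its action probabilities only slightly under the same perturbation, whereas the argmax changes discontinuously.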

Policy gradient methods break under large or ill-conditioned policy updates. A step size that works for the first 10 iterations can catastrophically collapse performance when the policy moves to a new region of parameter space. TRPO and PPO were designed specifically to mitigate this, but they do not eliminate it; they trade step-size brittleness for sensitivity to the clip hyperparameter ε. In sparse-reward environments, policy gradients also suffer from the credit assignment problem: a long trajectory with a single reward at the end produces an extremely noisy gradient.

Stochastic vs deterministic environments expose each camp's assumptions. Value methods assume that a deterministic optimal policy exists, which holds in fully observed MDPs: acting greedily with respect to the optimal Q-function is optimal. In partially observed settings or multi-agent games where mixed strategies are optimal, a deterministic policy learned from Q-values can be exploited. Policy methods can represent the mixed strategy but will need enough entropy regularisation to find it.
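
Rock-paper-scissors gives a small worked example of why a deterministic policy is exploitable. The payoff matrix and the `worst_case_value` helper below are assumptions for illustration: the function computes the agent's expected payoff against an opponent who best-responds to the agent's (known) policy.

```python
from typing import List

# Agent's payoff: rows are the agent's action, columns the opponent's.
# Order of actions: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],   # rock:     ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def worst_case_value(policy: List[float]) -> float:
    """Agent's expected payoff when the opponent best-responds to `policy`."""
    n_opponent_actions = len(PAYOFF[0])
    return min(
        sum(p * PAYOFF[a][b] for a, p in enumerate(policy))
        for b in range(n_opponent_actions)
    )


always_rock = [1.0, 0.0, 0.0]        # deterministic, e.g. an argmax over Q-values
uniform = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # the mixed equilibrium strategy
print(worst_case_value(always_rock), worst_case_value(uniform))
```

Any deterministic choice loses every round to a best-responding opponent, while the uniform mixed strategy cannot be exploited at all.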

Continuous action spaces with value methods require a separate maximisation step, adding its own instability. DDPG's actor can become stuck in local optima of the Q surface, especially early in training when Q estimates are poor. The actor-critic in SAC adds entropy regularisation to escape this, but introduces a new temperature hyperparameter.

Neither camp has a universally dominant algorithm. The choice depends on action space type, environment stochasticity, simulator throughput, and how much engineering budget you have for stabilisation tricks. Many practical RL systems use actor-critic variants, an implicit acknowledgement that both sides are necessary.
