On-Policy vs Off-Policy Learning

Q-learning and SARSA solve the same problem. Run them both on a cliff-walking grid with epsilon-greedy exploration. After enough training, SARSA takes the safe inland path; Q-learning learns the shorter path along the cliff edge and falls into the pit repeatedly during training, because every exploratory step next to the edge is fatal. With a fixed epsilon the two settle on different paths; they agree on the same optimal policy only once exploration is annealed to zero. The difference is a single choice: which policy generated the data you are learning from?

The core distinction

Every RL algorithm trains a target policy - the policy you ultimately care about. Some algorithms also implicitly or explicitly maintain a behaviour policy - the policy that actually interacts with the environment to collect experience.

On-policy: behaviour policy = target policy. You learn about the policy you are currently executing.
Off-policy: behaviour policy != target policy. You collect data under one policy and improve a different one.

SARSA is on-policy. At every step it updates using the action the agent actually took under the current (epsilon-greedy) policy:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$$

a' is sampled from the current policy, so the Bellman target is consistent with the distribution that generated the transition.

Q-learning is off-policy. The update uses max Q(s', *) regardless of what the epsilon-greedy policy would have chosen:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The greedy max is the target policy; the epsilon-greedy collector is the behaviour policy. Their objectives have diverged.
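The contrast between the two bootstrapping rules can be made concrete with a small tabular sketch. The state names, action names, and Q-values below are hypothetical and chosen only for illustration; the point is that the same transition produces two different targets depending on whether the sampled next action or the greedy next action is used.

```python
from collections import defaultdict


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually sampled in s_next."""
    return r + gamma * Q[(s_next, a_next)]


def q_learning_target(Q, r, s_next, actions, gamma):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)


def td_update(Q, s, a, target, alpha):
    """Shared tabular step: move Q(s, a) a fraction alpha towards the target."""
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical values for a state with two actions.
actions = ["safe", "risky"]
Q = defaultdict(float)
Q[("s1", "safe")] = 1.0
Q[("s1", "risky")] = 5.0

# Suppose the epsilon-greedy behaviour policy sampled "safe" in s1.
sarsa_y = sarsa_target(Q, r=-1.0, s_next="s1", a_next="safe", gamma=0.9)
qlearn_y = q_learning_target(Q, r=-1.0, s_next="s1", actions=actions, gamma=0.9)

# SARSA bootstraps from Q(s1, safe); Q-learning from max over actions.
td_update(Q, "s0", "right", sarsa_y, alpha=0.5)
```

Only the target differs; the update step itself is shared, which is why the two algorithms are so often presented side by side.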

Why off-policy is so appealing

Off-policy learning buys you three things that on-policy cannot easily provide.

Sample reuse. If you collect a transition (s, a, r, s'), a strictly on-policy algorithm can use it only once: after one gradient step the current policy has shifted, so that transition is "stale" with respect to the new policy. An off-policy algorithm can store transitions in a replay buffer and sample them many times. DQN (Mnih et al., 2013) showed that experience replay made deep Q-learning on Atari from raw pixels workable, and the later Nature version (Mnih et al., 2015) added a target network for further stability. The gain in sample efficiency matters most when environment interaction is expensive.

Behaviour diversity. You can learn from human demonstrations, scripted explorers, old checkpoints, or even random rollouts. The behaviour policy simply has to cover the state-action pairs the target policy cares about (a condition called coverage). This is invaluable in robotics: you collect robot teleop data with a human operator (behaviour) and train a learned policy (target) from it.

Parallelism. In IMPALA (Espeholt et al., 2018), hundreds of actor processes asynchronously push trajectories into a shared queue while a central learner updates the target policy. By the time a batch of trajectories reaches the learner, the learner's policy has usually moved several updates past the version the actors used to generate it. V-trace importance-weighted corrections compensate for this lag, but the architecture is fundamentally off-policy.

Property	On-Policy	Off-Policy
Data freshness requirement	Must be current policy	Any covering behaviour
Sample efficiency	Lower (single use)	Higher (replay)
Convergence guarantees	Stronger (tabular)	Requires coverage + corrections
Typical algorithms	SARSA, PPO, A2C	Q-learning, DQN, DDPG, SAC
Exploration entanglement	Yes - explore to learn	Separable
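The sample reuse described above comes down to a small data structure. The following is a minimal sketch of a uniform replay buffer, not the implementation from any particular paper; the class name `ReplayBuffer`, its method names, and the transition layout are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) silently evicts the oldest item when full.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self._data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Sampling without replacement; the same transition can still be
        # drawn again in later calls, which is the reuse on-policy lacks.
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(s=i, a=0, r=0.0, s_next=i + 1, done=False)

batch = buf.sample(2)  # drawn from the 3 most recent transitions only
```

The capacity bound is a design choice: it caps memory and also ages out transitions from much older policies.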

How importance sampling bridges the gap

When the behaviour policy b and the target policy pi differ, a raw Monte Carlo return estimated under b is biased as an estimate for pi. Importance sampling (IS) corrects this:

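$$\rho_t = \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}, \qquad v_\pi(s) = \mathbb{E}_b\left[ \left( \prod_{k=t}^{T-1} \rho_k \right) G_t \,\middle|\, s_t = s \right]$$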

The IS ratio rho_t re-weights each transition by how likely the target policy would have taken that action relative to the behaviour policy. For multi-step returns, the ratios multiply across the trajectory, causing variance explosion - the product of many ratios can be enormous or vanishingly small.
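A short sketch makes the variance problem tangible. It assumes the per-step action probabilities under both policies are already known for the trajectory; the function names and the example probabilities are hypothetical.

```python
import math


def is_ratios(target_probs, behaviour_probs):
    """Per-step ratios rho_t = pi(a_t | s_t) / b(a_t | s_t)."""
    return [p / b for p, b in zip(target_probs, behaviour_probs)]


def is_weighted_return(rewards, target_probs, behaviour_probs, gamma):
    """Ordinary importance-sampled return for one trajectory from b."""
    weight = math.prod(is_ratios(target_probs, behaviour_probs))
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    return weight * G


# Two-step example: the target policy likes both sampled actions.
estimate = is_weighted_return(
    rewards=[1.0, 1.0],
    target_probs=[0.9, 0.9],
    behaviour_probs=[0.5, 0.5],
    gamma=1.0,
)

# Over 20 steps the product of ratios swings across many orders of magnitude.
horizon = 20
weight_liked = math.prod(is_ratios([0.9] * horizon, [0.5] * horizon))
weight_disliked = math.prod(is_ratios([0.1] * horizon, [0.5] * horizon))
```

With a uniform two-action behaviour policy, a trajectory the target policy likes gets a very large weight and one it dislikes gets a weight close to zero, so a handful of trajectories dominate the estimate.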

Practical algorithms truncate or clip these ratios. PPO (Schulman et al., 2017) clips the surrogate objective ratio to [1 - epsilon, 1 + epsilon], which controls the policy update step size and implicitly handles the stale-data problem when doing multiple gradient epochs on a single batch. V-trace uses a per-step clipped IS ratio c_t = min(c_bar, rho_t) to bound variance while preserving the fixed-point guarantees that make the algorithm stable under asynchronous policy lag.
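Both corrections are one-liners once the ratio is available. The sketch below works on scalars for clarity; the function names are assumptions, and real implementations apply the same expressions elementwise over batches and average the result.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate: min of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def truncated_ratio(rho, c_bar=1.0):
    """V-trace style truncation: c_t = min(c_bar, rho_t)."""
    return min(c_bar, rho)


# Positive advantage, ratio above 1 + epsilon: the gain is capped.
capped_gain = ppo_clipped_term(ratio=1.5, advantage=2.0)

# Negative advantage, ratio below 1 - epsilon: the term is capped as well.
capped_loss = ppo_clipped_term(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above 1 + epsilon: the min keeps the unclipped,
# more pessimistic value, so the objective never hides a harmful move.
uncapped = ppo_clipped_term(ratio=1.5, advantage=-1.0)

# Truncation only bounds large ratios; small ones pass through unchanged.
big, small = truncated_ratio(3.0), truncated_ratio(0.4)
```

Taking the minimum makes the surrogate a pessimistic bound: clipping removes the incentive to push the ratio further in a helpful direction but never masks a change that makes things worse.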

The conceptual point: off-policy is not a free lunch. Every off-policy algorithm either restricts how far the behaviour and target policies can diverge, or pays an explicit variance cost to correct the distribution mismatch.

On-policy in practice: PPO and the data freshness trade-off

PPO is on-policy in the sense that its correctness proof assumes the rollout data matches the current policy. However, it does something pragmatic: it collects a batch of rollouts, then runs several mini-batch SGD epochs on that same batch. By the third or fourth epoch, the updated policy has drifted slightly from the one that generated the data. The clipped surrogate objective is the engineering patch that keeps this drift small enough that the on-policy assumption is not catastrophically violated.

This reveals that the on-policy vs off-policy distinction is not binary in practice. Real algorithms sit on a spectrum. PPO is "nearly on-policy" by design. DDPG (Lillicrap et al., 2015) is "fully off-policy" with a large replay buffer and slow-moving target networks. SAC is also replay-based and off-policy, but it collects data by sampling from the learned stochastic (typically Gaussian) policy itself, so the freshest data in the buffer always comes from the policy being trained while older entries come from earlier versions of it. The conceptual framework is a clean binary; production systems are always approximating it.

When it falls down

Off-policy + function approximation is theoretically fragile. The "deadly triad" (Sutton & Barto, 2018) describes how combining function approximation, bootstrapping, and off-policy updates can cause divergence. DQN mitigates this with periodically frozen target networks; without that stabiliser, naive off-policy deep Q-learning oscillates badly. The guarantees that exist in tabular settings - where Q-learning converges under standard step-size conditions - do not carry over to neural networks automatically.

Coverage violations cause silent failure. If the behaviour policy never visits a state-action pair that the target policy needs to learn about, the replay buffer contains no information about it. The agent will confidently extrapolate from nearby states, often disastrously. This is particularly acute in robotics manipulation tasks where a random exploration policy rarely produces the precise fingertip contacts that matter.

Stale data is insidiously harmful. A replay buffer populated early in training contains transitions from a poor, early policy. If the buffer does not age out old data, the learner may keep fitting transitions that are no longer representative of the states the current policy visits. Capping the buffer size so that old data is evicted, or re-weighting what gets sampled as in prioritised experience replay, mitigates this, but both add hyperparameter surface.

On-policy has its own pathology: sample hunger. PPO on a complex robot locomotion task may require tens of millions of environment steps. Every policy update discards the old rollouts. If simulation is cheap, this is acceptable. If environment interaction is expensive (real robots, patient trials, expensive compute), the data inefficiency can be prohibitive.
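The coverage failure described above is silent only if nobody looks for it. In a small tabular setting it can be checked directly, as in the sketch below. The function name `coverage_gaps`, the transition layout, and the toy states are all hypothetical, and a continuous-state problem would need a density or nearest-neighbour proxy instead of exact counts.

```python
from collections import Counter


def coverage_gaps(buffer, target_policy, states, min_count=1):
    """List (state, action) pairs the target policy would choose but the
    buffer contains fewer than min_count times."""
    counts = Counter((s, a) for (s, a, *_rest) in buffer)
    gaps = []
    for s in states:
        a = target_policy(s)
        if counts[(s, a)] < min_count:
            gaps.append((s, a))
    return gaps


# Toy buffer of (s, a, r, s_next) tuples from some behaviour policy.
buffer = [
    ("s0", "left", 0.0, "s1"),
    ("s1", "right", 1.0, "s2"),
]

# Hypothetical greedy target policy expressed as a lookup.
target_policy = {"s0": "right", "s1": "right"}.get

gaps = coverage_gaps(buffer, target_policy, states=["s0", "s1"])
```

Here the buffer never contains the action the target policy wants in `s0`, so any value learned for that pair is pure extrapolation.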

Further reading