Temporal-Difference Learning

Every time you revise an estimate based on a newer estimate - before seeing the final outcome - you are doing temporal-difference learning. The idea is deceptively simple and yet it powers virtually every deep RL system in use today, from Atari-playing DQN to game-playing AlphaZero variants.

The bootstrapping problem Monte Carlo cannot solve

In Monte Carlo (MC) policy evaluation, you wait until the episode terminates, collect the total discounted return G_t, and then move the value estimate V(s_t) toward that return:

V(s_t) ← V(s_t) + α [G_t - V(s_t)]

This is unbiased: G_t is the true empirical return. But it requires complete episodes. In continuing tasks (robotics, trading, dialogue) there are no episodes, or they are so long that waiting for them makes learning prohibitively slow.

Dynamic programming (DP) sidesteps this by bootstrapping: it updates V(s) using the current estimate V(s') of the successor state. The Bellman consistency equation says that if V is correct everywhere, then:

V(s) = E[r + γ V(s')]

DP iterates this to convergence. The catch: it requires a complete model of the environment (transition probabilities and reward function). Most real problems do not have one.

TD learning keeps the bootstrapping from DP and the model-free sampling from MC. It is, in that sense, the synthesis of both.

TD(0): the one-step update

The simplest TD method, TD(0), produces the following update after each observed transition (s_t, a_t, r_{t+1}, s_{t+1}):

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]

The quantity in brackets is the TD error (often written δ_t):

δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t)

Read it as: "I predicted V(s_t), but the transition just gave me evidence that the true value is closer to r_{t+1} + γ V(s_{t+1}). Adjust by α times the discrepancy."

This is a one-step lookahead. The target r_{t+1} + γ V(s_{t+1}) is called the TD target. Because it uses V(s_{t+1}) rather than the true V*(s_{t+1}), it is biased - but the bias shrinks as the estimates improve. Crucially, you can apply it after every single step, not after every episode.

Pseudo-trace of a gridworld update

# State space: 3x3 grid. Terminal state T at (2,2). γ=0.9, α=0.1
# Initial estimates: V(s) = 0 everywhere.

Step 1: s=(0,0) → a=right → r=0, s'=(0,1)
  δ = 0 + 0.9 * 0 - 0 = 0
  V(0,0) unchanged

Step 2: s=(0,1) → a=right → r=0, s'=(0,2)
  δ = 0  (all zeros)
  ...

Step k: s=(1,2) → a=down → r=+1, s'=T (terminal, V(T)=0)
  δ = 1 + 0.9 * 0 - 0 = 1.0
  V(1,2) ← 0 + 0.1 * 1.0 = 0.1   # first signal propagates back

Next episode: signal from V(1,2)=0.1 will propagate to V(0,2), etc.

The reward signal propagates backward one step per visit. Contrast with MC, where the entire return G_t = 1 would be assigned to every state visited in one episode.

Q-learning: TD applied to action-value functions

For control (choosing actions, not just evaluating a policy), we need Q(s, a) rather than V(s). The off-policy variant is Q-learning (Watkins, 1989):

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

The key word is max: the target uses the greedy action in s_{t+1}, regardless of what the behaviour policy actually did. This makes Q-learning off-policy - you can learn the optimal value function while following an exploratory (e.g., epsilon-greedy) policy. That separation between behaviour and target policies is what made Q-learning so practically powerful.

SARSA is the on-policy counterpart: it uses the actual next action a_{t+1} drawn from the current policy:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

SARSA is more conservative in stochastic or dangerous environments because it accounts for the exploration the policy will actually perform, not the best-case greedy action.

Method	Target	Policy	Bias vs. variance
MC	G_t (full return)	On-policy	Low bias, high variance
TD(0)	r + γ V(s')	On or off	Higher bias, lower variance
Q-learning	r + γ max_a Q(s', a)	Off-policy	Higher bias, lower variance
SARSA	r + γ Q(s', a')	On-policy	Higher bias, lower variance

n-step returns and the TD-MC spectrum

TD(0) and MC are the two ends of a spectrum. n-step TD uses returns that look n steps ahead before bootstrapping:

G_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^(n-1) r_{t+n} + γ^n V(s_{t+n})

Setting n=1 recovers TD(0); n=∞ recovers MC. Intermediate values often outperform both extremes in practice because they balance the low variance of bootstrapping with the lower bias of longer rollouts.

TD(λ) takes this further by using an exponentially weighted average of all n-step returns:

G_t^λ = (1 - λ) Σ_{n=1}^∞ λ^{n-1} G_t^(n)

The eligibility trace implementation makes this computationally equivalent to a single backward pass rather than storing all future rewards. λ=0 gives TD(0); λ=1 gives MC. Values around 0.7-0.9 frequently work well, though this is task-dependent.

The DQN paper (Mnih et al., 2013) used TD(0) Q-learning with a deep convolutional network as the function approximator, experience replay to decorrelate updates, and a separate target network to stabilise the bootstrap target. That combination learned to play seven Atari games at or above human level from raw pixels alone.

When it falls down

Deadly triad with function approximation. When you combine (1) off-policy learning, (2) bootstrapping, and (3) function approximation, you get what Sutton calls the "deadly triad." Q-learning with neural networks can diverge rather than converge. DQN mitigates this with experience replay and target networks, not eliminate it. When your function approximator has limited capacity or the feature representation is poor, divergence is a real risk.

Maximisation bias in Q-learning. The max operator in Q-learning systematically overestimates action values in stochastic environments. If Q(s', a) estimates are noisy, max_a Q(s', a) is biased upward. Double DQN (van Hasselt et al., 2015) addresses this by decoupling action selection from action evaluation using two separate networks, but the underlying bias source remains something to monitor.

Slow credit assignment in sparse-reward tasks. TD(0) propagates value information one step per visit. In a maze where reward appears only at the exit, it takes O(maze diameter) visits to every state before values near the start become accurate. Reward shaping, hindsight experience replay, or larger n in n-step TD can help, but there is no free lunch.

Partial observability breaks the Markov assumption. TD learning (like all value-function methods in standard form) assumes the state is Markovian: V(s) uniquely determines expected future returns. In partially observable environments (POMDPs), two observations that look identical may have different true values. Recurrent networks or belief-state representations are needed, but they reintroduce training instability.

On-policy methods can be sample-inefficient. SARSA and actor-critic methods with TD(0) critics must discard experience collected under old policies, since the bootstrap target depends on the current value function. This makes them expensive in environments where data collection is costly (real robots, clinical trials).