Q-Learning

In 1992, Christopher Watkins proved that a simple update rule applied repeatedly to observed transitions would converge to the optimal action-value function in any finite Markov decision process, given enough exploration and a decaying learning rate. No environment model required. No planning required. Just experience and arithmetic.

That result is Q-learning, and it remains one of the cleanest ideas in all of machine learning.

The problem it solves

An agent navigates a world, taking actions, receiving rewards, and landing in new states. The goal is to discover a policy - a mapping from states to actions - that maximises total discounted future reward. The naive approach would require a model of transition probabilities and reward distributions, then solve for the policy analytically (dynamic programming). Q-learning sidesteps this entirely.

The key quantity is the action-value function Q(s, a): the expected total discounted return when starting in state s, taking action a, and thereafter following the optimal policy. If you had Q, picking the best action from any state would be trivial: just choose argmax_a Q*(s, a).

Q-learning learns an approximation to Q* directly from transitions.

The update rule

A single transition (s, a, r, s') tells you something about Q(s, a). The immediate reward r plus the discounted best-case value from the next state gives you a target:

target = r + γ · max_a' Q(s', a')

The update nudges Q(s, a) towards this target with a step size α:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]

The term in brackets is the temporal-difference (TD) error: the gap between what Q currently predicts and what the observed transition suggests. Minimise that gap repeatedly and, under standard conditions (the Robbins-Monro conditions: Σα = ∞, Σα² < ∞), Q converges to Q*.

A concrete trace through a small example clarifies the mechanics. Suppose a two-state gridworld where the agent can move Left or Right. Initialise Q(s, a) = 0 everywhere. After one transition (s=A, a=Right, r=1, s'=B):

target = 1 + 0.9 · max_a' Q(B, a')  =  1 + 0.9 · 0  =  1.0
Q(A, Right) ← 0 + 0.1 · (1.0 − 0)  =  0.1

On the next visit, max_a' Q(B, a') will be non-zero if B has been visited, and the signal propagates backward through the state space. This is bootstrapping: using your current estimate to sharpen itself.

Off-policy learning

One subtle but important property: Q-learning is off-policy. The target uses max_a' Q(s', a'), the greedy action in the next state, regardless of what action the agent actually took. The agent's behaviour policy (typically ε-greedy: random with probability ε, greedy otherwise) need not match the policy being evaluated.

This has a practical payoff. You can learn Q* from experience collected by a different policy, including experience stored in a replay buffer from much earlier in training. That reuse is exactly what Deep Q-Networks (DQN, Mnih et al., 2013) exploit: a large circular buffer of past transitions, sampled uniformly to break the temporal correlations that would otherwise destabilise neural-network training.

Property	Q-learning	SARSA
On/off policy	Off-policy	On-policy
Target action	max over all actions	Action actually taken
Cliff-walking behaviour	Walks near edge (optimal)	Avoids edge (safe under ε-greedy)
Convergence	To Q*	To Q^π (current policy)

Scaling with function approximation

Tabular Q-learning stores one value per (s, a) pair. For even modest continuous state spaces, this is impossible. The natural extension is to parameterise Q(s, a; θ) with a function approximator, usually a neural network, and perform gradient descent on the TD error.

The DQN paper (arXiv:1312.5602) made this work reliably by adding two stabilising tricks:

Experience replay. Transitions are stored in a replay buffer; mini-batches are drawn uniformly, breaking correlations.
Target network. A separate, periodically-updated copy of the network generates the targets, preventing the moving-target problem where chasing a shifting Q estimate causes divergence.

Both tricks are workarounds for the same underlying tension: neural networks trained with bootstrapped targets can diverge, because the target depends on the very weights being updated.

When it falls down

Maximisation bias. The update uses max_a' Q(s', a'), which systematically overestimates Q* when Q values have estimation noise. In stochastic environments this bias can be severe. Double Q-learning (van Hasselt et al., arXiv:1509.06461) fixes this by separating action selection from action evaluation using two independent estimators.

Large or continuous action spaces. Computing max_a' Q(s', a') requires evaluating Q over every possible action. In continuous action spaces this is intractable without special structure. Policy gradient methods, which directly parameterise a policy, handle continuous actions naturally and sidestep this bottleneck entirely.

Sample inefficiency. Tabular Q-learning can require millions of environment steps to converge, even for small problems. Off-policy replay helps, but the fundamental issue is that each transition provides one scalar of signal. Model-based approaches that plan against a learned dynamics model are substantially more sample-efficient.

Non-stationarity and partial observability. Q-learning assumes a Markov state - the current observation is sufficient for optimal decisions. When the environment is partially observable (a POMDP), the Markov assumption breaks and Q-learning on raw observations gives a policy that is suboptimal in a hard-to-diagnose way. Recurrent network extensions (DRQN) address this partially, at considerable complexity cost.

Deadly triad. Combine function approximation + bootstrapping + off-policy learning and convergence guarantees disappear. All three are present in DQN. The system works in practice for many problems, but there are well-documented failure cases where the Q-estimates diverge or oscillate. This is not a solved problem.

The problem it solves

The update rule

Off-policy learning

Scaling with function approximation

When it falls down

Further reading