Deep Q-Networks

Atari's Breakout can be solved by a policy that is never explicitly programmed: in 2013, a single neural network learned to dig tunnels through the brick wall after roughly 600 episodes, eventually achieving superhuman scores on 29 of 49 games. The only input was 84x84 grayscale frames. The algorithm was Deep Q-Network (DQN), and the reason it worked where prior attempts failed comes down to two surprisingly simple engineering choices layered on top of classical Q-learning.

From Q-learning to DQN

Q-learning learns a state-action value function Q(s, a) that estimates the discounted return when taking action a in state s and following the optimal policy thereafter. The update rule is:

Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

where r is the reward received, s' is the next state, gamma is the discount factor, and the bracketed term is the temporal-difference (TD) error. In the tabular case this converges, but in any environment with a high-dimensional or continuous state space a table is not tractable. The natural fix is function approximation: parameterise Q(s, a; theta) with a neural network.

The problem is that naive gradient descent on

L(theta) = E[(r + gamma * max_a' Q(s', a'; theta) - Q(s, a; theta))^2]

is catastrophically unstable. The target r + gamma * max_a' Q(s', a'; theta) is computed using the same parameters theta that are being updated, so every weight update shifts both the prediction and the target simultaneously. Correlations between consecutive frames mean the network essentially chases its own tail.

DQN resolves this with two targeted interventions.

Experience Replay

Rather than learning from transitions in the order they arrive, DQN stores each transition (s, a, r, s') in a fixed-size circular buffer called the replay memory (capacity typically 1 million transitions). At each training step a minibatch is sampled uniformly at random from this buffer.

This achieves two things:

Problem	How replay fixes it
Temporal correlation between consecutive frames	Random sampling breaks the correlation so minibatches approximate i.i.d. data
Catastrophic forgetting of rare transitions	Old transitions stay in the buffer and keep being revisited
Sample inefficiency	Each transition is reused roughly 8 times on average before eviction

The cost is memory (roughly 7 GB for 1M raw Atari frames) and the fact that off-policy data must be used; Q-learning is off-policy by design, so this fits naturally.

The Target Network

The second stabilisation trick addresses the moving-target problem directly. DQN maintains two copies of the network: the online network with parameters theta (updated every step) and the target network with parameters theta^- (held frozen and periodically copied from theta, e.g., every 10,000 steps).

The loss becomes:

L(theta) = E[(r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2]

Fixing theta^- makes the regression target approximately stationary for thousands of steps, turning what was an unstable coupled system into something resembling standard supervised learning. The tradeoff is that the targets lag reality, so the network initially chases a slightly outdated value signal; in practice this rarely hurts because the policy is changing slowly anyway.

The Network Architecture

For Atari, DQN uses a convolutional network that takes a stack of four consecutive 84x84 grayscale frames (to capture motion), passes them through three convolutional layers, and outputs a Q-value for each action in the discrete action set:

Input: (4, 84, 84) stacked frames
Conv1: 32 filters, 8x8, stride 4  -> ReLU
Conv2: 64 filters, 4x4, stride 2  -> ReLU
Conv3: 64 filters, 3x3, stride 1  -> ReLU
FC:    512 units                   -> ReLU
Output: |A| Q-values (one per action, e.g. 18 for Pong)

Outputting all Q-values in a single forward pass (rather than separately for each action) is a key efficiency decision: selecting the greedy action at inference time requires only a single forward pass followed by argmax.

Key Extensions

The vanilla DQN published in 2013 and formalised in Nature 2015 spawned a family of targeted improvements. Three are especially important:

Double DQN (van Hasselt et al., 2015): The max operator in the target uses the same network for both action selection and evaluation, causing systematic overestimation of Q-values. Double DQN decouples these: use the online network to select the action, use the target network to evaluate it.

target = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)

This single change reduces overestimation bias and improves median human-normalised performance across 49 Atari games.

Prioritised Experience Replay (Schaul et al., 2015): Uniform sampling from the replay buffer wastes capacity on already-learned transitions. Prioritised replay samples proportionally to the absolute TD error, so surprising transitions are revisited more often. Importance-sampling weights correct for the resulting distribution shift.

Dueling Networks (Wang et al., 2015): Decompose Q(s, a) into a state value V(s) and an advantage A(s, a) via two separate network heads:

Q(s, a; theta) = V(s; theta_V) + (A(s, a; theta_A) - mean_a' A(s, a'; theta_A))

For states where all actions lead to similar outcomes, the network can learn V(s) efficiently without computing a separate advantage for every action.

When It Falls Down

Deadly triad instability. Combining function approximation, off-policy learning, and bootstrapping (the TD target uses a network estimate) creates conditions where Q-values can diverge rather than converge. In practice divergence is rare on benchmark environments but has been observed and is not fully characterised theoretically.

Sparse and delayed rewards. DQN relies on the reward signal to propagate credit backward through the Bellman equation. When rewards arrive only at the end of a long episode (e.g., many text-based games, sparse manipulation tasks), DQN's random exploration rarely encounters a reward at all and fails to make meaningful progress.

Discrete actions only. The argmax over Q-values requires a finite action set. Continuous control tasks (robotic manipulation, locomotion) require either discretising the action space at the cost of exponential blowup, or switching to actor-critic methods such as DDPG or SAC.

Distributional shift between replay and behaviour. Prioritised replay exacerbates this: the distribution of transitions in the buffer diverges from the current policy's visitation distribution as training progresses. For most Atari games this matters little, but in non-stationary environments or fine-grained control tasks it can cause instability.

No notion of epistemic uncertainty. DQN's exploration strategy (epsilon-greedy: choose random action with probability epsilon) is agnostic to what the network does and doesn't know. It explores uniformly, which is highly inefficient in large state spaces. Bayesian extensions (NoisyNets, bootstrapped DQN) address this but add complexity.

Partial observability. Frame stacking gives a shallow sense of history (4 frames), but any environment requiring memory beyond roughly 0.1 seconds of gameplay will confound DQN. Recurrent extensions (DRQN with LSTM) are needed for tasks with genuine partial observability.