Value Functions and the Bellman Equations

Richard Bellman in 1957 needed a way to collapse an infinite-horizon decision problem into something tractable. His insight was that the value of being in a state can always be written in terms of the value of the next state. That single recursive observation is the foundation on which Q-learning, TD-learning, DDPG, and modern RLHF reward modelling all rest.

What a value function actually measures

An agent operating in a Markov Decision Process (MDP) receives a reward r_t at each timestep and discounts future rewards by a factor gamma (0 <= gamma < 1). The cumulative return from time t is:

G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

A state-value function V(s) under policy pi is the expected return when starting from state s and following pi thereafter:

V^pi(s) = E_pi [ G_t | S_t = s ]

An action-value function Q(s, a) extends this by also conditioning on the first action taken:

Q^pi(s, a) = E_pi [ G_t | S_t = s, A_t = a ]

The distinction matters in practice. V(s) is used in actor-critic methods to compute a baseline; Q(s, a) is used in Q-learning and DQN to select actions directly without needing a model of the environment. Both encode the same underlying quantity - long-run return - but from different perspectives on the decision.

The Bellman expectation equations

Because the return decomposes recursively, V^pi satisfies a fixed-point equation:

V^pi(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * [ r + gamma * V^pi(s') ]

In plain terms: the value of state s equals the expected immediate reward plus the discounted expected value of the successor state, averaged over the policy and transition dynamics. This is the Bellman expectation equation for V.

The corresponding equation for Q is:

Q^pi(s, a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * sum_{a'} pi(a'|s') Q^pi(s', a') ]

These equations are not merely definitions; they are consistency constraints. Any function that satisfies them for all (s, a) under policy pi is the value function for that policy. This turns the estimation problem into a root-finding problem.

The Bellman optimality equations

For control (not just evaluation), we want the optimal value functions:

V*(s) = max_a sum_{s', r} p(s', r | s, a) * [ r + gamma * V*(s') ]

Q*(s, a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * max_{a'} Q*(s', a') ]

The max operator replaces the policy average. Notice that once you have Q*, the optimal policy is greedy: pi*(s) = argmax_a Q*(s, a). You do not need explicit transition probabilities at decision time, which is why Q-learning is a model-free algorithm.

The table below contrasts the two equation families:

	Expectation form	Optimality form
Operator	Average over pi	Max over a
Use case	Policy evaluation	Control
Fixed-point	Unique for given pi	Unique (under mild MDP conditions)
Algorithm family	TD(0), policy gradient baselines	Q-learning, value iteration

From equations to algorithms

The Bellman equations are not directly solvable in large state spaces (think Atari pixels or robotic joint configurations). RL algorithms approximate the solution iteratively.

Temporal Difference (TD) learning takes a single-step sample and treats the right-hand side as a learning target:

# TD(0) update for V
target = r + gamma * V(s_next)          # "TD target"
delta  = target - V(s)                  # "TD error"
V(s)  += alpha * delta                  # gradient step

The TD error delta is the residual in the Bellman equation for that sample. Minimising delta^2 over many samples drives V toward V^pi. Q-learning does the same for Q* by using max a' Q(s', a') in the target.

DQN (Mnih et al., 2013) applied this idea with a deep convolutional network as the function approximator, two stabilising tricks (experience replay and a lagged target network), and reached human-level performance across Atari games. The target network is essential: without it, both the prediction and the bootstrap target move simultaneously, destabilising training.

When it falls down

Deadly triad. Sutton and Barto's term for the combination of (1) function approximation, (2) bootstrapping from estimated values, and (3) off-policy data. Each alone is fine; together they can cause value estimates to diverge to infinity. DQN avoids part of this by using prioritised experience replay and careful target-network lag, but the instability risk is never fully eliminated.

Discount sensitivity. The gamma parameter is a design choice with substantial downstream effects. A gamma near 1 makes the agent care about distant rewards but also makes the contraction factor of the Bellman operator (gamma) close to 1, slowing convergence and amplifying estimation noise. A low gamma produces myopic behaviour. There is no universally correct value.

Deterministic vs. stochastic transitions. In highly stochastic environments, the variance of TD targets can be very large. Multi-step returns (TD(n) or TD(lambda)) reduce bias but increase variance; this trade-off requires tuning.

Continuous action spaces. The max_a Q(s, a) step is trivial when actions are discrete but computationally expensive or intractable when actions are continuous. DDPG (Lillicrap et al., 2015) addresses this by learning a separate policy network to approximate the argmax, but introduces its own training instabilities.

Partial observability. The Bellman equations assume the Markov property: p(s' | s, a) depends only on the current state, not on history. Real environments often violate this. Wrapping observations in recurrent networks (LSTM, Transformer) is common but breaks the theoretical guarantees of standard Bellman analysis.

Reward sparsity. When rewards are rare, the TD target is 0 + gamma * V(s') for most steps, providing almost no learning signal. Value functions learn slowly or not at all. Reward shaping and hindsight experience replay are common engineering responses, but both risk introducing unintended biases.

What a value function actually measures

The Bellman expectation equations

The Bellman optimality equations

From equations to algorithms

When it falls down

Further reading