Foundations
Value Functions and the Bellman Equations
Value functions assign expected cumulative reward to states and state-action pairs; the Bellman equations express these values as self-consistent recursive relationships that underpin every practical RL algorithm.
intermediate · 8 min read
Richard Bellman in 1957 needed a way to collapse an infinite-horizon decision problem into something tractable. His insight was that the value of being in a state can always be written in terms of the value of the next state. That single recursive observation is the foundation on which Q-learning, TD-learning, DDPG, and modern RLHF reward modelling all rest.
What a value function actually measures
An agent operating in a Markov Decision Process (MDP) receives a reward r_t at each timestep and discounts future rewards by a factor gamma (0 <= gamma < 1). The cumulative return from time t is:
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
A state-value function V(s) under policy pi is the expected return when starting from state s and following pi thereafter:
V^pi(s) = E_pi [ G_t | S_t = s ]
An action-value function Q(s, a) extends this by also conditioning on the first action taken:
Q^pi(s, a) = E_pi [ G_t | S_t = s, A_t = a ]
The distinction matters in practice. V(s) is used in actor-critic methods to compute a baseline; Q(s, a) is used in Q-learning and DQN to select actions directly without needing a model of the environment. Both encode the same underlying quantity - long-run return - but from different perspectives on the decision.
The Bellman expectation equations
Because the return decomposes recursively, V^pi satisfies a fixed-point equation:
V^pi(s) = sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * [ r + gamma * V^pi(s') ]
In plain terms: the value of state s equals the expected immediate reward plus the discounted expected value of the successor state, averaged over the policy and transition dynamics. This is the Bellman expectation equation for V.
The corresponding equation for Q is:
Q^pi(s, a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * sum_{a'} pi(a'|s') Q^pi(s', a') ]
These equations are not merely definitions; they are consistency constraints. Any function that satisfies them for all (s, a) under policy pi is the value function for that policy. This turns the estimation problem into a root-finding problem.
The Bellman optimality equations
For control (not just evaluation), we want the optimal value functions:
V*(s) = max_a sum_{s', r} p(s', r | s, a) * [ r + gamma * V*(s') ]
Q*(s, a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * max_{a'} Q*(s', a') ]
The max operator replaces the policy average. Notice that once you have Q*, the optimal policy is greedy: pi*(s) = argmax_a Q*(s, a). You do not need explicit transition probabilities at decision time, which is why Q-learning is a model-free algorithm.
The table below contrasts the two equation families:
| Expectation form | Optimality form | |
|---|---|---|
| Operator | Average over pi | Max over a |
| Use case | Policy evaluation | Control |
| Fixed-point | Unique for given pi | Unique (under mild MDP conditions) |
| Algorithm family | TD(0), policy gradient baselines | Q-learning, value iteration |
From equations to algorithms
The Bellman equations are not directly solvable in large state spaces (think Atari pixels or robotic joint configurations). RL algorithms approximate the solution iteratively.
Temporal Difference (TD) learning takes a single-step sample and treats the right-hand side as a learning target:
# TD(0) update for V
target = r + gamma * V(s_next) # "TD target"
delta = target - V(s) # "TD error"
V(s) += alpha * delta # gradient step
The TD error delta is the residual in the Bellman equation for that sample. Minimising delta^2 over many samples drives V toward V^pi. Q-learning does the same for Q* by using max a' Q(s', a') in the target.
DQN (Mnih et al., 2013) applied this idea with a deep convolutional network as the function approximator, two stabilising tricks (experience replay and a lagged target network), and reached human-level performance across Atari games. The target network is essential: without it, both the prediction and the bootstrap target move simultaneously, destabilising training.
When it falls down
Deadly triad. Sutton and Barto's term for the combination of (1) function approximation, (2) bootstrapping from estimated values, and (3) off-policy data. Each alone is fine; together they can cause value estimates to diverge to infinity. DQN avoids part of this by using prioritised experience replay and careful target-network lag, but the instability risk is never fully eliminated.
Discount sensitivity. The gamma parameter is a design choice with substantial downstream effects. A gamma near 1 makes the agent care about distant rewards but also makes the contraction factor of the Bellman operator (gamma) close to 1, slowing convergence and amplifying estimation noise. A low gamma produces myopic behaviour. There is no universally correct value.
Deterministic vs. stochastic transitions. In highly stochastic environments, the variance of TD targets can be very large. Multi-step returns (TD(n) or TD(lambda)) reduce bias but increase variance; this trade-off requires tuning.
Continuous action spaces. The max_a Q(s, a) step is trivial when actions are discrete but computationally expensive or intractable when actions are continuous. DDPG (Lillicrap et al., 2015) addresses this by learning a separate policy network to approximate the argmax, but introduces its own training instabilities.
Partial observability. The Bellman equations assume the Markov property: p(s' | s, a) depends only on the current state, not on history. Real environments often violate this. Wrapping observations in recurrent networks (LSTM, Transformer) is common but breaks the theoretical guarantees of standard Bellman analysis.
Reward sparsity. When rewards are rare, the TD target is 0 + gamma * V(s') for most steps, providing almost no learning signal. Value functions learn slowly or not at all. Reward shaping and hindsight experience replay are common engineering responses, but both risk introducing unintended biases.
Further reading
- Mnih, V. et al. (2013). "Playing Atari with Deep Reinforcement Learning." arXiv:1312.5602. The original DQN paper; Sections 2-3 give a clean exposition of Bellman-based Q-learning with function approximation. https://arxiv.org/abs/1312.5602
- Lillicrap, T. P. et al. (2015). "Continuous control with deep reinforcement learning." arXiv:1509.02971. Extends Bellman-based Q-learning to continuous action spaces via actor-critic. https://arxiv.org/abs/1509.02971
- Francois-Lavet, V. et al. (2018). "An Introduction to Deep Reinforcement Learning." arXiv:1811.12560. Accessible survey covering value functions, Bellman operators, and their role in modern deep RL. https://arxiv.org/abs/1811.12560
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapters 3-4 derive both expectation and optimality forms of the Bellman equations from first principles. Available free at incompleteideas.net/book/.