Foundations
Dynamic Programming for RL
Dynamic programming solves RL problems exactly by bootstrapping value estimates across states using the Bellman equations, but only when you have a perfect model of the environment.
intermediate · 7 min read
Richard Bellman coined "dynamic programming" in 1953 partly to obscure the mathematical nature of his work from a defence secretary who was hostile to research. The name stuck, and so did the insight: a decision problem with optimal substructure can be solved by systematically combining solutions to smaller sub-problems. In reinforcement learning, that insight crystallises into two algorithms - policy iteration and value iteration - that are provably optimal yet quietly impractical at scale.
The Setup: Why DP Works Here
A Markov Decision Process (MDP) gives you:
- A finite state space S and action space A
- A transition model P(s' | s, a): the probability of landing in state s' after taking action a in state s
- A reward function R(s, a, s')
- A discount factor gamma in [0, 1)
The key property is the Markov property: the future depends only on the current state, not on the history that led there. This is what makes bootstrapping valid. Once you trust that a state captures all relevant information, you can express the value of being in state s as a function of the values of its successors - and that recursion is exactly the Bellman equation.
The Bellman expectation equation for a policy pi:
V^pi(s) = sum_a pi(a|s) * sum_s' P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]
In matrix form: V^pi = R^pi + gamma * P^pi * V^pi, which has the closed-form solution V^pi = (I - gamma * P^pi)^{-1} R^pi. This is the policy evaluation step.
The Bellman optimality equation replaces the weighted average over actions with a max:
V*(s) = max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma * V*(s')]
DP algorithms solve these equations iteratively.
Policy Iteration
Policy iteration alternates between two steps until convergence:
1. Policy Evaluation - fix policy pi, repeatedly apply the Bellman expectation update until V^pi converges:
# Pseudocode: policy evaluation
while True:
delta = 0
for s in states:
v = V[s]
V[s] = sum over a: pi(a|s) * sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])
delta = max(delta, |v - V[s]|)
if delta < theta: # convergence threshold
break
2. Policy Improvement - act greedily with respect to the current value function:
pi'(s) = argmax_a sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
The policy improvement theorem guarantees that pi' is at least as good as pi (V^{pi'}(s) >= V^{pi}(s) for all s), and strictly better unless pi was already optimal. Since there are finitely many deterministic policies (|A|^|S| of them), the algorithm must terminate.
Convergence rate: policy iteration converges in a polynomial number of iterations in |S| and |A| - typically very few in practice (often under 20 for small MDPs). Each evaluation step itself requires multiple sweeps over all states.
Value Iteration
Value iteration collapses evaluation and improvement into a single Bellman optimality update, applied repeatedly:
# Pseudocode: value iteration
while True:
delta = 0
for s in states:
v = V[s]
V[s] = max over a: sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])
delta = max(delta, |v - V[s]|)
if delta < theta:
break
# Extract greedy policy
pi(s) = argmax_a: sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])
Unlike policy iteration, value iteration never maintains an explicit policy during the sweep - it just pushes value information backwards through the MDP one step at a time. Think of it as starting from terminal (or absorbing) states and propagating discounted future rewards backwards.
Comparison at a glance:
| Property | Policy Iteration | Value Iteration |
|---|---|---|
| Update target | Bellman expectation | Bellman optimality |
| Inner loop | Full evaluation per step | One sweep per step |
| Convergence | Finite (exact) | Asymptotic (to epsilon-optimal) |
| Iterations to converge | Few (exact policy switches) | More, but each is cheaper |
| Result | Exact optimal policy | Epsilon-optimal policy |
In practice the two often converge in similar wall-clock time on small MDPs. Value iteration is usually preferred for its simplicity.
Generalised Policy Iteration
Sutton and Barto's generalised policy iteration (GPI) is the unifying perspective: virtually every RL algorithm, including modern deep RL methods, can be viewed as some form of interleaving evaluation and improvement. Policy iteration runs evaluation to convergence before improving. Value iteration improves after a single evaluation step. TD learning and Q-learning are online, approximate variants of the same structure.
The convergence proof for DP rests on the Bellman operator being a contraction mapping under the sup-norm, with contraction factor gamma. Repeated application of a contraction converges geometrically to a unique fixed point - the optimal value function.
||T V - T V'||_inf <= gamma * ||V - V'||_inf
This is not an empirical observation - it is a theorem, and it is why DP is the bedrock from which approximate methods are judged.
When It Falls Down
No model, no DP. Both algorithms require P(s' | s, a) and R(s, a, s') in closed form. Real environments - robot joints, financial markets, video games - do not hand you their transition probabilities. Model-based RL can learn P, but estimation error propagates into value errors, and the guarantees weaken accordingly.
State-space explosion. The update sweeps cost O(|S|^2 * |A|) per iteration. A gridworld with 10x10 cells is trivial. An Atari game has roughly 10^{70} possible screen states. Even tabular Q-learning with experience replay (DQN, arXiv:1312.5602) sidesteps this by approximating V via a neural network - but then the contraction guarantee is gone.
Synchronous sweeps are wasteful. Updating every state equally ignores the fact that most states are rarely visited or irrelevant to the current policy. Asynchronous DP (in-place updates, prioritised sweeping) can focus computation on high-error states, but adds implementation complexity.
Continuous state/action spaces. Exact DP cannot represent a continuous value function in a table. You need function approximation, and the moment you introduce it, convergence can break - fitted value iteration is known to diverge in pathological cases even with linear function approximators.
Optimality is policy-optimal, not globally optimal. DP finds the optimal policy for the given reward function and transition model. If either is misspecified (which they always are, to some degree), the resulting policy optimises a proxy, not the real objective.
Further Reading
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed., Chapters 4 (Dynamic Programming) and 3 (Finite MDPs). Available at the authors' site; the MIT Press edition is the definitive reference.
- David Silver's UCL RL Course - Lecture 3: Planning by Dynamic Programming - concise derivations of policy and value iteration with worked grid-world examples.
- OpenAI Spinning Up: Key Concepts in RL - accessible walkthrough of MDPs, value functions, and the Bellman equations before moving to model-free methods.
- Mnih, V. et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 - the paper that showed how to scale Q-learning (a model-free DP descendant) to high-dimensional state spaces using neural function approximation.