Dynamic Programming for RL

Richard Bellman coined "dynamic programming" in 1953 partly to obscure the mathematical nature of his work from a defence secretary who was hostile to research. The name stuck, and so did the insight: a decision problem with optimal substructure can be solved by systematically combining solutions to smaller sub-problems. In reinforcement learning, that insight crystallises into two algorithms - policy iteration and value iteration - that are provably optimal yet quietly impractical at scale.

The Setup: Why DP Works Here

A Markov Decision Process (MDP) gives you:

A finite state space S and action space A
A transition model P(s' | s, a): the probability of landing in state s' after taking action a in state s
A reward function R(s, a, s')
A discount factor gamma in [0, 1)

The key property is the Markov property: the future depends only on the current state, not on the history that led there. This is what makes bootstrapping valid. Once you trust that a state captures all relevant information, you can express the value of being in state s as a function of the values of its successors - and that recursion is exactly the Bellman equation.

The Bellman expectation equation for a policy pi:

V^pi(s) = sum_a  pi(a|s) * sum_s'  P(s'|s,a) [R(s,a,s') + gamma * V^pi(s')]

In matrix form: V^pi = R^pi + gamma * P^pi * V^pi, which has the closed-form solution V^pi = (I - gamma * P^pi)^{-1} R^pi. This is the policy evaluation step.

The Bellman optimality equation replaces the weighted average over actions with a max:

V*(s) = max_a  sum_s'  P(s'|s,a) [R(s,a,s') + gamma * V*(s')]

DP algorithms solve these equations iteratively.

Policy Iteration

Policy iteration alternates between two steps until convergence:

1. Policy Evaluation - fix policy pi, repeatedly apply the Bellman expectation update until V^pi converges:

# Pseudocode: policy evaluation
while True:
    delta = 0
    for s in states:
        v = V[s]
        V[s] = sum over a: pi(a|s) * sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])
        delta = max(delta, |v - V[s]|)
    if delta < theta:  # convergence threshold
        break

2. Policy Improvement - act greedily with respect to the current value function:

pi'(s) = argmax_a  sum_s'  P(s'|s,a) [R(s,a,s') + gamma * V(s')]

The policy improvement theorem guarantees that pi' is at least as good as pi (V^{pi'}(s) >= V^{pi}(s) for all s), and strictly better unless pi was already optimal. Since there are finitely many deterministic policies (|A|^|S| of them), the algorithm must terminate.

Convergence rate: policy iteration converges in a polynomial number of iterations in |S| and |A| - typically very few in practice (often under 20 for small MDPs). Each evaluation step itself requires multiple sweeps over all states.

Value Iteration

Value iteration collapses evaluation and improvement into a single Bellman optimality update, applied repeatedly:

# Pseudocode: value iteration
while True:
    delta = 0
    for s in states:
        v = V[s]
        V[s] = max over a: sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])
        delta = max(delta, |v - V[s]|)
    if delta < theta:
        break

# Extract greedy policy
pi(s) = argmax_a: sum over s': P(s'|s,a) * (R(s,a,s') + gamma * V[s'])

Unlike policy iteration, value iteration never maintains an explicit policy during the sweep - it just pushes value information backwards through the MDP one step at a time. Think of it as starting from terminal (or absorbing) states and propagating discounted future rewards backwards.

Comparison at a glance:

Property	Policy Iteration	Value Iteration
Update target	Bellman expectation	Bellman optimality
Inner loop	Full evaluation per step	One sweep per step
Convergence	Finite (exact)	Asymptotic (to epsilon-optimal)
Iterations to converge	Few (exact policy switches)	More, but each is cheaper
Result	Exact optimal policy	Epsilon-optimal policy

In practice the two often converge in similar wall-clock time on small MDPs. Value iteration is usually preferred for its simplicity.

Generalised Policy Iteration

Sutton and Barto's generalised policy iteration (GPI) is the unifying perspective: virtually every RL algorithm, including modern deep RL methods, can be viewed as some form of interleaving evaluation and improvement. Policy iteration runs evaluation to convergence before improving. Value iteration improves after a single evaluation step. TD learning and Q-learning are online, approximate variants of the same structure.

The convergence proof for DP rests on the Bellman operator being a contraction mapping under the sup-norm, with contraction factor gamma. Repeated application of a contraction converges geometrically to a unique fixed point - the optimal value function.

||T V - T V'||_inf  <=  gamma * ||V - V'||_inf

This is not an empirical observation - it is a theorem, and it is why DP is the bedrock from which approximate methods are judged.

When It Falls Down

No model, no DP. Both algorithms require P(s' | s, a) and R(s, a, s') in closed form. Real environments - robot joints, financial markets, video games - do not hand you their transition probabilities. Model-based RL can learn P, but estimation error propagates into value errors, and the guarantees weaken accordingly.

State-space explosion. The update sweeps cost O(|S|^2 * |A|) per iteration. A gridworld with 10x10 cells is trivial. An Atari game has roughly 10^{70} possible screen states. Even tabular Q-learning with experience replay (DQN, arXiv:1312.5602) sidesteps this by approximating V via a neural network - but then the contraction guarantee is gone.

Synchronous sweeps are wasteful. Updating every state equally ignores the fact that most states are rarely visited or irrelevant to the current policy. Asynchronous DP (in-place updates, prioritised sweeping) can focus computation on high-error states, but adds implementation complexity.

Continuous state/action spaces. Exact DP cannot represent a continuous value function in a table. You need function approximation, and the moment you introduce it, convergence can break - fitted value iteration is known to diverge in pathological cases even with linear function approximators.

Optimality is policy-optimal, not globally optimal. DP finds the optimal policy for the given reward function and transition model. If either is misspecified (which they always are, to some degree), the resulting policy optimises a proxy, not the real objective.

The Setup: Why DP Works Here

Policy Iteration

Value Iteration

Generalised Policy Iteration

When It Falls Down

Further Reading