Markov Decision Processes

The problem before the framework

Imagine you are teaching a robot to navigate a warehouse. At every moment it can move forward, turn left, or turn right. Sometimes the floor is slippery and it drifts sideways even when told to go straight. Occasionally it bumps into a shelf and loses time. You want it to reach the dispatch bay as fast as possible, on average, despite the slippery floors.

How do you even write that down? "Reward good behaviour and punish bad" is a slogan, not a specification. What counts as a state? How does randomness enter? When does the problem end? A Markov Decision Process (MDP) answers all of these questions with one compact formalism, and virtually every reinforcement learning algorithm in existence is secretly solving an MDP.

The five-element definition

An MDP is the tuple (S, A, P, R, gamma):

Symbol	Name	What it captures
S	State space	Every distinct situation the agent can find itself in
A	Action space	Every choice available to the agent
P(s' \| s, a)	Transition function	Probability of landing in s' after taking action a in state s
R(s, a, s')	Reward function	Scalar signal received on each transition
gamma in [0, 1)	Discount factor	How much future rewards are worth relative to immediate ones

The Markov property is the load-bearing assumption: the next state depends only on the current state and the chosen action, not on the full history of how you got there. Formally,

P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0) = P(s_{t+1} | s_t, a_t)

This assumption is what makes the problem tractable. If the full history mattered, the state space would grow exponentially with time. In the warehouse example, the robot's current position and battery level constitute the state; how it drove there is irrelevant.

Policies, returns, and what the agent is actually maximising

A policy pi maps states (or histories, for non-Markovian baselines) to actions. A deterministic policy is pi(s) = a; a stochastic policy is pi(a | s) = probability of choosing a in state s.

The return G_t is the discounted sum of future rewards from time t:

G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    = sum_{k=0}^{inf} gamma^k * R_{t+k+1}

The discount factor gamma does two jobs at once: it keeps the infinite sum finite (since gamma < 1), and it encodes a preference for sooner rewards. With gamma = 0 the agent only cares about the next step; with gamma = 0.99 it plans almost a hundred steps ahead.

The agent's objective is to find a policy pi that maximises the expected return* from every starting state:

pi* = argmax_pi  E_pi [ G_t | s_t = s ]  for all s in S

Value functions and the Bellman equations

Rather than optimising pi directly, most algorithms build up a value function that scores each state (or state-action pair) under a given policy.

The state-value function V^pi(s) is the expected return when starting in s and following pi:

V^pi(s) = E_pi [ G_t | s_t = s ]

The action-value function Q^pi(s, a) asks: what is the expected return if I take action a in state s and then follow pi?

Q^pi(s, a) = E_pi [ G_t | s_t = s, a_t = a ]

These two quantities are linked by the Bellman expectation equations, which express the value of a state as a one-step lookahead plus the discounted value of successor states:

V^pi(s) = sum_{a} pi(a|s) * sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V^pi(s') ]

The Bellman equations are recursive: the value of a state is defined in terms of the values of its successors. This recursion is the engine that powers dynamic programming methods (policy iteration, value iteration) and, at a distance, temporal-difference learning and Q-learning.

The optimal value function V*(s) satisfies the Bellman optimality equation, which replaces the policy-weighted average with a max:

V*(s) = max_{a} sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V*(s') ]

Once V* is known, the optimal policy is greedy with respect to it: pick the action that achieves the max in the equation above.

From theory to practice: a small grid world

Consider a 4x4 grid. States are grid cells; actions are {up, down, left, right}. Hitting a wall keeps you in place. Terminal states (top-left, bottom-right) give reward 0; all other transitions give reward -1. gamma = 1.

A short value-iteration trace:

Iteration 0:  V(s) = 0 for all s
Iteration 1:  V(terminal) = 0, V(adjacent) = -1, rest = 0
Iteration 2:  V(terminal) = 0, V(adjacent) = -1,
              V(two-steps-away) = -2, ...
Converges:    V(corner opposite to terminal) = -6

The converged values directly encode how many steps each cell is from the nearest exit. The optimal policy reads off as "move toward lower V(s')", i.e., gradient descent on the value landscape.

When it falls down

Partial observability. If the agent cannot observe the full state (a robot without a map, a card game where the opponent's hand is hidden) the Markov property breaks. The observation o_t is a noisy function of s_t. The correct model is then a Partially Observable MDP (POMDP), and exact solutions are PSPACE-hard. Practical workarounds involve recurrent networks or belief-state representations, both of which re-introduce history implicitly.

Continuous and high-dimensional state spaces. Grid worlds are tractable because S is small. A robot with 30 joints has a state space of dimension 60+ (positions and velocities). Exact dynamic programming is impossible; function approximation (neural networks) is unavoidable, and convergence guarantees largely vanish. The DQN paper (Mnih et al., 2013) demonstrated that deep Q-networks could handle Atari frames (a 33,600-dimensional pixel state), but training is brittle and sample-inefficient compared to the tabular case.

Reward function design. The MDP framework assumes R is given. In practice, designing a reward function that produces the intended behaviour is hard. Reward hacking (Goodhart's law in RL form) occurs when the agent finds a high-return policy that violates the designer's intent. A cleaning robot rewarded for "not seeing any mess" might learn to cover its camera.

Non-stationarity. MDPs assume P and R are fixed. Multi-agent environments violate this: the effective transition function from one agent's perspective changes as other agents adapt their policies.

Discount factor sensitivity. gamma is not a free lunch. gamma close to 1 makes variance in the return estimate very high (long horizons, high variance trajectories). gamma too low encourages myopic policies that ignore long-term consequences. There is no principled way to set gamma from first principles; it is almost always a hyperparameter tuned by trial.