Foundations
Monte Carlo Methods
Monte Carlo methods estimate value functions by averaging complete episode returns, making them the simplest model-free approach but one that requires episodic tasks and carries high variance.
intermediate · 7 min read
Blackjack dealers do not explain their deck. You cannot write down the transition probabilities, you cannot derive the expected value analytically, but you can play thousands of hands and track what happens from each state. That is, essentially, what Monte Carlo reinforcement learning does: replace the model with experience.
The Bellman equations tell us that value functions can be decomposed recursively. Temporal-difference (TD) methods exploit that structure to update after every step. Monte Carlo ignores the recursion entirely and waits until an episode ends, then backs up the actual return. That distinction carries real consequences for bias, variance, and where each method breaks down.
The Core Idea: Averaging Actual Returns
In a Markov Decision Process, the state-value function is defined as:
V(s) = E[G_t | S_t = s]
where G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... is the discounted sum of future rewards. The expectation is over trajectories generated by the agent's policy.
Monte Carlo estimation replaces that expectation with a sample mean. Run the policy from state s to the end of the episode, compute the actual discounted return G_t, and update an average:
V(s) ← V(s) + α [G_t - V(s)]
This is the incremental mean update. With α = 1/N(s) where N(s) is the visit count, it converges to the exact mean. In practice a constant small α is often used to track non-stationary policies.
First-visit vs. every-visit. If a state appears multiple times in one episode, first-visit MC uses only the return from the first occurrence; every-visit MC averages all of them. Both converge to V_π(s) as episodes accumulate, but first-visit has cleaner theoretical properties (unbiased, independent samples per episode).
Why Monte Carlo is Unbiased but High-Variance
The fundamental trade-off between Monte Carlo and TD methods comes down to bias versus variance.
TD(0) updates using a bootstrapped estimate: R_{t+1} + γV(S_{t+1}). That bootstrap introduces bias because V(S_{t+1}) is itself an approximation, not the true value. Monte Carlo uses the actual return G_t, which is an unbiased estimate of V_π(s) under the current policy. No approximation enters.
The cost is variance. The return G_t is the product of many stochastic decisions and transitions. A single bad roll of dice late in a 200-step episode contaminates the entire update. TD(0), by contrast, uses only one transition, so the variance of each update is much lower.
A useful framing:
| Property | Monte Carlo | TD(0) |
|---|---|---|
| Bias | None (unbiased) | Yes (bootstrap bias) |
| Variance | High | Low |
| Requires episode end | Yes | No |
| Works in continuing tasks | No | Yes |
| Sensitive to Markov property | No | Yes |
The sensitivity row matters. TD methods implicitly assume the Markov property holds: V(S_{t+1}) being a valid target requires that the future depends on the current state, not on history. Monte Carlo makes no such assumption. If your environment violates the Markov property (e.g., partial observability, memory-dependent dynamics), MC estimates remain consistent while TD targets can be systematically wrong.
Monte Carlo Control: Learning the Optimal Policy
Value estimation is only half the problem. To improve a policy, you need action values Q(s, a), not just state values, because the agent must compare what happens when it takes each action.
Monte Carlo control alternates between:
- Policy evaluation: run many episodes under the current policy
π, average returns to estimateQ_π(s, a). - Policy improvement: act greedily with respect to the updated
Q.
A key problem surfaces immediately. If the policy is deterministic, many state-action pairs are never visited. The estimates for unvisited pairs remain uninitialised, and greedy improvement will never explore them. Two standard fixes:
- Exploring starts: require every episode to begin with a randomly chosen state-action pair. Impractical in most real systems.
- On-policy with epsilon-greedy: use a soft policy that selects the greedy action with probability
1-εand a random action with probabilityε. This guarantees all pairs are visited infinitely often as episodes accumulate.
Off-policy MC adds a third option: collect data under a behaviour policy b (e.g., fully random), then estimate Q_π for a different target policy π using importance sampling. The importance sampling ratio corrects for the distribution mismatch:
G_t^{π/b} = (π(A_t|S_t)/b(A_t|S_t)) * ... * (π(A_T|S_T)/b(A_T|S_T)) * G_t
Ordinary importance sampling is unbiased; weighted importance sampling has lower variance and is almost always preferred in practice. Off-policy MC is conceptually elegant but the variance of the importance ratio grows exponentially with episode length, making it unreliable for long horizons.
The Lambda Bridge: Connecting MC to TD
The spectrum between full Monte Carlo and one-step TD is captured by the eligibility trace parameter λ ∈ [0, 1]. The TD(λ) target is a geometric mixture of n-step returns:
G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_t^(n)
where G_t^(n) is the n-step return. Setting λ = 0 recovers TD(0); setting λ = 1 recovers Monte Carlo (given episodic tasks). This is not just a theoretical curiosity. REINFORCE, the original policy gradient algorithm, is a Monte Carlo method at heart: it samples complete trajectories and uses the full return G_t to weight the log-probability gradient. The GAE (Generalised Advantage Estimation) paper (Schulman et al., 2016, arXiv:1506.02438) shows that trading off λ between 0 and 1 controls the bias-variance frontier directly for policy gradient variance reduction.
When It Falls Down
Episodic requirement. Monte Carlo needs an episode to end before any update occurs. Robotics tasks with 10-minute episodes, or continuing tasks with no natural termination, make pure MC impractical or impossible. TD methods, which update on every step, are the only option in those settings.
High variance in long episodes. A 500-step episode accumulates variance across every stochastic choice. The estimate for states visited early in the episode can be extremely noisy, requiring orders of magnitude more samples than TD to achieve the same mean-squared error on value estimates.
Off-policy importance sampling blowup. When the behaviour and target policies differ substantially, the importance ratio for a 100-step trajectory can be astronomically large or near zero. Effective sample size collapses. In practice, off-policy MC works well only for short episodes or when the two policies are close.
No intermediate credit. Because updates wait for the episode end, MC assigns no credit to intermediate states before the final reward arrives. In sparse-reward environments with long delays between actions and consequences, this makes learning extremely slow. TD's step-by-step bootstrapping propagates information faster.
Non-stationary and function approximation instability. When Q(s, a) is approximated by a neural network and off-policy data is replayed, MC targets are less stable than TD targets because they include the compounded randomness of every future step. This is one reason that DQN (Mnih et al., 2013, arXiv:1312.5602) uses one-step TD targets rather than full-return MC targets even though MC would be unbiased.
Further Reading
- Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapters 5 and 7 - the authoritative treatment of MC prediction, MC control, and the TD(λ) unification. The full text is freely available at incompleteideas.net/book/the-book-2nd.html.
- Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalised Advantage Estimation" - shows how the MC/TD trade-off maps directly onto policy gradient variance: https://arxiv.org/abs/1506.02438.
- Mnih et al. (2013), "Playing Atari with Deep Reinforcement Learning" - a concrete example of why practitioners reach for TD over MC in deep RL: https://arxiv.org/abs/1312.5602.
- OpenAI Spinning Up, "Vanilla Policy Gradient" - shows Monte Carlo returns in use inside a working policy gradient implementation: https://spinningup.openai.com/en/latest/algorithms/vpg.html.