Monte Carlo Methods

Blackjack dealers do not explain their deck. You cannot write down the transition probabilities, you cannot derive the expected value analytically, but you can play thousands of hands and track what happens from each state. That is, essentially, what Monte Carlo reinforcement learning does: replace the model with experience.

The Bellman equations tell us that value functions can be decomposed recursively. Temporal-difference (TD) methods exploit that structure to update after every step. Monte Carlo ignores the recursion entirely and waits until an episode ends, then backs up the actual return. That distinction carries real consequences for bias, variance, and where each method breaks down.

The Core Idea: Averaging Actual Returns

In a Markov Decision Process, the state-value function is defined as:

V(s) = E[G_t | S_t = s]

where G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... is the discounted sum of future rewards. The expectation is over trajectories generated by the agent's policy.

Monte Carlo estimation replaces that expectation with a sample mean. Run the policy from state s to the end of the episode, compute the actual discounted return G_t, and update an average:

V(s) ← V(s) + α [G_t - V(s)]

This is the incremental mean update. With α = 1/N(s) where N(s) is the visit count, it converges to the exact mean. In practice a constant small α is often used to track non-stationary policies.

First-visit vs. every-visit. If a state appears multiple times in one episode, first-visit MC uses only the return from the first occurrence; every-visit MC averages all of them. Both converge to V_π(s) as episodes accumulate, but first-visit has cleaner theoretical properties (unbiased, independent samples per episode).

Why Monte Carlo is Unbiased but High-Variance

The fundamental trade-off between Monte Carlo and TD methods comes down to bias versus variance.

TD(0) updates using a bootstrapped estimate: R_{t+1} + γV(S_{t+1}). That bootstrap introduces bias because V(S_{t+1}) is itself an approximation, not the true value. Monte Carlo uses the actual return G_t, which is an unbiased estimate of V_π(s) under the current policy. No approximation enters.

The cost is variance. The return G_t is the product of many stochastic decisions and transitions. A single bad roll of dice late in a 200-step episode contaminates the entire update. TD(0), by contrast, uses only one transition, so the variance of each update is much lower.

A useful framing:

Property	Monte Carlo	TD(0)
Bias	None (unbiased)	Yes (bootstrap bias)
Variance	High	Low
Requires episode end	Yes	No
Works in continuing tasks	No	Yes
Sensitive to Markov property	No	Yes

The sensitivity row matters. TD methods implicitly assume the Markov property holds: V(S_{t+1}) being a valid target requires that the future depends on the current state, not on history. Monte Carlo makes no such assumption. If your environment violates the Markov property (e.g., partial observability, memory-dependent dynamics), MC estimates remain consistent while TD targets can be systematically wrong.

Monte Carlo Control: Learning the Optimal Policy

Value estimation is only half the problem. To improve a policy, you need action values Q(s, a), not just state values, because the agent must compare what happens when it takes each action.

Monte Carlo control alternates between:

Policy evaluation: run many episodes under the current policy π, average returns to estimate Q_π(s, a).
Policy improvement: act greedily with respect to the updated Q.

A key problem surfaces immediately. If the policy is deterministic, many state-action pairs are never visited. The estimates for unvisited pairs remain uninitialised, and greedy improvement will never explore them. Two standard fixes:

Exploring starts: require every episode to begin with a randomly chosen state-action pair. Impractical in most real systems.
On-policy with epsilon-greedy: use a soft policy that selects the greedy action with probability 1-ε and a random action with probability ε. This guarantees all pairs are visited infinitely often as episodes accumulate.

Off-policy MC adds a third option: collect data under a behaviour policy b (e.g., fully random), then estimate Q_π for a different target policy π using importance sampling. The importance sampling ratio corrects for the distribution mismatch:

G_t^{π/b} = (π(A_t|S_t)/b(A_t|S_t)) * ... * (π(A_T|S_T)/b(A_T|S_T)) * G_t

Ordinary importance sampling is unbiased; weighted importance sampling has lower variance and is almost always preferred in practice. Off-policy MC is conceptually elegant but the variance of the importance ratio grows exponentially with episode length, making it unreliable for long horizons.

The Lambda Bridge: Connecting MC to TD

The spectrum between full Monte Carlo and one-step TD is captured by the eligibility trace parameter λ ∈ [0, 1]. The TD(λ) target is a geometric mixture of n-step returns:

G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^{n-1} G_t^(n)

where G_t^(n) is the n-step return. Setting λ = 0 recovers TD(0); setting λ = 1 recovers Monte Carlo (given episodic tasks). This is not just a theoretical curiosity. REINFORCE, the original policy gradient algorithm, is a Monte Carlo method at heart: it samples complete trajectories and uses the full return G_t to weight the log-probability gradient. The GAE (Generalised Advantage Estimation) paper (Schulman et al., 2016, arXiv:1506.02438) shows that trading off λ between 0 and 1 controls the bias-variance frontier directly for policy gradient variance reduction.

When It Falls Down

Episodic requirement. Monte Carlo needs an episode to end before any update occurs. Robotics tasks with 10-minute episodes, or continuing tasks with no natural termination, make pure MC impractical or impossible. TD methods, which update on every step, are the only option in those settings.

High variance in long episodes. A 500-step episode accumulates variance across every stochastic choice. The estimate for states visited early in the episode can be extremely noisy, requiring orders of magnitude more samples than TD to achieve the same mean-squared error on value estimates.

Off-policy importance sampling blowup. When the behaviour and target policies differ substantially, the importance ratio for a 100-step trajectory can be astronomically large or near zero. Effective sample size collapses. In practice, off-policy MC works well only for short episodes or when the two policies are close.

No intermediate credit. Because updates wait for the episode end, MC assigns no credit to intermediate states before the final reward arrives. In sparse-reward environments with long delays between actions and consequences, this makes learning extremely slow. TD's step-by-step bootstrapping propagates information faster.

Non-stationary and function approximation instability. When Q(s, a) is approximated by a neural network and off-policy data is replayed, MC targets are less stable than TD targets because they include the compounded randomness of every future step. This is one reason that DQN (Mnih et al., 2013, arXiv:1312.5602) uses one-step TD targets rather than full-return MC targets even though MC would be unbiased.

The Core Idea: Averaging Actual Returns

Why Monte Carlo is Unbiased but High-Variance

Monte Carlo Control: Learning the Optimal Policy

The Lambda Bridge: Connecting MC to TD

When It Falls Down

Further Reading