Model-Based Reinforcement Learning

AlphaGo Zero played roughly 4.9 million games against itself during its three-day training run before it could beat the AlphaGo version that defeated Lee Sedol. In domains where an accurate model can be learned, a model-based agent can often reach a comparable policy with far fewer environment interactions, because it spends most of its compute on a learned simulator rather than the real world. That asymmetry is the central argument for model-based reinforcement learning (MBRL).

The fundamental split: model-free vs. model-based

In model-free RL, the agent learns a value function or policy entirely from samples of real experience. Each gradient step consumes data collected by actually executing actions. The environment is a black box: you query it, observe (s, a, r, s'), and move on. No structure is assumed.

Model-based RL adds one object: a dynamics model M that approximates the environment's transition and reward functions:

M : (s_t, a_t) --> (s_{t+1}, r_t)

Once trained, M can be queried as often as compute allows, typically in large batches and without touching the real environment. This changes the data budget: real interactions are expensive (robot wear, simulator wall-clock time, API costs), while synthetic rollouts through M are cheap.
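
As a minimal sketch of what M can look like, assume a small discrete environment with deterministic transitions, so the model can be a lookup table filled from observed transitions. The names `TabularModel` and `true_step`, and the 5-state chain itself, are illustrative assumptions rather than part of any library.

```python
class TabularModel:
    """Hypothetical lookup-table model: stores the last outcome seen for each (s, a)."""

    def __init__(self):
        self.table = {}  # (state, action) -> (next_state, reward)

    def update(self, s, a, r, s_next):
        # With deterministic dynamics, remembering the latest outcome is enough.
        self.table[(s, a)] = (s_next, r)

    def predict(self, s, a):
        # Mirrors M : (s_t, a_t) --> (s_{t+1}, r_t); only valid for pairs already seen.
        return self.table[(s, a)]


def true_step(s, a):
    """Toy 'real' environment (an assumption): 5-state chain, reward 1 at the right end."""
    s_next = min(max(s + a, 0), 4)  # a is -1 (left) or +1 (right)
    return s_next, (1.0 if s_next == 4 else 0.0)


model = TabularModel()
s = 0
# A fixed action sequence that visits every (state, action) pair once.
for a in [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]:
    s_next, r = true_step(s, a)       # the expensive "real" interaction
    model.update(s, a, r, s_next)
    s = s_next

# From here on, queries go to the model instead of the environment.
print(model.predict(3, 1))  # (4, 1.0)
```

A learned neural network would replace the dictionary in any realistic setting; the table only makes the interface concrete.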

The trade-off is model bias: errors in M compound over long rollouts, turning a small prediction mistake into a completely fictitious trajectory. Every design decision in MBRL is, in some form, an attempt to exploit the sample efficiency of the model while limiting the damage its errors cause.
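
To make the compounding concrete, here is a toy calculation under a strong simplifying assumption: scalar linear dynamics in which the learned coefficient is off by 2%. The constants `TRUE_COEF` and `MODEL_COEF` are invented for illustration.

```python
def rollout(coef, s0, horizon):
    """Roll the scalar dynamics s_{t+1} = coef * s_t forward for `horizon` steps."""
    s = s0
    for _ in range(horizon):
        s = coef * s
    return s


TRUE_COEF = 1.00   # assumed "real" dynamics
MODEL_COEF = 1.02  # learned model with a 2% one-step error

errors = {}
for horizon in (1, 5, 20, 50):
    real = rollout(TRUE_COEF, 1.0, horizon)
    imagined = rollout(MODEL_COEF, 1.0, horizon)
    errors[horizon] = abs(imagined - real)
    print(f"H={horizon:2d}  |imagined - real| = {errors[horizon]:.3f}")
```

In this sketch the error after one step is 0.02, but it grows geometrically with the horizon, which is why the rollout length is such a sensitive design choice.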

How agents use a learned model

There are three broad strategies, and production systems often combine them.

1. Dyna-style background planning

Sutton's Dyna architecture (1991) remains the cleanest illustration. The agent collects real experience, updates M, then generates K synthetic transitions from M and uses them to update the policy or value function alongside the real data. K, the number of synthetic updates per real step, is a hyperparameter that controls the trade-off between sample efficiency and model bias.

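for each real environment step:
    observe (s, a, r, s_next)
    store (s, a, r, s_next) in the replay buffer
    update M with (s, a, r, s_next)
    repeat K times:                          # K synthetic updates per real step
        s_sim = sample a state from the replay buffer
        a_sim = policy(s_sim)
        s_next_sim, r_sim = M(s_sim, a_sim)  # imagined transition, same output order as M above
        update policy/value with (s_sim, a_sim, r_sim, s_next_sim)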
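
The loop above can be sketched as runnable code for a toy problem. This version follows the classic tabular Dyna-Q variant, which replays previously observed state-action pairs instead of asking the policy for a_sim, because a lookup-table model can only answer for pairs it has seen. The chain environment, the constants, and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 6, 5
ACTIONS = (-1, 1)                    # move left / move right
ALPHA, GAMMA, EPS = 0.5, 0.95, 0.1   # step size, discount, exploration rate


def env_step(s, a):
    """Toy 'real' environment (an assumption): deterministic chain, reward 1 at GOAL."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def greedy(Q, s, rng):
    # Break ties at random so an untrained agent does not always pick one action.
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])


def q_update(Q, s, a, r, s_next):
    # One-step Q-learning update; the goal state is terminal, so it has no future value.
    future = 0.0 if s_next == GOAL else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * future - Q[(s, a)])


def dyna_q(n_real_steps, K, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}                                 # M: (s, a) -> (s_next, r)
    s = 0
    for _ in range(n_real_steps):
        a = rng.choice(ACTIONS) if rng.random() < EPS else greedy(Q, s, rng)
        s_next, r = env_step(s, a)             # real interaction
        q_update(Q, s, a, r, s_next)           # learn from real data
        model[(s, a)] = (s_next, r)            # update M
        for _ in range(K):                     # K planning updates using M only
            s_sim, a_sim = rng.choice(list(model))
            s_next_sim, r_sim = model[(s_sim, a_sim)]
            q_update(Q, s_sim, a_sim, r_sim, s_next_sim)
        s = 0 if s_next == GOAL else s_next    # start a new episode at the goal
    return Q


Q = dyna_q(n_real_steps=500, K=20)
```

With these settings the learned Q-table should prefer "move right" in every non-goal state. Setting K=0 skips the planning loop and recovers plain Q-learning, which gives a simple baseline for comparing how many real steps each variant needs.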

MBPO (Janner et al., NeurIPS 2019) formalised this with a theoretical bound: branching short synthetic rollouts (often just a few steps long) from states in the real data limits the compounding of model error while still providing enough imagined experience to accelerate learning. On MuJoCo locomotion tasks, MBPO approaches the asymptotic performance of strong model-free baselines with far fewer environment samples; the paper reports, for example, that its Ant result at 300 thousand steps matches SAC at 3 million.
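
The branching idea can be sketched in a few lines. This is not MBPO's implementation, which uses an ensemble of probabilistic neural networks and a soft actor-critic learner; `toy_model`, `toy_policy`, and the list standing in for a replay buffer are hypothetical.

```python
import random


def branched_rollouts(model, policy, real_states, horizon, n_starts, rng):
    """Generate short model rollouts, each starting from a state seen in real data."""
    synthetic = []
    for _ in range(n_starts):
        s = rng.choice(real_states)      # branch point comes from real experience
        for _ in range(horizon):         # a small horizon limits compounding error
            a = policy(s)
            s_next, r = model(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic


# Hypothetical stand-ins for a learned model and the current policy.
def toy_model(s, a):
    s_next = 0.9 * s + a
    return s_next, -abs(s_next)          # reward favours staying near zero


def toy_policy(s):
    return -0.5 * s


rng = random.Random(0)
real_states = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # pretend replay buffer
batch = branched_rollouts(toy_model, toy_policy, real_states,
                          horizon=3, n_starts=50, rng=rng)
print(len(batch))  # 150 synthetic transitions from 50 three-step branches
```

Because every branch restarts from a real state, model error can accumulate for at most `horizon` steps before the rollout is anchored to real data again.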

2. Latent-space imagination

World Models (Ha and Schmidhuber, 2018) and the Dreamer family (Hafner et al., 2019) learn a compact latent representation of the state and train the dynamics model in that compressed space. Instead of predicting pixel-level observations at every step, the model predicts the next latent vector. In Dreamer, the imagined rollout is differentiable, so policy gradients can be backpropagated through it directly.
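
A hedged sketch of what backpropagating through an imagined rollout means, assuming a scalar latent state with known linear dynamics and a one-parameter linear policy. Real systems use neural networks and an autodiff library; here the backward pass is written by hand so the example needs nothing beyond the standard library. The constants `A`, `B`, `C` and both function names are invented for illustration.

```python
A, B, C = 0.9, 0.5, 0.1   # assumed latent dynamics z' = A*z + B*u, and action cost weight


def imagined_return(theta, z0, horizon):
    """Return of an imagined rollout under the linear policy u = theta * z."""
    z, total = z0, 0.0
    for _ in range(horizon):
        u = theta * z
        z = A * z + B * u
        total += -(z ** 2) - C * u ** 2   # reward: keep the latent and the action small
    return total


def return_and_grad(theta, z0, horizon):
    """Forward rollout, then a hand-written backward pass through every imagined step."""
    zs, us = [z0], []
    for _ in range(horizon):
        us.append(theta * zs[-1])
        zs.append(A * zs[-1] + B * us[-1])
    total = sum(-(zs[t + 1] ** 2) - C * us[t] ** 2 for t in range(horizon))

    grad_theta = 0.0
    grad_z = 0.0                           # gradient of later rewards w.r.t. z_{t+1}
    for t in reversed(range(horizon)):
        d_z_next = -2.0 * zs[t + 1] + grad_z
        d_u = -2.0 * C * us[t] + B * d_z_next
        grad_theta += d_u * zs[t]          # because u_t = theta * z_t
        grad_z = A * d_z_next + theta * d_u
    return total, grad_theta


theta = 0.0
for _ in range(200):                       # plain gradient ascent on the imagined return
    _, g = return_and_grad(theta, z0=1.0, horizon=10)
    theta += 0.05 * g
```

No environment step is taken anywhere in this loop: the policy parameter improves purely from gradients that flow backwards through the model's own predictions.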
