Foundations

Sim-to-Real Transfer

Why robot policies are trained in simulation, why they break on real hardware, and how domain randomisation closes the reality gap by making the real world look like one more random draw.

intermediate · 7 min read

A reinforcement-learning agent learning to manipulate objects needs millions of trials, and each trial in the real world costs wear on the servos, a human to reset the scene, and wall-clock time you cannot compress. In simulation the same trials run faster than real time, in parallel across thousands of environments, with no hardware to wear out and no safety envelope to respect. So you train in sim. Then you load the policy onto the robot and it fails, because the simulator was never the real world. That mismatch has a name, and closing it is one of the defining problems of applied robotics.

Why simulate at all

The case for simulation is not subtle, which is why almost every serious robot-learning result leans on it.

  • Cheap and fast. A physics engine runs many times faster than real time and scales horizontally. Data that would take months on hardware arrives in hours.
  • Safe. An RL agent early in training does the equivalent of slamming a joint into a table or flinging an object across the room. In sim that is a reset; on a real arm it is a repair bill.
  • Unlimited, labelled data. The simulator knows the exact pose, mass, and contact state of every object, so you get perfect ground-truth labels for free. No human annotation, no sensor calibration for the training set.
  • Reproducible. You can reset to an identical initial state, which real hardware never quite allows, so debugging and ablations are tractable.

The trap is that a policy optimised against a simulator learns to exploit the simulator, including its inaccuracies. The better it fits sim, the more it can overfit to physics that do not hold outside the engine.

The reality gap

The reality gap is the accumulated mismatch between simulated and real physics and rendering. It shows up on two fronts that behave differently and want different fixes.

  • Perception gap. Rendered images are not camera images. Lighting, textures, shadows, sensor noise, lens distortion, and exposure all differ, so a vision model trained on clean renders sees an out-of-distribution image on the robot.
  • Dynamics gap. Simulated physics is an approximation. Real friction is not a single coefficient, real actuators have latency and backlash, masses and inertias are estimated rather than known, and contact (the moment two rigid bodies touch) is where physics engines are least faithful.

A policy that never saw this variation during training has no learned response to it: every real observation and every real transition is slightly out of distribution, and the errors compound step by step until the behaviour diverges. Tobin et al. (2017) framed the goal cleanly: make the model robust enough that the real world looks like just one more variation it has already seen.

Domain randomisation

The dominant lever is domain randomisation: instead of trying to make one simulator perfectly match reality, randomise the simulator across a wide range of conditions so the real world falls inside that range. If training textures, lighting, and camera positions vary enough, a real photograph is simply another sample from a distribution the policy already handles.

Tobin et al. (2017) demonstrated this for perception. Training an object detector purely on simulated RGB images with randomised, deliberately non-realistic textures, lighting, and camera pose, they transferred it to a real robot and localised objects to within about 1.5 cm using no real training images at all. The renders did not look real; they looked varied, and variety was what transferred.

Peng et al. (2017) applied the same idea to dynamics rather than pixels. By randomising physical parameters such as masses, friction, damping, and actuator behaviour during training, they obtained a policy for a pushing task on a robot arm that transferred from sim to hardware without any training on the physical system. Randomising the physics forces the policy to be robust to the fact that the true parameters are unknown.

The knobs you randomise, split by which gap they attack:

Gap        | Randomise
Perception | textures, colours, lighting direction and intensity, camera pose and field of view, sensor noise, distractor objects
Dynamics   | link masses and inertias, friction coefficients, joint damping, actuator gains, control and observation latency, external forces
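
As a rough illustration of how these knobs are typically drawn, the sketch below samples a fresh set of perception and dynamics parameters for each training episode. The parameter names and numeric ranges are hypothetical and chosen only for illustration; a real setup would pass the sampled values to whatever simulator is in use.

```python
import random

# Hypothetical ranges for illustration only; real ranges depend on the robot and task.
PERCEPTION_RANGES = {
    "light_intensity": (0.3, 1.5),
    "camera_fov_deg": (50.0, 70.0),
    "pixel_noise_std": (0.0, 0.05),
}
DYNAMICS_RANGES = {
    "link_mass_scale": (0.8, 1.2),
    "friction_coeff": (0.5, 1.2),
    "joint_damping": (0.01, 0.1),
    "action_latency_s": (0.0, 0.04),
}


def sample_params(rng, ranges):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def sample_environment(rng):
    """Sample a fresh simulator configuration for one training episode."""
    return {
        "perception": sample_params(rng, PERCEPTION_RANGES),
        "dynamics": sample_params(rng, DYNAMICS_RANGES),
    }


rng = random.Random(0)
for episode in range(3):
    env_params = sample_environment(rng)
    # A real training loop would configure the simulator with env_params
    # here and then roll out the policy for one episode.
    print(episode, env_params["dynamics"]["friction_coeff"])
```

Keeping the two groups in separate dictionaries makes it easy to widen one gap's ranges without touching the other.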

Automatic domain randomisation

Fixed randomisation ranges force an awkward choice: too narrow and reality escapes the distribution; too wide and the task becomes so hard the policy learns nothing useful. OpenAI's Rubik's-cube hand (2019) addressed this with Automatic Domain Randomisation (ADR), which turns the range into a curriculum. ADR starts with a narrow distribution and widens each parameter's range automatically as the policy's performance on that setting crosses a threshold. The environment grows harder only as fast as the policy can absorb the added difficulty, so you no longer hand-tune the ranges and do not stall on a distribution the policy cannot yet handle. The result was a five-fingered hand solving a Rubik's cube, a contact-rich, high-dimensional manipulation task, trained entirely in simulation.
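
The widening rule can be sketched in a few lines. This is a simplified illustration of the idea, not OpenAI's implementation: the function `adr_step`, the `evaluate` callback, the threshold, and the step size are all assumptions, and the toy evaluator stands in for real policy rollouts.

```python
import random


def adr_step(ranges, limits, evaluate, rng, threshold=0.8, step=0.05):
    """Try to widen one randomly chosen boundary of one parameter.

    ranges:   {name: [lo, hi]} current randomisation ranges (mutated in place)
    limits:   {name: (min_lo, max_hi)} hard bounds the ranges may never exceed
    evaluate: callable(name, value) -> success rate in [0, 1] with that
              parameter pinned to `value` (hypothetical stand-in for rollouts)
    """
    name = rng.choice(sorted(ranges))
    side = rng.choice((0, 1))  # 0 = lower boundary, 1 = upper boundary
    boundary = ranges[name][side]
    # Widen only if the policy already performs well at the current boundary.
    if evaluate(name, boundary) >= threshold:
        if side == 0:
            ranges[name][0] = max(limits[name][0], boundary - step)
        else:
            ranges[name][1] = min(limits[name][1], boundary + step)
    return name, side


def toy_evaluate(name, value):
    """Toy evaluator: pretend the policy copes with friction in [0.6, 1.1]."""
    return 1.0 if 0.6 <= value <= 1.1 else 0.0


rng = random.Random(0)
ranges = {"friction_coeff": [0.8, 0.8]}  # start with no randomisation
limits = {"friction_coeff": (0.1, 2.0)}
for _ in range(200):
    adr_step(ranges, limits, toy_evaluate, rng)
print(ranges)
```

In this toy the evaluator is fixed, so the range stops growing once a boundary reaches the edge of what the pretend policy can handle. In real training the policy keeps improving between evaluations, which is what lets the range continue to expand.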

Beyond randomisation

Domain randomisation is the workhorse, not the whole toolkit, and the other approaches are usually combined with it rather than chosen instead.

  • System identification. Measure the real robot and fit the simulator's parameters to it, so the centre of your distribution sits near reality. Better calibration means you need less randomisation to cover the gap.
  • Better simulators. More faithful contact models, differentiable physics, and rendering closer to real camera output shrink the gap at the source. Fidelity has a cost in compute and in engineering time, so it trades against the brute-force breadth of randomisation.
  • Real-world fine-tuning. Pretrain in sim, then adapt with a small amount of real data. Sim does the heavy lifting on sample count; a short real-world phase corrects the residual gap that randomisation left. This hybrid is often the most practical recipe.
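
To make the system-identification step concrete, here is a minimal sketch that fits a single damping coefficient to a measured velocity trace by least squares, then centres a randomisation range on the estimate. The one-dimensional damped-velocity model, the function names, the synthetic "measurement", and the 20% range width are all assumptions made for illustration.

```python
import random


def simulate_velocity(v0, damping, dt, steps):
    """Toy simulator: velocity decays as v_next = v * (1 - damping * dt)."""
    vs = [v0]
    for _ in range(steps):
        vs.append(vs[-1] * (1.0 - damping * dt))
    return vs


def fit_damping(velocities, dt):
    """Least-squares estimate of damping from a velocity trace.

    Minimises sum((v_next - v * (1 - c * dt)) ** 2) over c, which has the
    closed-form solution below.
    """
    num = sum((v - v_next) * v for v, v_next in zip(velocities, velocities[1:]))
    den = dt * sum(v * v for v in velocities[:-1])
    return num / den


# Stand-in for a real measurement: a trace generated with damping 0.7
# plus a little sensor noise.
rng = random.Random(0)
dt = 0.01
measured = [v + rng.gauss(0.0, 0.001) for v in simulate_velocity(1.0, 0.7, dt, 200)]

damping_hat = fit_damping(measured, dt)
# Centre the randomisation range on the estimate (the 20% width is arbitrary).
damping_range = (0.8 * damping_hat, 1.2 * damping_hat)
print(damping_hat, damping_range)
```

The better the fit, the narrower the range around it can be while still covering the real value.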

When it falls down

  • Over-randomisation buys robustness with mediocrity. Widen the ranges too far and the policy hedges against conditions that never occur, converging to a conservative, average behaviour that is robust everywhere and excellent nowhere. There is a real tension between breadth and peak performance; ADR-style curricula exist precisely because static ranges are hard to set well.
  • Some effects resist randomisation. Contact dynamics, deformable objects, cloth, granular media, and fluids are exactly where simulators are least faithful, and randomising a bad model does not manufacture the physics it is missing. You cannot randomise your way out of a phenomenon the engine cannot represent.
  • Perception and dynamics need different treatment. Texture and lighting randomisation does nothing for actuator latency, and mass randomisation does nothing for a camera that sees unfamiliar glare. Diagnose which gap is failing before turning knobs; the two are independent budgets.
  • Sim metrics mislead. High reward in simulation is not reliability on hardware. A policy can look finished in sim and be brittle in the real world, so the honest measure is real-world success across many trials and conditions, not the training curve. Treat the simulator's verdict as a hypothesis, not a result.
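
One way to report real-world success honestly is to attach a confidence interval to the observed success rate, since a handful of hardware trials pins the true rate down only loosely. The sketch below uses the standard Wilson score interval; the trial counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical hardware evaluation: 18 successes in 20 real-world trials.
low, high = wilson_interval(18, 20)
print(f"observed 0.90, interval {low:.2f} to {high:.2f}")
```

With only 20 trials, an observed 90% success rate is compatible with a true rate of roughly 70%, which is why many trials across varied conditions matter more than one good run.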
