Imitation Learning and Diffusion Policies

Teach a robot by showing it. Collect a few hundred demonstrations of a human driving the arm through a task, train a network to map each observed state to the action the human took, and you have a policy. This is behaviour cloning, and it is the most natural idea in robot learning. It is also the idea with the sharpest hidden failure: a policy trained this way does not fail gracefully as it gets slightly worse; it fails catastrophically, because a small error moves the robot into a state the human never visited, and there the policy has no idea what to do. Ross, Gordon and Bagnell made this precise in 2010, and the fix is not more data in the naive sense; it is a different way of modelling what a demonstration even is.

Behaviour cloning and why the errors compound

Behaviour cloning treats control as plain supervised learning. Given demonstration pairs (state, action), fit a policy pi(a | s) that reproduces the expert's action. Training is standard: minimise a prediction loss, such as squared error or negative log-likelihood, over the demonstration set. The trouble is that supervised learning assumes the test inputs are drawn from the same distribution as the training inputs, and in sequential control that assumption is false the moment the policy starts acting: the states it sees are produced by its own earlier actions, not by the expert's.
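To make the supervised-learning view concrete, here is a minimal sketch on a toy one-dimensional system. Everything in it is an assumption made for illustration, not something from the text above: the dynamics s_next = s + a, the expert that halves the state at every step, the linear policy a = k * s, and all the function names. It fits the policy by least squares on expert pairs, then rolls it out so that the states it visits are generated by its own actions.

```python
# Behaviour cloning on a toy 1-D system with dynamics s_next = s + a.
# The dynamics, the expert, and all names are illustrative assumptions.

def expert_action(s):
    """Hypothetical expert: push the state halfway back toward zero."""
    return -0.5 * s

def collect_demonstrations(start_states, horizon):
    """Roll out the expert and record the (state, action) pairs it produces."""
    pairs = []
    for s in start_states:
        for _ in range(horizon):
            a = expert_action(s)
            pairs.append((s, a))
            s = s + a  # toy dynamics
    return pairs

def fit_linear_policy(pairs):
    """Least-squares fit of a = k * s: minimise sum((k * s - a) ** 2) over k."""
    num = sum(s * a for s, a in pairs)
    den = sum(s * s for s, _ in pairs)
    return num / den

def rollout(k, s0, horizon):
    """Run the cloned policy; every state after s0 results from its own actions."""
    states = [s0]
    s = s0
    for _ in range(horizon):
        s = s + k * s
        states.append(s)
    return states

demos = collect_demonstrations(start_states=[1.0, -0.8, 0.5], horizon=5)
k = fit_linear_policy(demos)
visited = rollout(k, s0=1.0, horizon=5)
```

Because the demonstrations here are noise-free and the expert really is linear, the fitted gain matches the expert's and the rollout stays on states like those in `demos`. The point of the sketch is the structure: `fit_linear_policy` only ever sees states the expert produced, while `rollout` feeds the policy states that depend on `k`, so any fitting error would change the inputs the policy is later asked about.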
