RT-2 and Web-Scale Robot Learning

A robot has never in its life picked up a dinosaur toy, yet asked to "move the extinct animal to the group of animals" it does exactly that. It has never been trained to identify an improvised hammer, yet handed a scattering of objects it reaches for the rock. RT-2 (Brohan et al., 2023) is the concrete demonstration that these behaviours do not require robot demonstrations of each concept at all. They fall out of one idea: take a model that already learned about extinction and tools and rocks from the internet, and teach it to emit robot actions in the same breath as it emits words.

Actions as just more language

The mechanism is almost aggressively simple, and that is the point. RT-2 starts from a large vision-language model (VLM) already pretrained on internet-scale image-text data: visual question answering, captioning, the usual web corpus of pictures paired with language. Such a model maps an image and a text prompt to a text response. RT-2's move is to make a robot action itself a text response.
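To make "an action as a text response" concrete, here is a minimal sketch of one way a continuous action vector could be turned into a string of integers and back. It assumes uniform discretization into 256 bins per dimension, which is the scheme the RT-2 paper describes. The function names, the [-1, 1] range, and the example vectors are hypothetical choices for illustration, not the authors' code.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed bin count per action dimension


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to an integer bin and join them as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the assumed range
        fraction = (clipped - low) / (high - low)  # rescale to 0.0 .. 1.0
        bin_index = min(int(fraction * NUM_BINS), NUM_BINS - 1)
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert encode_action by mapping each integer token to the centre of its bin."""
    width = (high - low) / NUM_BINS
    return [low + (int(token) + 0.5) * width for token in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, just to show the round trip.
    as_text = encode_action([-1.0, 0.0, 1.0])
    print(as_text)                 # the string a language model would be trained to emit
    print(decode_action(as_text))  # what a controller would recover from that string
```

The round trip is lossy by at most half a bin width per dimension, which is the price of letting the model treat actions as ordinary output tokens.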
