RT-2 and Web-Scale Robot Learning
RT-2 co-trains one transformer on internet vision-language data and robot trajectories by encoding actions as text tokens, transferring semantic web knowledge into robotic control.
A robot has never in its life picked up a dinosaur toy, yet asked to "move the extinct animal to the group of animals" it does exactly that. It has never been trained to identify an improvised hammer, yet handed a scattering of objects it reaches for the rock. RT-2 (Brohan et al., 2023) is the concrete demonstration that these behaviours do not require robot demonstrations of each concept at all. They fall out of one idea: take a model that already learned about extinction and tools and rocks from the internet, and teach it to emit robot actions in the same breath as it emits words.
Actions as just more language
The mechanism is almost aggressively simple, and that is the point. RT-2 starts from a large vision-language model (VLM) already pretrained on internet-scale image-text data: visual question answering, captioning, the usual web corpus of pictures paired with language. Such a model maps an image and a text prompt to a text response. RT-2's move is to make a robot action be a text response.
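To make "an action is a text response" concrete, here is a minimal sketch of one way a continuous robot action could be turned into a string of integer tokens and back. It assumes each action dimension is clipped to a fixed range and discretized into 256 uniform bins, which is how the RT-2 paper is usually summarized; the names `encode_action` and `decode_action`, the [-1, 1] range, and the space-separated format are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of uniform bins per action dimension


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Discretize each action dimension into a bin index and join the indices as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep the value inside the assumed range
        # Map [low, high] linearly onto the integer bins 0 .. NUM_BINS - 1.
        bin_index = round((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Invert encode_action: parse the integer tokens back into bin-centre values."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, e.g. an end-effector displacement.
    action = [0.3, -0.75, 1.0]
    as_text = encode_action(action)
    print(as_text)                 # a short string of integers the VLM can emit
    print(decode_action(as_text))  # approximately the original action
```

Under this scheme the model's output vocabulary does not change in kind: a captioning answer and a motor command are both short token strings, and the only robot-specific step is the decoding back into numbers. The price is quantization error, which in this sketch is bounded by half a bin width per dimension.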