Vision-Language-Action Models

Ask a robot to "pick up the object that could work as a hammer" and, historically, it had no chance: nothing in its few thousand teleoperated demonstrations ever mentioned hammers, improvised tools, or the physics of striking. A vision-language-action (VLA) model can do it, not because it was trained on hammers, but because the vision-language backbone underneath it already read the internet. VLAs are the LLM playbook applied to control: take a model that learned the meaning of the visual world from web-scale image-text data, then teach it to emit actions instead of captions. The bet is that semantic generalisation transfers from language and vision into the body.

The recipe

A VLA is a vision-language model (VLM) with an action output bolted on and fine-tuned on robot demonstrations. The three moving parts:
