Vision-Language-Action Models
Turning a pretrained vision-language model into a robot policy that maps camera images plus a language instruction to motor actions, so the robot inherits web-scale semantic knowledge it could never learn from robot data alone.
Ask a robot to "pick up the object that could work as a hammer" and, historically, it had no chance: nothing in its few thousand teleoperated demonstrations ever mentioned hammers, improvised tools, or the physics of striking. A vision-language-action (VLA) model can do it, not because it was trained on hammers, but because the vision-language backbone underneath it already read the internet. VLAs are the LLM playbook applied to control: take a model that learned the meaning of the visual world from web-scale image-text data, then teach it to emit actions instead of captions. The bet is that semantic generalisation transfers from language and vision into the body.
The recipe
A VLA is a vision-language model (VLM) with an action output bolted on and fine-tuned on robot demonstrations. The three moving parts:
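For the action output specifically, one widely used option is to discretise each continuous action dimension into a fixed number of bins and treat the bin indices as extra tokens, so the model literally emits actions the way it would emit words. The sketch below assumes that scheme. The bin count, the action ranges, and the function names are illustrative choices rather than details taken from the text above.

```python
import numpy as np

N_BINS = 256  # assumed number of bins per action dimension (illustrative)


def actions_to_tokens(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer bin indices in [0, n_bins - 1]."""
    action = np.clip(action, low, high)        # keep values inside the valid range
    scaled = (action - low) / (high - low)     # normalise each dimension to [0, 1]
    # The upper bound maps to n_bins, so cap it at the last valid bin.
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)


def tokens_to_actions(tokens, low, high, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin in the original range."""
    return low + (tokens + 0.5) / n_bins * (high - low)


# Hypothetical 3-dimensional action with every dimension bounded by [-1, 1].
low = np.array([-1.0, -1.0, -1.0])
high = np.array([1.0, 1.0, 1.0])
action = np.array([0.0, -1.0, 1.0])

tokens = actions_to_tokens(action, low, high)
recovered = tokens_to_actions(tokens, low, high)
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width. That quantisation error is the price of reusing the language model's token vocabulary for control.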