Foundations

Vision-Language-Action Models

Turning a pretrained vision-language model into a robot policy that maps camera images plus a language instruction to motor actions, so the robot inherits web-scale semantic knowledge it could never learn from robot data alone.

Ask a robot to "pick up the object that could work as a hammer" and, historically, it had no chance: nothing in its few thousand teleoperated demonstrations ever mentioned hammers, improvised tools, or the physics of striking. A vision-language-action (VLA) model can do it, not because it was trained on hammers, but because the vision-language backbone underneath it already read the internet. VLAs are the LLM playbook applied to control: take a model that learned the meaning of the visual world from web-scale image-text data, then teach it to emit actions instead of captions. The bet is that semantic generalisation transfers from language and vision into the body.

The recipe

A VLA is a vision-language model (VLM) with an action output bolted on and fine-tuned on robot demonstrations. The three moving parts:
