
Foundations

Action Tokenisation and Representation

How continuous robot actions become discrete tokens an autoregressive transformer can emit, from per-dimension binning to frequency-space compression.

intermediate · 7 min read

A transformer emits tokens from a fixed vocabulary. A robot arm wants a stream of real numbers: seven joint angles, a three-axis end-effector delta, a continuous gripper width, updated tens or hundreds of times a second. Bridging that gap is action tokenisation, and the choice you make there quietly fixes almost everything downstream: how fast you can control the robot, how many tokens each decision costs, how smooth the motion is, and whether an autoregressive policy can keep up with real time at all. Get the representation wrong and a model that reasons beautifully about the world produces jerky, imprecise, or hopelessly slow control.

The naive scheme: per-dimension binning

The first workable answer is the obvious one. Take each action dimension, clip it to a known range, and chop that range into a fixed number of uniform bins. Each real value becomes an integer bin index, and each index is a token the model predicts like any other.

RT-1 (Google, 2022) is the canonical example. It discretises each of eleven action dimensions (seven for the arm: x, y, z, roll, pitch, yaw, gripper; three for the mobile base; one mode switch) into 256 uniform bins, and the transformer predicts one 256-way categorical per dimension per timestep. RT-2 (2023) pushed the same idea into a vision-language model: it writes the discretised actions as text tokens and trains on them exactly like natural-language tokens, so a VLM pretrained on the web can emit robot actions in the same autoregressive breath as words.
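
A minimal sketch of uniform per-dimension binning may make the scheme concrete. The function names, the [-1, 1] ranges, and the sample action are illustrative assumptions, not RT-1's actual code; only the clip, bin, and 256-bin structure comes from the description above. It uses the standard library to stay dependency-free.

```python
def tokenize(action, low, high, n_bins=256):
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)        # clip to the known range
        frac = (a - lo) / (hi - lo)    # normalise to [0, 1]
        # The top edge of the range would land on index n_bins, so clamp it.
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens, low, high, n_bins=256):
    """Map bin indices back to the centre value of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 7-dimensional arm action, every dimension assumed to span [-1, 1].
low, high = [-1.0] * 7, [1.0] * 7
action = [0.013, -0.4, 0.25, 0.0, 1.0, -1.0, 0.7]

tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)
bin_width = (high[0] - low[0]) / 256
worst_error = max(abs(a - r) for a, r in zip(action, recovered))
```

Round-tripping an in-range value through `tokenize` and `detokenize` can move it by at most half a bin width, which is the precision ceiling the scheme imposes.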

The appeal is that it slots straight into existing transformer machinery. No new loss, no new head, just a bigger vocabulary or a reused slice of the text vocabulary. For coarse, low-frequency behaviours (pick this up, move there) it works.

Why per-timestep binning breaks at high frequency

The trouble is arithmetic. A per-dimension, per-timestep scheme spends action_dims × timesteps tokens on every horizon it predicts. At 3 Hz control with 7 dimensions that is tolerable. Take the same robot to 50 Hz for a dexterous, contact-rich task and the token budget explodes: predicting a one-second horizon now means 7 × 50 = 350 tokens, all decoded one at a time.

Two things go wrong at once:

  • Token-count blow-up. Autoregressive decoding is sequential, so more tokens per action means proportionally more forward passes and more latency, precisely when you have less time between control cycles.
  • Lost correlations. Adjacent timesteps in a smooth trajectory are highly redundant. Encoding each one independently ignores that structure, so the model burns capacity re-predicting near-identical values and the representation carries far less information per token than it could. Physical Intelligence's FAST paper shows this concretely: simple per-dimension binning "fails completely" on highly dexterous, high-frequency data.
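
The token arithmetic is easy to tabulate. The sketch below assumes one token per dimension per timestep, as in the naive scheme; the 14-dimensional bimanual setup and the 15 Hz rate are illustrative additions, not figures from the systems discussed here.

```python
def tokens_per_horizon(action_dims, control_hz, horizon_s=1.0):
    """Tokens needed to cover a horizon at one token per dimension per timestep."""
    timesteps = round(control_hz * horizon_s)
    return action_dims * timesteps


# Token budget for a one-second horizon across a few hypothetical setups.
budget = {
    (dims, hz): tokens_per_horizon(dims, hz)
    for dims in (7, 14)          # single arm, illustrative bimanual pair
    for hz in (3, 15, 50)
}

for (dims, hz), n_tokens in sorted(budget.items()):
    print(f"{dims:2d} dims @ {hz:2d} Hz -> {n_tokens:3d} tokens per second of motion")
```

The count grows linearly in both control rate and action dimension, so a faster, higher-dimensional robot pays twice.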

Chunking: predict a horizon, not a step

The first structural fix is to stop deciding one timestep at a time. Action chunking, introduced by ACT (Zhao et al., 2023, "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware"), has the policy predict a short block of future actions in one shot, then execute that block before re-planning. ACT learns a generative model over action sequences rather than single frames.

The immediate win is against compounding error, the imitation-learning failure where a small mistake nudges the robot into states the demonstrations never covered, and the errors snowball. Committing to a chunk means fewer independent decisions per trajectory and fewer seams at which error can accumulate. It also amortises latency: one inference produces many control steps. The tension is closed-loop reactivity, since a robot mid-chunk is briefly running open-loop, which is why chunk sizes are kept short and often blended across overlapping predictions.
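
The two execution styles described above can be sketched with a toy policy. Everything here is a stand-in: `dummy_policy` is a hypothetical placeholder for a learned model, the "dynamics" simply feed the last action back as the next observation, and the exponential weighting (oldest prediction weighted most, decay rate `m`) is an assumption modelled on the temporal-ensembling idea rather than a copy of ACT's implementation.

```python
import math


def dummy_policy(obs, chunk_size):
    """Hypothetical policy: predicts chunk_size future scalar actions from obs."""
    return [obs + 0.1 * (i + 1) for i in range(chunk_size)]


def run_chunked(n_steps, chunk_size=4):
    """Open-loop chunking: one inference, then execute the whole chunk."""
    obs, executed, policy_calls = 0.0, [], 0
    while len(executed) < n_steps:
        chunk = dummy_policy(obs, chunk_size)
        policy_calls += 1
        for action in chunk[: n_steps - len(executed)]:
            executed.append(action)
            obs = action  # toy dynamics: next observation is the last action
    return executed, policy_calls


def run_ensembled(n_steps, chunk_size=4, m=0.1):
    """Query every step and blend all overlapping predictions for that step."""
    obs, executed, chunks = 0.0, [], []  # chunks holds (start_step, actions)
    for t in range(n_steps):
        chunks.append((t, dummy_policy(obs, chunk_size)))
        # Drop chunks whose horizon no longer reaches step t.
        chunks = [(s, c) for s, c in chunks if t - s < chunk_size]
        preds = [c[t - s] for s, c in chunks]  # oldest prediction first
        weights = [math.exp(-m * i) for i in range(len(preds))]
        action = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
        executed.append(action)
        obs = action
    return executed


chunked_actions, calls = run_chunked(n_steps=8, chunk_size=4)
ensembled_actions = run_ensembled(n_steps=8, chunk_size=4)
```

`run_chunked` makes two policy calls for eight control steps, which is the latency amortisation; `run_ensembled` gives that saving back in exchange for smoother, more reactive output.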

Compression tokenisers: FAST and the frequency domain

Chunking reduces how often you decide; it does not by itself reduce the tokens per chunk. That is what compression tokenisers attack. FAST (Frequency-space Action Sequence Tokenization, Pertsch et al., Physical Intelligence, 2025) treats a chunk of actions as a signal and compresses it before tokenising.

The pipeline borrows from lossy media codecs. Apply a discrete cosine transform to each action dimension's signal over time, which concentrates a smooth trajectory's energy into a few low-frequency coefficients; quantise those coefficients, so that most of the small high-frequency ones round to zero; then losslessly compress the quantised stream with byte-pair encoding into a compact token sequence. A smooth motion that would have cost hundreds of naive tokens collapses to a small fraction of that, because the DCT captures the shape of the trajectory instead of sampling it timestep by timestep.
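
The first two stages of that pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FAST implementation: it handles a single action dimension, the quantisation `scale` of 10.0 is an arbitrary choice, the trajectory is synthetic, and the byte-pair-encoding stage is omitted entirely. The DCT is written out by hand to stay dependency-free; in practice a library routine such as `scipy.fft.dct` with `norm="ortho"` would do the same job.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sequence."""
    n = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def compress_chunk(chunk, scale=10.0):
    """DCT, then scale and round; small coefficients become zero."""
    return [round(c * scale) for c in dct(chunk)]


def decompress_chunk(quantised, scale=10.0):
    """Undo the scaling and invert the DCT (the rounding loss remains)."""
    return idct([q / scale for q in quantised])


# Synthetic smooth trajectory: one second of a single dimension at 50 Hz.
chunk = [math.sin(math.pi * t / 50) for t in range(50)]

quantised = compress_chunk(chunk)
nonzero = sum(1 for q in quantised if q != 0)
recon = decompress_chunk(quantised)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
print(f"{nonzero} of {len(chunk)} coefficients are nonzero, max error {max_err:.3f}")
```

On this toy signal only a small fraction of the 50 coefficients survive rounding, and the long runs of zeros are exactly what a downstream compressor such as BPE can squeeze into few tokens. Lowering `scale` drops more coefficients at the price of a larger reconstruction error.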

The payoff is that autoregressive VLAs become trainable on the dexterous, high-frequency tasks where binning collapses. Paired with the π0 architecture, FAST reached results competitive with diffusion-based policies while cutting training time by up to 5x, as reported by its authors, and the released FAST+ tokeniser was fitted on roughly one million real robot trajectories as a general-purpose front end.

This is the through-line worth holding onto: the representation is not a formatting detail. Tokens per decision multiplied by decode time per token gives the latency of each decision, and that latency, spread across the actions the decision yields, caps your control frequency. Tokenisation is where you spend or save your real-time budget.
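
A back-of-envelope version of that budget, with every number hypothetical: the 10 ms per-token decode time and the 40-token compressed chunk are assumptions chosen for illustration, not measurements of any system, and the estimate ignores prompt processing and everything else outside token decoding.

```python
def max_control_rate_hz(tokens_per_decision, seconds_per_token, actions_per_decision=1):
    """Upper bound on sustained control rate from decode cost alone."""
    decision_latency = tokens_per_decision * seconds_per_token
    return actions_per_decision / decision_latency


SECONDS_PER_TOKEN = 0.010  # assumed decode time per token
TARGET_HZ = 50

# Naive binning, one step per decision: 7 tokens buy 1 action.
naive_step = max_control_rate_hz(7, SECONDS_PER_TOKEN)

# Naive binning over a 50-step chunk: 350 tokens buy 50 actions (same ratio).
naive_chunk = max_control_rate_hz(350, SECONDS_PER_TOKEN, actions_per_decision=50)

# Compressed chunk: an assumed 40 tokens buy the same 50 actions.
compressed_chunk = max_control_rate_hz(40, SECONDS_PER_TOKEN, actions_per_decision=50)

for name, rate in [("naive step", naive_step),
                   ("naive chunk", naive_chunk),
                   ("compressed chunk", compressed_chunk)]:
    verdict = "meets" if rate >= TARGET_HZ else "misses"
    print(f"{name}: {rate:.1f} Hz, {verdict} the {TARGET_HZ} Hz target")
```

Under these assumptions chunking alone does not raise the ceiling, because the tokens-per-action ratio is unchanged; only cutting tokens per action does.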

When it falls down

  • Quantisation error. Binning throws away everything finer than one bin. 256 bins across a joint's full range can be too coarse for sub-millimetre insertion; the policy simply cannot express a target between two bin centres. FAST's quantisation of DCT coefficients is lossy for the same reason, just spent more wisely.
  • Token-count blow-up at high frequency. Naive per-timestep tokens scale with control rate. Above a few tens of Hz the decode cost alone can exceed the control period, so the policy cannot physically keep pace with the robot.
  • Mode averaging and mode mixing hurt smoothness and multimodality. When several valid actions exist (go left or right around an obstacle), a policy that decodes to a single expected value emits the average of the modes, a value that is often physically wrong. A categorical over bins can hold both modes within one dimension, but sampling dimensions or timesteps independently can still mix them (left in x, right in y) or flip between them from step to step, which shows up as jerky, indecisive motion. This is the classic argument for diffusion or flow-matching action heads, which model the joint distribution over the whole action instead of a per-dimension categorical.
  • Autoregression versus real time. Sequential token emission is fundamentally at odds with a hard control deadline. Chunking and compression buy headroom, but the tension never disappears; it is why some systems abandon token-by-token decoding for parallel continuous heads once frequency demands get severe.
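
The averaging failure in the list above is easy to reproduce with a toy distribution. The five bins and their probabilities below are invented for illustration: a lateral steering command where -1 means hard left, +1 means hard right, and the obstacle sits straight ahead at 0.

```python
def expected_value(bin_centres, probs):
    """Decode by averaging over bins: blends the modes together."""
    return sum(c * p for c, p in zip(bin_centres, probs))


def most_likely(bin_centres, probs):
    """Decode by taking the highest-probability bin: commits to one mode."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return bin_centres[best]


# Hypothetical bimodal prediction: left and right are both good, centre is not.
bin_centres = [-1.0, -0.5, 0.0, 0.5, 1.0]
probs = [0.45, 0.05, 0.0, 0.05, 0.45]

averaged = expected_value(bin_centres, probs)   # lands on the obstacle
committed = most_likely(bin_centres, probs)     # picks one side
```

The averaged command sits in the one bin the model gave zero probability, which is why how a distribution is decoded matters as much as how it is represented.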
