Action Chunking Transformer (ACT) is one of the most effective approaches to robot imitation learning. Given a few human demonstrations, it learns to replicate complex manipulation tasks — picking, placing, inserting — with remarkable precision.
The core idea: instead of predicting one robot action at a time (like a language model predicts one token), ACT predicts an entire chunk of future actions simultaneously. This turns out to be dramatically more effective.
Before we dive into ACT itself, we need to build up the foundations. We'll start with the transformer's core mechanism — attention — and work our way up to the full architecture.
This walkthrough follows Willem's study notes, enriched with interactive diagrams and self-test questions. By the end, you'll understand every component of ACT and why each design choice was made.
At the heart of every transformer is self-attention. It's the mechanism that lets each element in a sequence look at every other element and decide what's relevant.
Given an input matrix X (where each row is a token/position embedding), we compute three matrices.
These are just linear projections — multiply X by different learned weight matrices to get different "views" of the data:
Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I carry?"
The attention output is then:

Attention(Q, K, V) = softmax(QK^T / sqrt(d)) · V
The dot product QK^T measures how much each query matches each key. We scale by sqrt(d) to prevent the softmax from becoming too peaked when dimensions are large. The softmax turns these scores into probabilities, and we use them to take a weighted sum of the values.
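Here is what that looks like as a minimal single-head sketch in plain PyTorch (toy dimensions, illustrative only, not ACT's actual implementation):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape [seq_len, d]."""
    Q = X @ W_q                           # "what am I looking for?"
    K = X @ W_k                           # "what do I contain?"
    V = X @ W_v                           # "what information do I carry?"
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5           # [seq_len, seq_len] query-key match scores
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ V                    # weighted sum of the values

# Toy usage: 5 tokens, embedding dimension 8
X = torch.randn(5, 8)
W_q, W_k, W_v = [torch.randn(8, 8) for _ in range(3)]
out = self_attention(X, W_q, W_k, W_v)    # shape [5, 8]
```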
Without the scaling, the entries of QK^T grow in magnitude as the dimension increases, pushing the softmax into regions where its gradients are extremely small (outputs saturated near 0 or 1). This makes training unstable. Dividing by sqrt(d) keeps the variance of the dot products at approximately 1, regardless of the dimensionality.

Not all attention is created equal. The two main variants serve fundamentally different purposes:
Causal (masked) attention is used in decoder-only models like GPT. Position t can only attend to positions 0 through t. This preserves temporal ordering — the model can't "peek" at future tokens. It's enforced by masking future positions to negative infinity before the softmax.
Bidirectional attention is used in encoders (like BERT) and in encoder-decoder cross-attention. Every position can attend to every other position. This gives the model full context but means you can't do autoregressive generation.
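The only mechanical difference between the two is a mask applied to the score matrix before the softmax. A small sketch (plain PyTorch, illustrative only):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; `causal=True` adds the GPT-style mask."""
    scores = Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5
    if causal:
        # Block attention to future positions: position t may only see 0..t.
        T = scores.shape[-1]
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V
```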
Why does this matter for ACT? This is the critical design decision. If ACT used causal attention in its decoder (like GPT generating one token at a time), action t could only depend on actions 0 through t-1. The authors wanted the decoder to reason about the entire output sequence at once — and they found a clever way to do it using DETR-style queries (we'll get there in step 5).
The ACT encoder is the right-hand side of the architecture (the blue transformer). It takes in the robot's current observation and produces a set of rich contextual embeddings for the decoder to attend to.
The inputs to the encoder are:
1. Camera images — One or more RGB images are passed through a ResNet-18 backbone (with the classification head removed). For each image, this produces 512 feature maps of size 15×20. These are reshaped into 300 vectors of dimension 512 (the hidden dimension). Positional embeddings are added so the model knows which spatial region each vector came from.
2. Robot proprioception — The current state of the robot (14 floats: 7 joint positions + 7 joint velocities, or similar) is projected to dimension 512 using a single linear layer.
3. Style variable z — A latent variable from the CVAE (more on this in step 7). During training, this carries information about the specific trajectory. During inference, it's just zeros.
The reshaping operation is worth pausing on: each of the 300 vectors concatenates all 512 feature activations at a particular spatial location. So vector i represents "everything the CNN sees" at spatial position i of the image.
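Concretely, assuming a single camera and the shapes quoted above, the reshape is just a flatten and a transpose:

```python
import torch

feat = torch.randn(512, 15, 20)          # ResNet-18 output: [channels, height, width]
tokens = feat.flatten(1).permute(1, 0)   # [512, 300] -> [300, 512]
# tokens[i] now holds all 512 channel activations at spatial position i
```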
All these inputs get concatenated into a single sequence and passed through the transformer encoder's self-attention layers. The output: a set of embeddings that capture the full context of the current observation.
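A rough sketch of how that sequence might be assembled. The latent dimension of 32, the single camera, and the token ordering are assumptions for illustration, not the authors' exact code:

```python
import torch
import torch.nn as nn

hidden = 512
proprio_proj = nn.Linear(14, hidden)   # robot state -> hidden dimension
z_proj = nn.Linear(32, hidden)         # style variable -> hidden dimension (latent dim assumed)

img_tokens = torch.randn(300, hidden)           # from the reshape above (one camera)
proprio_tok = proprio_proj(torch.randn(14))     # [512]
z_tok = z_proj(torch.zeros(32))                 # zeros at inference time

encoder_input = torch.cat([
    z_tok.unsqueeze(0),        # 1 token for the style variable
    proprio_tok.unsqueeze(0),  # 1 token for proprioception
    img_tokens,                # 300 tokens per camera
], dim=0)                      # [302, 512] sequence fed to the transformer encoder
```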
This is the most ingenious part of ACT. In a standard transformer decoder (like GPT), you feed in the output tokens you've generated so far and predict the next one. This requires causal masking and sequential generation.
ACT borrows from DETR (DEtection TRansformer), an object detection model. DETR's key innovation: instead of feeding actual data into the decoder, you feed in learned query embeddings — fixed vectors that the model learns during training.
In DETR, each query learns to "look for" a particular object in the image. In ACT, each query represents a time step in the action sequence. If the chunk size is 100, there are 100 learned queries, one per predicted action.
Critical implementation detail: The actual input to the decoder is a vector of all zeros. The learned query embeddings are added as positional encodings at every decoder layer.
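A minimal sketch of that pattern using PyTorch's built-in decoder (simplified: here the queries are added once at the input rather than inside every layer, and all shapes are illustrative):

```python
import torch
import torch.nn as nn

chunk_size, hidden = 100, 512
query_embed = nn.Embedding(chunk_size, hidden)   # one learned query per future time step

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=6,
)

memory = torch.randn(1, 302, hidden)        # encoder output from the previous step (batch of 1)
tgt = torch.zeros(1, chunk_size, hidden)    # the decoder's actual input: all zeros
queries = query_embed.weight.unsqueeze(0)   # [1, 100, 512] learned queries

# No causal mask anywhere: every query sees every other query and the full encoder output.
out = decoder(tgt + queries, memory)        # [1, 100, 512], projected to actions by a final head
```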
Because we're not passing in any ground-truth actions, there's no need for causal masking. Each query can attend to all other queries and to the full encoder output. This means when predicting action at time step t, the model can consider what it's predicting at time steps t-1 and t+1.
Let's put the full picture together. The ACT architecture has two main flows during training:
Right side (blue): The main encoder-decoder transformer.
Left side (gray): The CVAE encoder (a separate, smaller transformer).
The loss function is straightforward: L1 distance between predicted and actual joint positions/actions at each time step, plus a KL-divergence term that regularizes the CVAE's latent variable (more on that in step 7). No fancy reward shaping, no reinforcement learning — just supervised regression on demonstration data.
This simplicity is part of ACT's appeal. The architectural innovations (DETR queries + CVAE) do the heavy lifting, allowing a simple loss to be highly effective.
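As a sketch, the training objective might look like this (`mu` and `logvar` come from the CVAE encoder described in the next step; the KL weight shown is an assumed hyperparameter value, not quoted from the paper):

```python
import torch
import torch.nn.functional as F

def act_loss(pred_actions, gt_actions, mu, logvar, kl_weight=10.0):
    """L1 reconstruction loss plus the CVAE's KL regularizer."""
    recon = F.l1_loss(pred_actions, gt_actions)
    # KL divergence between the encoder's Gaussian q(z | actions, obs) and a standard normal prior
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```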
Here's a fundamental problem with imitation learning: there are infinitely many valid trajectories from point A to point B. Your demonstrations might reach for an object from the left, from the right, from above — all valid, all different.
If you simply train a model to minimize L1 loss against demonstrations, it might learn the average of all trajectories — which could be a trajectory that goes nowhere useful. This is the "multimodality problem."
ACT's solution: the Conditional Variational Autoencoder (CVAE) style variable.
During training: A separate encoder (the left-side transformer) looks at the ground-truth action sequence and compresses its "style" — the unique characteristics of this particular trajectory — into a low-dimensional variable z. This z is then passed to the main encoder as additional context.
The main encoder-decoder learns: "Given these images, this robot state, and this style hint, predict these specific actions." The style variable resolves the ambiguity.
During inference: We discard the CVAE encoder entirely. We set z = 0 (zeros). This says to the model: "I don't care about any particular style — just give me a reasonable trajectory."
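In code, the style variable boils down to the standard VAE reparameterization trick during training and a zero vector at inference (a sketch, not the authors' exact implementation):

```python
import torch

def sample_z(mu, logvar, training=True):
    """Style variable z from the CVAE encoder (reparameterization trick)."""
    if training:
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)   # sample around the encoder's prediction
    return torch.zeros_like(mu)                   # inference: "no particular style"
```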
Chunking is the idea that gives ACT its name. Instead of predicting the next single action, the model predicts the next n actions (a "chunk") all at once.
The chunk size is a hyperparameter — typically 50 to 100 time steps. At each observation, the model outputs a full trajectory chunk, but in practice you execute some fraction of it before querying the model again.
The surprising finding: chunking dramatically improves performance across all methods, not just ACT.
Why does chunking work? The explanation draws from cognitive science:
Humans don't plan movement at the level of individual muscle activations. We think in terms of cohesive actions: "reach for the cup," "lift it up," "bring it to my mouth." We chunk movements into meaningful units.
The same principle applies to neural networks. It's "easier" to learn and reproduce sequences of coordinated movements than individual atomic actions. Predicting a chunk forces the model to reason about the intent behind a movement, not just the next incremental step.
In practice, temporal ensembling smooths out the transitions between chunks. When you query the model at time t and again at t+k, the overlapping predictions for each time step are averaged with exponential weights w_i = exp(-m·i); in the paper's scheme w_0 goes to the oldest prediction, so older predictions carry slightly more weight and new observations are folded in smoothly rather than causing abrupt jumps.
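A sketch of that ensembling step (predictions ordered oldest first; the value of m is a hyperparameter, shown here with an assumed default):

```python
import torch

def temporal_ensemble(preds_for_t, m=0.01):
    """Combine every prediction made so far for the same time step t.
    preds_for_t: [num_preds, action_dim] tensor, ordered oldest first.
    Weights w_i = exp(-m * i), so the oldest prediction gets the largest weight."""
    w = torch.exp(-m * torch.arange(len(preds_for_t), dtype=torch.float32))
    w = w / w.sum()
    return (w.unsqueeze(1) * preds_for_t).sum(dim=0)
```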
Training pipeline:
1. Sample a demonstration: camera images, the robot's joint positions, and the ground-truth action sequence.
2. The CVAE encoder (left side) compresses the action sequence into the style variable z.
3. The main encoder (right side) processes the ResNet-18 image features, the projected proprioception, and z.
4. The decoder's learned queries predict the full action chunk in parallel.
5. Compute the L1 reconstruction loss (plus the KL term) and backpropagate.

Inference pipeline (simplified):
1. Observe the current camera images and joint positions.
2. Set z = 0 (the CVAE encoder is discarded).
3. Predict a chunk of the next n actions in a single forward pass.
4. Execute part of the chunk (with temporal ensembling), then query the model again.
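Putting the inference side together, a hypothetical loop might look like this (`policy` and `env` are stand-in interfaces for a trained ACT model and the robot or simulator, not real library objects):

```python
import torch

def run_episode(policy, env, chunk_size=100, execute_steps=50, z_dim=32):
    """Minimal inference loop. `policy` maps (images, qpos, z) to an action chunk;
    `env` wraps the robot or simulator. Both are hypothetical stand-ins."""
    obs, done = env.reset(), False
    while not done:
        z = torch.zeros(1, z_dim)                          # no CVAE encoder at inference
        with torch.no_grad():
            chunk = policy(obs["images"], obs["qpos"], z)  # [chunk_size, action_dim]
        for action in chunk[:execute_steps]:               # execute part of the chunk,
            obs, done = env.step(action)                   # then re-plan from fresh observations
            if done:
                break
```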
The results speak for themselves. ACT dramatically outperforms previous behavior cloning methods across a range of manipulation tasks. The combination of DETR-style parallel decoding, CVAE for multimodality, and action chunking creates an approach that's both elegant and practical.
The key takeaways:
1. DETR-style learned queries let the decoder predict the whole action chunk in parallel, with no causal masking.
2. The CVAE style variable z resolves the multimodality of human demonstrations; at inference it is simply set to zero.
3. Action chunking (plus temporal ensembling) makes the prediction problem match how coordinated movements actually unfold.
4. A plain L1 regression loss on demonstrations is enough; the architecture does the heavy lifting.
Each piece is important, but it's their combination that makes ACT so potent. The transformer's ability to pick up on subtle patterns is unparalleled, and these architectural choices give it the right inductive biases for robot control.
You've now walked through every component of ACT. Let's make sure it all clicks with some harder questions:
The ResNet-18 feature map has shape [512, 15, 20]: 512 channels, each a 15×20 spatial grid. The reshape transposes this to [15*20, 512] = [300, 512]. Each of the 300 vectors corresponds to one spatial position in the feature grid and contains all 512 channel values at that position. This converts a spatial feature map into a sequence suitable for the transformer.

Papers to read next: