🤖

The Big Picture

What ACT is and why it matters
act robotics architecture

Action Chunking Transformer (ACT) is one of the most effective approaches to robot imitation learning. Given a few human demonstrations, it learns to replicate complex manipulation tasks — picking, placing, inserting — with remarkable precision.

The core idea: instead of predicting one robot action at a time (like a language model predicts one token), ACT predicts an entire chunk of future actions simultaneously. This turns out to be dramatically more effective.

ACT combines three powerful ideas: (1) an encoder-decoder transformer for action generation, (2) the DETR approach from object detection for parallel decoding, and (3) a CVAE-style variable to handle the multimodality of demonstrations.
Full ACT architecture (Zhao et al., 2023) — CVAE encoder (left, training only) and main encoder-decoder (right)

Before we dive into ACT itself, we need to build up the foundations. We'll start with the transformer's core mechanism — attention — and work our way up to the full architecture.

This walkthrough follows Willem's study notes, enriched with interactive diagrams and self-test questions. By the end, you'll understand every component of ACT and why each design choice was made.

🔑

Self-Attention

Q, K, V — the building blocks of transformers
transformer attention

At the heart of every transformer is self-attention. It's the mechanism that lets each element in a sequence look at every other element and decide what's relevant.

Given an input matrix X (where each row is a token/position embedding), we compute three matrices:

Q = X W_Q    K = X W_K    V = X W_V

These are just linear projections — multiply X by different learned weight matrices to get different "views" of the data:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I carry?"

The attention output is then:

Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V

The dot product QK^T measures how much each query matches each key. We scale by sqrt(d) to keep the softmax from becoming too peaked when the dimension is large. The softmax turns these scores into attention weights, which we use to take a weighted sum of the values.

Self-attention: each position attends to all other positions via Q, K, V projections
Why do we divide by sqrt(d) in the attention formula? What would happen without it?
Without the scaling, when d is large the dot products QK^T grow in magnitude, pushing the softmax into regions where it has extremely small gradients (near 0 or 1). This makes training unstable. Dividing by sqrt(d) keeps the variance of the dot products at approximately 1, regardless of the dimensionality.
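As a concrete illustration, here is a minimal single-head self-attention in NumPy. Dimensions and random inputs are arbitrary; a real transformer adds multi-head projections, residuals, and layer norm on top of this:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape [n, d]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # [n, n] query-key match scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)  # shape [4, 8]
```

Every position's output is a mixture of all positions' values, with mixture weights decided by the query-key dot products.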
🔀

Causal vs. Bidirectional

Two flavors of attention for different jobs
transformer attention

Not all attention is created equal. The two main variants serve fundamentally different purposes:

Causal (masked) attention is used in decoder-only models like GPT. Position t can only attend to positions 0 through t. This preserves temporal ordering — the model can't "peek" at future tokens. It's enforced by masking future positions to negative infinity before the softmax.

Bidirectional attention is used in encoders (like BERT) and in encoder-decoder cross-attention. Every position can attend to every other position. This gives the model full context but means you can't do autoregressive generation.

Left: causal mask (GPT-style) — Right: bidirectional (BERT-style / encoder)
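The difference is easy to see in code — a minimal NumPy sketch, where the causal mask is just a matrix of -inf above the diagonal added to the scores before the softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 5  # sequence length

# Bidirectional (encoder / BERT-style): no mask, every position sees all others.
bidirectional = softmax(np.zeros((n, n)))

# Causal (decoder / GPT-style): future positions get -inf before the softmax,
# so position t can only attend to positions 0..t.
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)
causal = softmax(np.zeros((n, n)) + causal_mask)
# With all-zero scores, row t puts uniform weight 1/(t+1) on positions 0..t
# and exactly zero weight on the future.
```

Since exp(-inf) = 0, masked positions receive zero attention weight after the softmax — that is all "masking" means here.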

Why does this matter for ACT? This is the critical design decision. If ACT used causal attention in its decoder (like GPT generating one token at a time), action t could only depend on actions 0 through t-1. The authors wanted the decoder to reason about the entire output sequence at once — and they found a clever way to do it using DETR-style queries (we'll get there in step 5).

The encoder in ACT uses bidirectional attention over the input (images + robot state). The decoder also gets to see all positions — no causal mask needed — thanks to the DETR trick.
In a standard autoregressive decoder, why is causal masking necessary? And what limitation does it impose on the generated sequence?
Causal masking prevents information leakage during training — without it, the model could trivially "copy" the answer from future positions. The limitation: each action is generated conditioned only on past actions, so the model can't consider future context when making a decision. ACT sidesteps this entirely by generating all actions in parallel via learned queries.
📸

The ACT Encoder

Images + robot state → contextual embeddings
act architecture cnn

The ACT encoder is the right-hand side of the architecture (the blue transformer). It takes in the robot's current observation and produces a set of rich contextual embeddings for the decoder to attend to.

The inputs to the encoder are:

1. Camera images — One or more RGB images are passed through a ResNet-18 backbone (with the classification head removed). For each image, this produces 512 feature maps of size 15×20. These are reshaped into 300 vectors of dimension 512 (the hidden dimension). Positional embeddings are added so the model knows which spatial region each vector came from.

2. Robot proprioception — The current state of the robot (14 floats — in the paper's bimanual setup, 7 joint positions per arm) is projected to dimension 512 using a single linear layer.

3. Style variable z — A latent variable from the CVAE (more on this in step 7). During training, this carries information about the specific trajectory. During inference, it's just zeros.

ACT encoder: CNN features + robot state + style variable z → contextual embeddings

The reshaping operation is worth pausing on: each of the 300 vectors concatenates all 512 feature activations at a particular spatial location. So vector i represents "everything the CNN sees" at spatial position i of the image.

All these inputs get concatenated into a single sequence and passed through the transformer encoder's self-attention layers. The output: a set of embeddings that capture the full context of the current observation.
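The reshape described above can be verified in a few lines of NumPy (the feature tensor here is random, standing in for real ResNet-18 output):

```python
import numpy as np

# Stand-in for ResNet-18 output on one image: 512 channels on a 15x20 grid.
rng = np.random.default_rng(0)
features = rng.normal(size=(512, 15, 20))

# Flatten the spatial grid, then move channels last: [512, 15, 20] -> [300, 512].
tokens = features.reshape(512, 15 * 20).T

# Token i holds all 512 channel activations at spatial position i of the grid.
row, col = 3, 7
i = row * 20 + col
assert tokens.shape == (300, 512)
assert np.array_equal(tokens[i], features[:, row, col])
```

Each of the 300 tokens is one column of the feature map — "everything the CNN sees" at that spatial location — ready to be treated as a sequence element by the transformer.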

Why use a pretrained CNN backbone instead of feeding raw pixels into the transformer?
Raw pixels are high-dimensional and spatially redundant. A CNN backbone like ResNet-18 compresses the image into semantically meaningful features (edges, textures, objects) at reduced spatial resolution. This dramatically reduces the sequence length the transformer needs to process (300 vectors vs. thousands of pixel patches) and provides better inductive biases for visual features.
🎯

DETR-Style Decoder

Learned queries replace autoregressive generation
act detr architecture

This is the most ingenious part of ACT. In a standard transformer decoder (like GPT), you feed in the output tokens you've generated so far and predict the next one. This requires causal masking and sequential generation.

ACT borrows from DETR (DEtection TRansformer), an object detection model. DETR's key innovation: instead of feeding actual data into the decoder, you feed in learned query embeddings — fixed vectors that the model learns during training.

In DETR, each query learns to "look for" a particular object in the image. In ACT, each query represents a time step in the action sequence. If the chunk size is 100, there are 100 learned queries, one per predicted action.

DETR-style decoder: learned queries attend to encoder output → parallel action prediction

Critical implementation detail: The actual input to the decoder is a vector of all zeros. The learned query embeddings are added as positional encodings at every decoder layer:

DETR decoder cross-attention
```python
tgt2 = self.multihead_attn(
    query=self.with_pos_embed(tgt, query_pos),  # zeros + learned queries
    key=self.with_pos_embed(memory, pos),       # encoder output + pos
    value=memory,                               # encoder output
    attn_mask=memory_mask,
    key_padding_mask=memory_key_padding_mask,
)[0]
```

Because we're not passing in any ground-truth actions, there's no need for causal masking. Each query can attend to all other queries and to the full encoder output. This means when predicting action at time step t, the model can consider what it's predicting at time steps t-1 and t+1.

The decoder input is zeros. The queries are positional encodings. Each layer adds queries to the running state, giving free residual connections that help gradient flow. No ground truth goes in, so no causal mask is needed.
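Stripped of the multi-head projections and self-attention layers, the idea can be sketched in NumPy — zeros as decoder input, learned query embeddings added on top, cross-attention into the encoder output. Dimensions here are illustrative (302 stands for 300 image tokens plus state and z):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, chunk, n_ctx = 64, 100, 302           # hidden dim, chunk size, encoder tokens

memory = rng.normal(size=(n_ctx, d))     # encoder output (fixed context)
query_pos = rng.normal(size=(chunk, d))  # learned query embeddings (trained in practice)
tgt = np.zeros((chunk, d))               # decoder input: all zeros

# Cross-attention: queries come from tgt + query_pos, keys/values from memory.
q = tgt + query_pos                      # the with_pos_embed(tgt, query_pos) step
scores = q @ memory.T / np.sqrt(d)       # [chunk, n_ctx]
out = softmax(scores) @ memory           # [chunk, d] — no causal mask anywhere
```

All 100 per-timestep representations come out of one matrix multiply, in parallel, each free to attend to the full observation.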
What are two advantages of the DETR-style decoder over an autoregressive decoder for action prediction?
(1) Bidirectional reasoning: each predicted action can be influenced by both earlier and later actions in the chunk, leading to more coherent trajectories. (2) Parallel generation: all actions are predicted in a single forward pass, no sequential loop needed — this is much faster at inference time.
🔗

Encoder ↔ Decoder

How the full architecture connects
act architecture

Let's put the full picture together. The ACT architecture has two main flows during training:

Right side (blue): The main encoder-decoder transformer.

  • Encoder receives: CNN image features + robot state + style variable z
  • Decoder receives: zeros as input, learned query embeddings as positional encoding
  • Output: predicted actions for each time step in the chunk

Left side (gray): The CVAE encoder (a separate, smaller transformer).

  • Sees the ground-truth action sequence we're trying to predict
  • Compresses it into the style variable z
  • Only used during training — discarded at inference
Full ACT architecture: CVAE encoder (left) feeds style variable z to the main encoder-decoder (right)

The loss function is straightforward: L1 distance between predicted and actual actions at each time step, plus the CVAE's KL regularizer on z. No fancy reward shaping, no reinforcement learning — just supervised regression on demonstration data.

This simplicity is part of ACT's appeal. The architectural innovations (DETR queries + CVAE) do the heavy lifting, allowing a simple loss to be highly effective.

🎨

The Style Variable z

Handling multimodality with a CVAE
cvae training

Here's a fundamental problem with imitation learning: there are infinitely many valid trajectories from point A to point B. Your demonstrations might reach for an object from the left, from the right, from above — all valid, all different.

If you simply train a model to minimize L1 loss against demonstrations, it might learn the average of all trajectories — which could be a trajectory that goes nowhere useful. This is the "multimodality problem."

Without z: model averages trajectories (bad). With z: model can generate specific trajectory styles.

ACT's solution: the Conditional Variational Autoencoder (CVAE) style variable.

During training: A separate encoder (the left-side transformer) looks at the ground-truth action sequence and compresses its "style" — the unique characteristics of this particular trajectory — into a low-dimensional variable z. This z is then passed to the main encoder as additional context.

The main encoder-decoder learns: "Given these images, this robot state, and this style hint, predict these specific actions." The style variable resolves the ambiguity.

During inference: We discard the CVAE encoder entirely. We set z = 0 (zeros). This says to the model: "I don't care about any particular style — just give me a reasonable trajectory."

Think of z as a "trajectory fingerprint." During training, it tells the model which specific demonstration style to reproduce. During inference, setting z=0 asks for the "average" style — but now the model has learned to produce coherent trajectories, not blurry averages, because each training example had a specific z.
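A toy sketch of the two regimes (the `encode_style` function here is a stand-in for the real CVAE encoder, which is a transformer producing mu and logvar via learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 32

def encode_style(action_chunk):
    # Stand-in for the CVAE encoder: in the real model this maps the
    # ground-truth action sequence to (mu, logvar) of q(z | actions).
    mu = action_chunk.mean(axis=0)[:z_dim]
    logvar = np.full(z_dim, -1.0)
    return mu, logvar

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Training: z carries the style of this specific demonstration.
actions = rng.normal(size=(100, 64))
mu, logvar = encode_style(actions)
z_train = sample_z(mu, logvar)

# Inference: the CVAE encoder is discarded and z is simply zeros.
z_infer = np.zeros(z_dim)
```

The reparameterization keeps sampling differentiable during training; at inference the whole left-hand branch collapses to a zero vector.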
Why does setting z=0 at inference produce good results instead of the "average trajectory" problem we were trying to avoid?
During training, the model learned to generate trajectories conditioned on z. The KL divergence loss in the CVAE encourages z to be normally distributed around zero. So z=0 represents the most "typical" or "default" trajectory style, not an average of trajectories. The model has learned to produce a single coherent trajectory for any given z — z=0 just happens to select the most common style.
📦

Action Chunking

Why predicting n steps beats predicting 1
chunking act

Chunking is the idea that gives ACT its name. Instead of predicting the next single action, the model predicts the next n actions (a "chunk") all at once.

The chunk size is a hyperparameter — typically 50 to 100 time steps. At each observation, the model outputs a full trajectory chunk, but in practice you execute some fraction of it before querying the model again.

The surprising finding: chunking dramatically improves performance across all methods, not just ACT:

Performance vs. chunk size: larger chunks improve results up to a point, then plateau

Why does chunking work? The explanation draws from cognitive science:

Humans don't plan movement at the level of individual muscle activations. We think in terms of cohesive actions: "reach for the cup," "lift it up," "bring it to my mouth." We chunk movements into meaningful units.

The same principle applies to neural networks. It's "easier" to learn and reproduce sequences of coordinated movements than individual atomic actions. Predicting a chunk forces the model to reason about the intent behind a movement, not just the next incremental step.

Chunking is not ACT-specific — it's a general principle. Any method that predicts action sequences improves with chunking. This suggests that the temporal structure of actions has an inherent "chunk size" that models benefit from matching.

In practice, temporal ensembling smooths out the transitions between chunks. When you query the model at time t and again at t+k, the overlapping predictions are averaged with exponential weighting, giving priority to more recent predictions.
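A sketch of temporal ensembling under these numbers. The decay rate `m` and the 7-dim action are illustrative, and the exact weighting convention varies by implementation; the structure — overlapping chunks blended per time step — is the point:

```python
import numpy as np

chunk, k, m = 100, 10, 0.1   # chunk size, re-query interval, decay rate (illustrative)
rng = np.random.default_rng(0)

# Pretend we stored the chunk predicted at each past query time:
# preds[tq] has shape [chunk, action_dim] and covers steps tq .. tq + chunk - 1.
preds = {tq: rng.normal(size=(chunk, 7)) for tq in range(0, 60, k)}

def ensembled_action(t):
    # Gather every stored prediction that covers step t (oldest query first).
    covering = [preds[tq][t - tq] for tq in sorted(preds) if tq <= t < tq + chunk]
    acts = np.stack(covering)
    ages = np.arange(len(acts))[::-1]   # the newest prediction gets age 0
    w = np.exp(-m * ages)               # exponential weights favoring recent predictions
    return (w / w.sum()) @ acts         # weighted average action for step t

a = ensembled_action(50)                # blends the 6 overlapping predictions for step 50
```

Executing the ensembled action rather than the latest raw prediction is what smooths the seams between chunks.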

If chunk size is 100 and you re-query the model every 10 steps, how many overlapping action predictions exist for step t=50? Why is this overlap useful?
At step t=50, you'd have predictions from queries made at t=0, t=10, t=20, t=30, t=40, and t=50 — that's 6 overlapping predictions. The overlap is useful because it enables temporal ensembling: averaging multiple predictions from different viewpoints reduces jitter and creates smoother, more robust trajectories.

Training & Inference

Putting it all together
training inference act

Training pipeline:

  1. Sample a demonstration trajectory and pick a random time step t
  2. Feed the image(s) at time t and robot state at time t to the main encoder
  3. Feed the ground-truth action chunk (steps t to t+chunk_size) to the CVAE encoder → get z
  4. Pass z to the main encoder alongside image features and state
  5. Decoder uses learned queries to predict the action chunk
  6. Compute L1 loss between predicted and actual actions + KL divergence on z
  7. Backpropagate and update all parameters
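The loss from step 6 can be written out directly — L1 reconstruction plus the closed-form KL divergence between the CVAE posterior and a standard normal (the `kl_weight` value here, the CVAE's beta, is illustrative):

```python
import numpy as np

def act_training_loss(pred_actions, true_actions, mu, logvar, kl_weight=10.0):
    # L1 reconstruction over the whole chunk.
    l1 = np.abs(pred_actions - true_actions).mean()
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return l1 + kl_weight * kl

# Sanity check: perfect reconstruction with a standard-normal posterior gives 0.
actions = np.zeros((100, 14))
loss = act_training_loss(actions, actions, np.zeros(32), np.zeros(32))
```

The KL term is what pushes z toward N(0, I) during training, which is exactly why z = 0 is a sensible choice at inference.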

Inference pipeline (simplified):

  1. Capture current camera image(s) and robot state
  2. Set z = 0 (discard CVAE encoder)
  3. Forward pass through the main encoder-decoder
  4. Get predicted action chunk (e.g., 100 future actions)
  5. Execute actions with temporal ensembling from previous chunks
  6. Repeat from step 1 every k steps
At inference, the CVAE encoder is gone. The model is just: images + state → encoder → decoder with learned queries → chunk of actions. Simple, fast, effective.

The results speak for themselves. ACT dramatically outperforms previous behavior cloning methods across a range of manipulation tasks. The combination of DETR-style parallel decoding, CVAE for multimodality, and action chunking creates an approach that's both elegant and practical.

The key takeaways:

  • Parallel decoding via learned queries — no causal mask, bidirectional reasoning
  • CVAE style variable — resolves trajectory multimodality during training
  • Action chunking — forces the model to reason about movement intent
  • Temporal ensembling — smooth transitions between prediction windows

Each piece is important, but it's their combination that makes ACT so potent. Transformers excel at picking up subtle patterns in data, and these architectural choices give them the right inductive biases for robot control.

🧪

Recap & Self-Test

Test your understanding of the full architecture
quiz act

You've now walked through every component of ACT. Let's make sure it all clicks with some harder questions:

1. What goes into the decoder as input values, and what role do the learned queries play?
The decoder receives a tensor of zeros as input values. The learned query embeddings are added as positional encodings at every decoder layer. This means the queries shape the representation through repeated addition (residual-like connections), while the actual "content" comes entirely from cross-attending to the encoder output.
2. Why can't you just train a standard sequence-to-sequence model with teacher forcing for action prediction?
Teacher forcing with causal masking means action t can only see actions 0 to t-1. This prevents the model from considering future context when generating each action. For robot trajectories where global coherence matters (e.g., the reach angle affects the entire subsequent grasp motion), this is a significant limitation. The DETR approach allows each action to attend to all others.
3. If you removed the CVAE (style variable z) from ACT, what specific failure mode would you expect?
Without z, the model would try to minimize L1 loss against all demonstration trajectories simultaneously. When demonstrations contain different "styles" (e.g., reaching from the left vs. the right), the model would learn the mean trajectory, which could be a path that goes through the middle — physically invalid or suboptimal. The CVAE resolves this by conditioning each training example on its specific style.
4. The CNN backbone produces 512 feature maps of size 15x20. After reshaping, we get 300 vectors of dim 512. Walk through this reshape step-by-step.
The CNN outputs a feature tensor of shape [512, 15, 20] — 512 channels, each a 15x20 spatial grid. The reshape transposes this to [15*20, 512] = [300, 512]. Each of the 300 vectors corresponds to one spatial position in the feature grid and contains all 512 channel values at that position. This converts a spatial feature map into a sequence suitable for the transformer.
5. How does temporal ensembling work, and what problem does it solve?
When the model is queried every k steps with chunk size n, consecutive chunks overlap by n-k steps. For each time step, multiple predictions exist from different query times. Temporal ensembling computes a weighted average (exponentially weighted, favoring more recent predictions) across all predictions for that step. This solves the problem of jerky transitions at chunk boundaries and makes the executed trajectory smoother.

Papers to read next:

  • Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (Zhao et al., 2023) — the ACT paper
  • End-to-End Object Detection with Transformers (Carion et al., 2020) — DETR
  • Attention Is All You Need (Vaswani et al., 2017) — the original transformer