Hankyu Kim · PhysicalAI · 3 min read

Action Chunking with Transformers (ACT)

Action Chunking with Transformers (ACT) combines the representational strength of autoencoders with the contextual modeling of transformers, producing compact latent variables that generate coherent action sequences.

Action Chunking with Transformers (ACT) is not a standard Vision-Language-Action (VLA) model; rather, it is an imitation learning policy trained from scratch that has contributed to the physical AI ecosystem. Its inputs are the robot's current state and camera images used as conditioning, and its output is a chunk of 50 continuous actions.


Autoencoder, Variational Autoencoder, and Conditional VAE

An Autoencoder (AE) compresses input data into a latent representation and then reconstructs it.
It is mainly used for feature extraction, dimensionality reduction, noise reduction, and input structure learning.

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \;\longrightarrow\; \text{decoder} \;\longrightarrow\; \hat{x}$$

Here, $z$ is the compressed latent vector.
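As a concrete reference, a minimal autoencoder can be written in a few lines of PyTorch; the layer sizes below are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input x into a low-dimensional latent vector z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct x_hat from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed latent representation
        x_hat = self.decoder(z)  # reconstruction of the input
        return x_hat, z

# Training minimizes a reconstruction loss, e.g. nn.functional.mse_loss(x_hat, x).
```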

A Variational Autoencoder (VAE) extends the AE by modeling the latent space as a probability distribution rather than a single point.
The encoder produces the parameters $\mu$ and $\sigma$ of a Gaussian distribution, from which a latent vector $z$ is sampled.
This regularization enables smooth and continuous latent spaces, allowing both reconstruction and generation of new samples.

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \sim \mathcal{N}(\mu, \sigma^2) \;\longrightarrow\; \text{decoder} \;\longrightarrow\; \hat{x}$$
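The key change relative to the autoencoder sketch above is the encoder head and the sampling step. A minimal sketch of the reparameterization trick, with dimensions still chosen only for illustration, looks like this:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization: z ~ N(mu, sigma^2)
        return z, mu, logvar

# The loss adds a KL term, KL(N(mu, sigma^2) || N(0, I)), to the reconstruction loss;
# this regularizer is what keeps the latent space smooth and sample-able.
```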

A Conditional VAE (CVAE) further extends the VAE by introducing a condition $y$ (e.g., context, goals).

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \sim \mathcal{N}(\mu, \sigma^2) \;\longrightarrow\; \text{decoder (condition } y) \;\longrightarrow\; \hat{x}$$
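The only structural difference from the VAE is where the condition enters. In the simplest formulation, $y$ is concatenated to both the encoder and decoder inputs, as in this small sketch; the `encoder` and `decoder` arguments are hypothetical modules assumed to follow the shapes used above.

```python
import torch

def cvae_forward(encoder, decoder, x, y):
    """Hypothetical helper: encoder returns (z, mu, logvar); decoder maps a vector back to x-space."""
    z, mu, logvar = encoder(torch.cat([x, y], dim=-1))  # q(z | x, y)
    x_hat = decoder(torch.cat([z, y], dim=-1))          # p(x | z, y): generation is steered by y
    return x_hat, mu, logvar
```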

Structure of ACT

ACT employs a CVAE to learn robot policies.

ACT Architecture

(Paper link)

  • Encoder: Takes the current state (e.g., joint positions) together with the demonstrated action sequence and encodes them into a latent distribution $z$ using a transformer encoder.
  • Conditioning: Contextual information such as sensory inputs or goals (here, four 480×640 camera images) conditions the action prediction.
  • Decoder: Uses a transformer to reconstruct or predict future action sequences from the latent variable and the conditioning inputs.

To break this into two parts (a training-time data-flow sketch follows the list):

  1. CVAE Encoder
    Encodes an action sequence into a latent distribution (mean and variance):

    $$\text{CVAE Encoder} \;=\; \text{left transformer encoder}$$
  2. CVAE Decoder
    Samples from the latent space and predicts the action sequence conditioned on context:

    $$\text{CVAE Decoder} \;=\; \text{center transformer encoder} + \text{right transformer decoder}$$
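Putting the two parts together, a high-level sketch of ACT's training-time data flow might look as follows. The module names, shapes, and chunk size are assumptions for illustration; the official implementation differs in detail.

```python
import torch
import torch.nn.functional as F

def act_training_step(cvae_encoder, cvae_decoder, joints, images, action_chunk):
    """
    joints:       (B, state_dim)         current proprioception (joint positions)
    images:       (B, num_cams, C, H, W) camera observations used as the condition
    action_chunk: (B, k, action_dim)     ground-truth future actions (e.g., k = 50)
    """
    # 1) CVAE encoder (left transformer encoder): compress the action sequence
    #    and current state into a latent Gaussian.
    mu, logvar = cvae_encoder(joints, action_chunk)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # 2) CVAE decoder (center transformer encoder + right transformer decoder):
    #    predict the action chunk from z, conditioned on images and the current state.
    pred_chunk = cvae_decoder(z, joints, images)        # (B, k, action_dim)

    # 3) Loss: action reconstruction plus a KL regularizer on the latent.
    recon = F.l1_loss(pred_chunk, action_chunk)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# At inference time, z is set to the prior mean (a zero vector), so only the decoder runs.
```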

Recall: Transformers

Transformers differ from recurrent networks (RNNs, LSTMs) by processing input sequences in parallel while still capturing long-range dependencies.

  • Query (Q): Representation of the current token or state.
  • Key (K): Encoded representations of all tokens.
  • Value (V): Contextual embeddings carrying semantic content.

The attention mechanism computes relevance between queries and keys, then aggregates values accordingly:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This enables selective recall of relevant past information when generating the next action.
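In code, the formula maps almost one-to-one onto a few tensor operations; this sketch omits masking and dropout.

```python
import math
import torch

def attention(Q, K, V):
    """Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v). Returns the attended values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key relevance
    weights = torch.softmax(scores, dim=-1)            # normalize over the keys
    return weights @ V                                 # weighted sum of values
```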


Multi-Headed Attention

Instead of a single attention map, transformers compute multiple attention heads in parallel. Each head captures different relational patterns:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

For example, GPT-3 uses 96 heads, providing highly diverse contextual perspectives.
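A minimal multi-head module could be sketched as below; `d_model = 512` and `num_heads = 8` are the classic Transformer defaults, used here only as example values.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.w_q, self.w_k, self.w_v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.w_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.h, self.d_head).transpose(1, 2)   # (b, h, t, d_head)

    def forward(self, q, k, v):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = torch.softmax(scores, dim=-1) @ V                   # attend in each head
        b, h, t, d = heads.shape
        concat = heads.transpose(1, 2).reshape(b, t, h * d)         # Concat(head_1, ..., head_h)
        return self.w_o(concat)                                     # multiply by W^O
```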


Encoder Layer

The encoder learns how tokens relate to one another within a sequence. Each layer consists of:

  1. Input Embedding – mapping inputs to vectors
  2. Positional Encoding – preserving order information
  3. Multi-Head Self-Attention – capturing inter-token dependencies
  4. Feed-Forward Network (FFN) – nonlinear feature transformation

This stack is repeated across multiple layers, producing progressively higher-level representations of the input sequence.
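A single encoder layer built from PyTorch primitives might be sketched as follows; the post-norm placement and dimensions are implementation choices, not prescriptions.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        # Step 3: multi-head self-attention with a residual connection and layer norm
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Step 4: position-wise feed-forward network with a residual connection and layer norm
        return self.norm2(x + self.ffn(x))

# Steps 1-2 (input embedding and positional encoding) are applied once before the first layer,
# and the layer above is then stacked N times.
```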


Decoder Layer

The decoder generates output sequences step by step. Each layer consists of the components below (see the sketch after this list):

  1. Masked Multi-Head Self-Attention – prevents the model from attending to future tokens during generation
  2. Cross-Attention – connects encoder outputs to the decoder’s state
  3. Feed-Forward Network – nonlinear transformation
  4. Linear Projection + Softmax – produces the probability distribution for the next token
  5. Autoregressive Generation – predicted tokens are fed back until an end-of-sequence token is produced
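For completeness, a matching decoder-layer sketch under the same assumptions; the causal mask implements step 1, while steps 4-5 live outside the layer stack.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, memory):                 # y: decoder tokens, memory: encoder outputs
        t = y.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=y.device), diagonal=1)
        # Step 1: masked self-attention (True entries block attention to future positions)
        y = self.norm1(y + self.self_attn(y, y, y, attn_mask=causal, need_weights=False)[0])
        # Step 2: cross-attention, queries from the decoder and keys/values from the encoder
        y = self.norm2(y + self.cross_attn(y, memory, memory, need_weights=False)[0])
        # Step 3: feed-forward network
        return self.norm3(y + self.ffn(y))

# Step 4 (linear projection + softmax) and step 5 (feeding predictions back in) wrap around
# the stacked decoder layers during generation.
```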

Summary

ACT combines the compression power of CVAEs with the contextual modeling of transformers. By encoding action chunks into a compact latent variable and decoding them under visual and proprioceptive context, ACT can:

  • Predict future behaviors
  • Reconstruct plausible trajectories
  • Generate new action patterns

This makes it a powerful framework for robotics, control, and other sequential decision-making tasks where both memory and adaptability are critical.
