Hankyu Kim · PhysicalAI · 3 min read

Action Chunking with Transformers (ACT)

Action Chunking with Transformers (ACT) combines the representational strength of autoencoders with the contextual modeling of transformers, producing compact latent variables that generate coherent action sequences.

Action Chunking with Transformers (ACT) is not a standard Vision-Language-Action (VLA) model; rather, it is an imitation learning policy trained from scratch that has contributed to the physical AI ecosystem. Its inputs are the robot's current state and camera images used as conditioning, and its output is a chunk of 50 continuous actions.


Autoencoder, Variational Autoencoder, and Conditional VAE

An Autoencoder (AE) compresses input data into a latent representation and then reconstructs it.
It is mainly used for feature extraction, dimensionality reduction, noise reduction, and input structure learning.

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \;\longrightarrow\; \text{decoder} \;\longrightarrow\; \hat{x}$$

Here, $z$ is the compressed latent vector.
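As a concrete reference, a minimal autoencoder can be written in a few lines of PyTorch; the layer sizes below are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input x into a low-dimensional latent vector z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct x_hat from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed latent representation
        x_hat = self.decoder(z)  # reconstruction of the input
        return x_hat, z

# Training minimizes a reconstruction loss, e.g. nn.functional.mse_loss(x_hat, x).
```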

A Variational Autoencoder (VAE) extends the AE by modeling the latent space as a probability distribution rather than a single point.
The encoder produces the parameters $\mu$ and $\sigma$ of a Gaussian distribution, from which a latent vector $z$ is sampled.
This regularization enables smooth and continuous latent spaces, allowing both reconstruction and generation of new samples.

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \sim \mathcal{N}(\mu, \sigma^2) \;\longrightarrow\; \text{decoder} \;\longrightarrow\; \hat{x}$$
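The key change relative to the autoencoder sketch above is the encoder head and the sampling step. A minimal sketch of the reparameterization trick, with dimensions still chosen only for illustration, looks like this:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization: z ~ N(mu, sigma^2)
        return z, mu, logvar

# The loss adds a KL term, KL(N(mu, sigma^2) || N(0, I)), to the reconstruction loss;
# this regularizer is what keeps the latent space smooth and sample-able.
```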

A Conditional VAE (CVAE) further extends the VAE by introducing a condition $y$ (e.g., context, goals).

$$x \;\longrightarrow\; \text{encoder} \;\longrightarrow\; z \sim \mathcal{N}(\mu, \sigma^2) \;\longrightarrow\; \text{decoder (condition } y) \;\longrightarrow\; \hat{x}$$
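The only structural difference from the VAE is where the condition enters. In the simplest formulation, $y$ is concatenated to both the encoder and decoder inputs, as in this small sketch; the `encoder` and `decoder` arguments are hypothetical modules assumed to follow the shapes used above.

```python
import torch

def cvae_forward(encoder, decoder, x, y):
    """Hypothetical helper: encoder returns (z, mu, logvar); decoder maps a vector back to x-space."""
    z, mu, logvar = encoder(torch.cat([x, y], dim=-1))  # q(z | x, y)
    x_hat = decoder(torch.cat([z, y], dim=-1))          # p(x | z, y): generation is steered by y
    return x_hat, mu, logvar
```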

Structure of ACT

ACT employs a CVAE to learn robot policies.

ACT Architecture

(Paper link)

  • Encoder: Takes the current state (e.g., joint positions) together with the demonstrated action sequence and encodes them into a latent distribution $z$ using a transformer encoder.
  • Conditioning: Contextual information such as sensory inputs or goals (here, four 480×640 camera images) conditions the action prediction.
  • Decoder: Uses a transformer to reconstruct or predict future action sequences from the latent variable and the conditioning inputs.

To break this into two parts (a training-time data-flow sketch follows the list):

  1. CVAE Encoder
    Encodes an action sequence into a latent distribution (mean and variance):

    $$\text{CVAE Encoder} \;=\; \text{left transformer encoder}$$
  2. CVAE Decoder
    Samples from the latent space and predicts the action sequence conditioned on context:

    $$\text{CVAE Decoder} \;=\; \text{center transformer encoder} + \text{right transformer decoder}$$
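Putting the two parts together, a high-level sketch of ACT's training-time data flow might look as follows. The module names, shapes, and chunk size are assumptions for illustration; the official implementation differs in detail.

```python
import torch
import torch.nn.functional as F

def act_training_step(cvae_encoder, cvae_decoder, joints, images, action_chunk):
    """
    joints:       (B, state_dim)         current proprioception (joint positions)
    images:       (B, num_cams, C, H, W) camera observations used as the condition
    action_chunk: (B, k, action_dim)     ground-truth future actions (e.g., k = 50)
    """
    # 1) CVAE encoder (left transformer encoder): compress the action sequence
    #    and current state into a latent Gaussian.
    mu, logvar = cvae_encoder(joints, action_chunk)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # 2) CVAE decoder (center transformer encoder + right transformer decoder):
    #    predict the action chunk from z, conditioned on images and the current state.
    pred_chunk = cvae_decoder(z, joints, images)        # (B, k, action_dim)

    # 3) Loss: action reconstruction plus a KL regularizer on the latent.
    recon = F.l1_loss(pred_chunk, action_chunk)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# At inference time, z is set to the prior mean (a zero vector), so only the decoder runs.
```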

Recall: Transformers

Transformers differ from recurrent networks (RNNs, LSTMs) by processing input sequences in parallel while still capturing long-range dependencies.

  • Query (Q): Representation of the current token or state.
  • Key (K): Encoded representations of all tokens.
  • Value (V): Contextual embeddings carrying semantic content.

The attention mechanism computes relevance between queries and keys, then aggregates values accordingly:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This enables selective recall of relevant past information when generating the next action.
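In code, the formula maps almost one-to-one onto a few tensor operations; this sketch omits masking and dropout.

```python
import math
import torch

def attention(Q, K, V):
    """Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v). Returns the attended values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key relevance
    weights = torch.softmax(scores, dim=-1)            # normalize over the keys
    return weights @ V                                 # weighted sum of values
```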


Multi-Headed Attention

Instead of a single attention map, transformers compute multiple attention heads in parallel. Each head captures different relational patterns:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

For example, GPT-3 uses 96 heads, providing highly diverse contextual perspectives.
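A minimal multi-head module could be sketched as below; `d_model = 512` and `num_heads = 8` are the classic Transformer defaults, used here only as example values.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.w_q, self.w_k, self.w_v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.w_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.h, self.d_head).transpose(1, 2)   # (b, h, t, d_head)

    def forward(self, q, k, v):
        Q, K, V = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = torch.softmax(scores, dim=-1) @ V                   # attend in each head
        b, h, t, d = heads.shape
        concat = heads.transpose(1, 2).reshape(b, t, h * d)         # Concat(head_1, ..., head_h)
        return self.w_o(concat)                                     # multiply by W^O
```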


Encoder Layer

The encoder learns how tokens relate to one another within a sequence. Each layer consists of:

  1. Input Embedding – mapping inputs to vectors
  2. Positional Encoding – preserving order information
  3. Multi-Head Self-Attention – capturing inter-token dependencies
  4. Feed-Forward Network (FFN) – nonlinear feature transformation

This stack is repeated across multiple layers, producing progressively higher-level representations of the input sequence.
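A single encoder layer built from PyTorch primitives might be sketched as follows; the post-norm placement and dimensions are implementation choices, not prescriptions.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                                 # x: (batch, seq_len, d_model)
        # Step 3: multi-head self-attention with a residual connection and layer norm
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Step 4: position-wise feed-forward network with a residual connection and layer norm
        return self.norm2(x + self.ffn(x))

# Steps 1-2 (input embedding and positional encoding) are applied once before the first layer,
# and the layer above is then stacked N times.
```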


Decoder Layer

The decoder generates output sequences step by step. Each layer consists of the components below (see the sketch after this list):

  1. Masked Multi-Head Self-Attention – prevents the model from attending to future tokens during generation
  2. Cross-Attention – connects encoder outputs to the decoder’s state
  3. Feed-Forward Network – nonlinear transformation
  4. Linear Projection + Softmax – produces the probability distribution for the next token
  5. Autoregressive Generation – predicted tokens are fed back until an end-of-sequence token is produced
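For completeness, a matching decoder-layer sketch under the same assumptions; the causal mask implements step 1, while steps 4-5 live outside the layer stack.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, memory):                 # y: decoder tokens, memory: encoder outputs
        t = y.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=y.device), diagonal=1)
        # Step 1: masked self-attention (True entries block attention to future positions)
        y = self.norm1(y + self.self_attn(y, y, y, attn_mask=causal, need_weights=False)[0])
        # Step 2: cross-attention, queries from the decoder and keys/values from the encoder
        y = self.norm2(y + self.cross_attn(y, memory, memory, need_weights=False)[0])
        # Step 3: feed-forward network
        return self.norm3(y + self.ffn(y))

# Step 4 (linear projection + softmax) and step 5 (feeding predictions back in) wrap around
# the stacked decoder layers during generation.
```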

Summary

ACT combines the compression power of CVAEs with the contextual modeling of transformers. By encoding action chunks into a compact latent variable and decoding them under visual and proprioceptive context, ACT can:

  • Predict future behaviors
  • Reconstruct plausible trajectories
  • Generate new action patterns

This makes it a powerful framework for robotics, control, and other sequential decision-making tasks where both memory and adaptability are critical.
