PhysicalAI · 2 min read

Understanding Diffusion Policy: A Simple Guide

Diffusion-based models such as DDPM and their use in policy learning rely on denoising mechanisms, UNet architectures, and structured action representations to capture complex sequential behaviors.

DDPM: Denoising Diffusion Probabilistic Models

The core idea of DDPM is to gradually corrupt data with Gaussian noise and then train a model to reverse this process, recovering the original signal.

$$x_0 \;\longrightarrow\; x_t \;\longrightarrow\; x_T \sim \mathcal{N}(0, I)$$

where $x_0$ is the clean sample, $x_t$ is the noisy version at timestep $t$, and $x_T$ approaches pure Gaussian noise.

The denoising model $\epsilon_\theta$ learns to predict the noise added at each step, so that the reverse process can iteratively recover $x_0$ from pure noise.
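
For reference, the standard DDPM forward and reverse steps (with noise schedule $\beta_t$, $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$) are:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big)$$

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\Big(x_{t-1};\ \tfrac{1}{\sqrt{\alpha_t}}\big(x_t - \tfrac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big),\ \sigma_t^2 I\Big)$$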

The same denoiser can be repurposed for control: feed it a noisy action trajectory instead of a noisy image, condition on camera images instead of a text instruction, and swap the text encoder for a ResNet-18 image encoder. With these substitutions, the model becomes a diffusion policy.

(Figure: DDPM)


UNet Architecture

The noise predictor network in DDPM is typically a UNet, a convolutional encoder–decoder with skip connections.

  • Encoder: progressively downsamples the input to extract hierarchical features.
  • Decoder: upsamples and reconstructs, combining semantic and spatial information.

This structure is effective for tasks requiring high-resolution outputs, such as segmentation, super-resolution, and diffusion-based generation.
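
To make the shape of the architecture concrete, here is a minimal sketch of a UNet-style 1D noise predictor in PyTorch. Everything here (`TinyUNet1D`, the channel widths) is illustrative, and a real DDPM UNet would also condition on the diffusion timestep $t$, which is omitted for brevity:

```python
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    """Minimal UNet-style noise predictor for 1D sequences (illustrative only)."""
    def __init__(self, channels=2, hidden=64):
        super().__init__()
        # Encoder: extract features, then downsample
        self.enc1 = nn.Sequential(nn.Conv1d(channels, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.Conv1d(hidden, hidden * 2, 4, stride=2, padding=1)
        # Bottleneck
        self.mid = nn.Sequential(nn.Conv1d(hidden * 2, hidden * 2, 3, padding=1), nn.ReLU())
        # Decoder: upsample and fuse encoder features via a skip connection
        self.up = nn.ConvTranspose1d(hidden * 2, hidden, 4, stride=2, padding=1)
        self.dec1 = nn.Sequential(nn.Conv1d(hidden * 2, hidden, 3, padding=1), nn.ReLU())
        self.out = nn.Conv1d(hidden, channels, 1)

    def forward(self, x):
        h1 = self.enc1(x)              # (B, hidden, T)
        h2 = self.mid(self.down(h1))   # (B, hidden*2, T/2)
        u = self.up(h2)                # (B, hidden, T)
        u = torch.cat([u, h1], dim=1)  # skip connection: decoder + encoder features
        return self.out(self.dec1(u))  # predicted noise, same shape as the input

noise_pred = TinyUNet1D()(torch.randn(8, 2, 16))  # e.g. a 16-step, 2-D action sequence
```

The skip connection (`torch.cat`) is what lets the decoder combine the encoder's fine spatial detail with the bottleneck's semantic features.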

unet

The training objective often uses mean squared error (MSE) between predicted and true noise:

$$L = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{y}_i \big)^2$$

In the DDPM case, $y_i$ is the true noise $\epsilon$ added during the forward process and $\hat{y}_i$ is the model's prediction $\epsilon_\theta(x_t, t)$.
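
As a sketch of how this objective is used in practice, the following PyTorch training step corrupts a clean sample at a random timestep and regresses the added noise (the schedule values and the `model(x_t, t)` interface are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# alpha-bar for a linear beta schedule over 100 steps (illustrative values)
betas = torch.linspace(1e-4, 0.02, 100)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """One DDPM training step: corrupt x0 at a random timestep, predict the noise."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))          # random timestep per sample
    a = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))   # broadcast alpha_bar_t to x0's shape
    eps = torch.randn_like(x0)                          # true noise
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps        # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)               # MSE between predicted and true noise

# Usage with any noise predictor eps_theta(x_t, t):
#   loss = ddpm_loss(eps_theta, clean_actions); loss.backward()
```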

Features

Diffusion Policy applies the denoising diffusion process to policy learning from demonstrations. Instead of reconstructing images, the model generates action sequences.

Key characteristics:

  • Policy learning from demonstrations: imitation learning.
  • Supervised regression: a supervised learning setup where the output consists of continuous values.
  • Sequential correlation: captures temporal dependencies in robot actions.
  • Multimodal distribution: allows sampling diverse but plausible actions.

(Figure: the Push-T task)

  • Action representations: how the policy outputs are structured (e.g., Gaussian mixtures, quantized actions).
  • Implicit policies: leverage the diffusion process to generate actions step by step, without explicit reward optimization.

Structure of Diffusion Policy

(Figure: Diffusion Policy architecture)

(Paper: Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", 2023)

The model takes multi-view camera observations (top camera, wrist camera) and the robot pose as input.
The denoising network $\epsilon_\theta$ is implemented as either a ResNet-based CNN or a time-series diffusion Transformer.
The output is an action sequence $A_t$ over the prediction horizon.
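
As a rough sketch of how those inputs could condition the denoiser, the following uses torchvision's ResNet-18 as the vision backbone; the shared backbone, feature dimensions, and `ObsEncoder` wiring are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ObsEncoder(nn.Module):
    """Encodes multi-view images + robot pose into one conditioning vector."""
    def __init__(self, pose_dim=7, out_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # drop the classifier head -> 512-d features
        self.backbone = backbone               # shared across views (a sketch-level choice)
        self.proj = nn.Linear(512 * 2 + pose_dim, out_dim)  # top view + wrist view + pose

    def forward(self, top_img, wrist_img, pose):
        f_top = self.backbone(top_img)         # (B, 512)
        f_wrist = self.backbone(wrist_img)     # (B, 512)
        return self.proj(torch.cat([f_top, f_wrist, pose], dim=-1))

cond = ObsEncoder()(torch.randn(4, 3, 224, 224),
                    torch.randn(4, 3, 224, 224),
                    torch.randn(4, 7))         # (4, 256) conditioning vector
```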

Key aspects:

  • Score function: the network effectively learns the gradient of the log data density (the score), which guides the denoising process.
  • Stochastic Langevin dynamics: applied at inference time to sample from the learned policy, improving stability compared to direct energy-based approaches (a minimal sampling loop is sketched below).
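
Below is a minimal sketch of DDPM-style sampling applied to actions, assuming a trained noise predictor `eps_model(a, t, cond)` and a conditioning vector `cond` such as the one produced above; the names and the $\sigma_t = \sqrt{\beta_t}$ choice are illustrative:

```python
import torch

@torch.no_grad()
def sample_actions(eps_model, cond, horizon, action_dim, betas):
    """Iteratively denoise Gaussian noise into an action sequence (DDPM sampling)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, action_dim, horizon)            # start from pure noise A_T
    for t in reversed(range(len(betas))):
        eps = eps_model(a, torch.tensor([t]), cond)    # predict the noise at step t
        mean = (a - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(a) if t > 0 else 0.0  # no noise at the final step
        a = mean + betas[t].sqrt() * noise             # stochastic (Langevin-like) update
    return a                                           # denoised action sequence A_t
```

Because each call starts from fresh noise, repeated sampling yields diverse but plausible trajectories, which is exactly the multimodality discussed above.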

Summary

  • DDPM: transforms white noise into structured outputs by iterative denoising.
  • UNet: the backbone architecture for noise prediction.
  • Diffusion Policy: applies diffusion modeling to sequential decision-making, enabling robust imitation of complex behaviors.

Together, these approaches provide a powerful framework for learning policies in robotics and control using probabilistic generative models.
