PhysicalAI · 2 min read

Understanding Diffusion Policy: A Simple Guide

Diffusion-based models such as DDPM and their use in policy learning rely on denoising mechanisms, UNet architectures, and structured action representations to capture complex sequential behaviors.

DDPM: Denoising Diffusion Probabilistic Models

The core idea of DDPM is to gradually corrupt data with Gaussian noise and then train a model to reverse this process, recovering the original signal.

$$x_0 \;\longrightarrow\; x_t \;\longrightarrow\; x_T \sim \mathcal{N}(0, I)$$

where $x_0$ is the clean sample, $x_t$ is the noisy version at timestep $t$, and $x_T$ approaches pure Gaussian noise.

The denoising model $\epsilon_\theta$ learns to predict the noise added at each step, so that the reverse process can iteratively recover $x_0$ from pure noise.
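
For reference, the standard DDPM forward and reverse steps (with noise schedule $\beta_t$, $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$) are:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\big)$$

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\Big(x_{t-1};\ \tfrac{1}{\sqrt{\alpha_t}}\big(x_t - \tfrac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big),\ \sigma_t^2 I\Big)$$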

The same denoiser can be repurposed for control: feed it a noisy action trajectory instead of a noisy image, condition on camera images instead of a text instruction, and swap the text encoder for a ResNet-18 image encoder. With these substitutions, the model becomes a diffusion policy.

(Figure: DDPM)


UNet Architecture

The noise predictor network in DDPM is typically a UNet, a convolutional encoder–decoder with skip connections.

  • Encoder: progressively downsamples the input to extract hierarchical features.
  • Decoder: upsamples and reconstructs, combining semantic and spatial information.

This structure is effective for tasks requiring high-resolution outputs, such as segmentation, super-resolution, and diffusion-based generation.
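
To make the shape of the architecture concrete, here is a minimal sketch of a UNet-style 1D noise predictor in PyTorch. Everything here (`TinyUNet1D`, the channel widths) is illustrative, and a real DDPM UNet would also condition on the diffusion timestep $t$, which is omitted for brevity:

```python
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    """Minimal UNet-style noise predictor for 1D sequences (illustrative only)."""
    def __init__(self, channels=2, hidden=64):
        super().__init__()
        # Encoder: extract features, then downsample
        self.enc1 = nn.Sequential(nn.Conv1d(channels, hidden, 3, padding=1), nn.ReLU())
        self.down = nn.Conv1d(hidden, hidden * 2, 4, stride=2, padding=1)
        # Bottleneck
        self.mid = nn.Sequential(nn.Conv1d(hidden * 2, hidden * 2, 3, padding=1), nn.ReLU())
        # Decoder: upsample and fuse encoder features via a skip connection
        self.up = nn.ConvTranspose1d(hidden * 2, hidden, 4, stride=2, padding=1)
        self.dec1 = nn.Sequential(nn.Conv1d(hidden * 2, hidden, 3, padding=1), nn.ReLU())
        self.out = nn.Conv1d(hidden, channels, 1)

    def forward(self, x):
        h1 = self.enc1(x)              # (B, hidden, T)
        h2 = self.mid(self.down(h1))   # (B, hidden*2, T/2)
        u = self.up(h2)                # (B, hidden, T)
        u = torch.cat([u, h1], dim=1)  # skip connection: decoder + encoder features
        return self.out(self.dec1(u))  # predicted noise, same shape as the input

noise_pred = TinyUNet1D()(torch.randn(8, 2, 16))  # e.g. a 16-step, 2-D action sequence
```

The skip connection (`torch.cat`) is what lets the decoder combine the encoder's fine spatial detail with the bottleneck's semantic features.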

unet

The training objective often uses mean squared error (MSE) between predicted and true noise:

$$L = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{y}_i \big)^2$$

In the DDPM case, $y_i$ is the true noise $\epsilon$ added during the forward process and $\hat{y}_i$ is the model's prediction $\epsilon_\theta(x_t, t)$.
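
As a sketch of how this objective is used in practice, the following PyTorch training step corrupts a clean sample at a random timestep and regresses the added noise (the schedule values and the `model(x_t, t)` interface are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# alpha-bar for a linear beta schedule over 100 steps (illustrative values)
betas = torch.linspace(1e-4, 0.02, 100)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """One DDPM training step: corrupt x0 at a random timestep, predict the noise."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))          # random timestep per sample
    a = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))   # broadcast alpha_bar_t to x0's shape
    eps = torch.randn_like(x0)                          # true noise
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps        # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)               # MSE between predicted and true noise

# Usage with any noise predictor eps_theta(x_t, t):
#   loss = ddpm_loss(eps_theta, clean_actions); loss.backward()
```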

Features

Diffusion Policy applies the denoising diffusion process to policy learning from demonstrations. Instead of reconstructing images, the model generates action sequences.

Key characteristics:

  • Policy learning from demonstrations: imitation learning.
  • Supervised regression: a supervised learning setup where the output consists of continuous values.
  • Sequential correlation: captures temporal dependencies in robot actions.
  • Multimodal distribution: allows sampling diverse but plausible actions.

(Figure: the Push-T task)

  • Action representations: how the policy outputs are structured (e.g., Gaussian mixtures, quantized actions).
  • Implicit policies: leverage the diffusion process to generate actions step by step, without explicit reward optimization.

Structure of Diffusion Policy

(Figure: Diffusion Policy architecture)

(Paper: Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", 2023)

The model takes multi-view camera observations (top camera, wrist camera) and the robot pose as input.
The denoising network $\epsilon_\theta$ is implemented as either a ResNet-based CNN or a time-series diffusion Transformer.
The output is an action sequence $A_t$ over the prediction horizon.
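
As a rough sketch of how those inputs could condition the denoiser, the following uses torchvision's ResNet-18 as the vision backbone; the shared backbone, feature dimensions, and `ObsEncoder` wiring are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ObsEncoder(nn.Module):
    """Encodes multi-view images + robot pose into one conditioning vector."""
    def __init__(self, pose_dim=7, out_dim=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # drop the classifier head -> 512-d features
        self.backbone = backbone               # shared across views (a sketch-level choice)
        self.proj = nn.Linear(512 * 2 + pose_dim, out_dim)  # top view + wrist view + pose

    def forward(self, top_img, wrist_img, pose):
        f_top = self.backbone(top_img)         # (B, 512)
        f_wrist = self.backbone(wrist_img)     # (B, 512)
        return self.proj(torch.cat([f_top, f_wrist, pose], dim=-1))

cond = ObsEncoder()(torch.randn(4, 3, 224, 224),
                    torch.randn(4, 3, 224, 224),
                    torch.randn(4, 7))         # (4, 256) conditioning vector
```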

Key aspects:

  • Score function: the network effectively learns the gradient of the log data density (the score), which guides the denoising process.
  • Stochastic Langevin dynamics: applied at inference time to sample from the learned policy, improving stability compared to direct energy-based approaches (a minimal sampling loop is sketched below).
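
Below is a minimal sketch of DDPM-style sampling applied to actions, assuming a trained noise predictor `eps_model(a, t, cond)` and a conditioning vector `cond` such as the one produced above; the names and the $\sigma_t = \sqrt{\beta_t}$ choice are illustrative:

```python
import torch

@torch.no_grad()
def sample_actions(eps_model, cond, horizon, action_dim, betas):
    """Iteratively denoise Gaussian noise into an action sequence (DDPM sampling)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, action_dim, horizon)            # start from pure noise A_T
    for t in reversed(range(len(betas))):
        eps = eps_model(a, torch.tensor([t]), cond)    # predict the noise at step t
        mean = (a - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(a) if t > 0 else 0.0  # no noise at the final step
        a = mean + betas[t].sqrt() * noise             # stochastic (Langevin-like) update
    return a                                           # denoised action sequence A_t
```

Because each call starts from fresh noise, repeated sampling yields diverse but plausible trajectories, which is exactly the multimodality discussed above.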

Summary

  • DDPM: transforms white noise into structured outputs by iterative denoising.
  • UNet: the backbone architecture for noise prediction.
  • Diffusion Policy: applies diffusion modeling to sequential decision-making, enabling robust imitation of complex behaviors.

Together, these approaches provide a powerful framework for learning policies in robotics and control using probabilistic generative models.
