Diffusion Imitation from Observation

National Taiwan University

Abstract

Learning from observation (LfO) aims to imitate experts by learning from state-only demonstrations without requiring action labels. Existing adversarial imitation learning approaches learn a generator agent policy to produce state transitions that are indistinguishable from expert transitions to a discriminator trained to classify agent and expert state transitions. Despite the simplicity of this formulation, these methods are often sensitive to hyperparameters and brittle to train. Motivated by the recent success of diffusion models in generative modeling, we propose to integrate a diffusion model into the adversarial imitation learning from observation framework. Specifically, we employ a diffusion model to capture expert and agent transitions by generating the next state given the current state. We then reformulate the learning objective to train the diffusion model as a binary classifier and use it to provide "realness" rewards for policy learning. Our proposed framework, Diffusion Imitation from Observation (DIFO), demonstrates superior performance in various continuous control domains, including navigation, locomotion, manipulation, and games.

Framework Overview

DIFO model framework

We propose Diffusion Imitation from Observation (DIFO), a novel adversarial imitation learning from observation framework employing a conditional diffusion model. (a) Learning diffusion discriminator. In the discriminator step, the diffusion model learns to model a state transition (\(\mathbf{s}, \mathbf{s}'\)) by conditioning on the current state \(\mathbf{s}\) and generating the next state \(\mathbf{s}'\). By additionally conditioning on binary expert and agent labels (\(c_E/c_A\)), we construct a diffusion discriminator that distinguishes expert from agent transitions by leveraging the single-step denoising loss as a likelihood approximation. (b) Learning policy with diffusion reward. In the policy step, we optimize the policy with reinforcement learning using rewards computed from the diffusion discriminator's output \(\log(1 - \mathcal{D}_{\phi}(\mathbf{s},\mathbf{s}'))\).
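As a concrete illustration, the following PyTorch-style sketch shows one way such a diffusion discriminator and reward could be implemented. The noise-prediction network interface `eps_model(noisy_next_state, state, label, t)`, the construction of \(\mathcal{D}_{\phi}\) by comparing the two class-conditional denoising losses, and the \(-\log(1 - \mathcal{D}_{\phi})\) reward transform are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_model, s, s_next, label, t, alphas_cumprod):
    """Single-step denoising MSE for s' conditioned on the current state s and a class label.

    A lower loss indicates the conditional diffusion model assigns a higher
    (approximate) likelihood to the transition (s, s') under that label.
    """
    noise = torch.randn_like(s_next)
    a_bar = alphas_cumprod[t].view(-1, 1)             # \bar{alpha}_t for each sample
    noisy_s_next = a_bar.sqrt() * s_next + (1 - a_bar).sqrt() * noise
    eps_pred = eps_model(noisy_s_next, s, label, t)   # predict the injected noise
    return F.mse_loss(eps_pred, noise, reduction="none").mean(dim=-1)

def diffusion_discriminator(eps_model, s, s_next, t, alphas_cumprod):
    """D_phi(s, s'): probability that a transition is expert-like.

    Hypothetical construction: compare the denoising losses obtained under the
    expert label c_E and the agent label c_A; a relatively lower expert-conditioned
    loss maps to a higher probability of being an expert transition.
    """
    c_E = torch.ones(s.shape[0], dtype=torch.long)    # expert label
    c_A = torch.zeros(s.shape[0], dtype=torch.long)   # agent label
    loss_E = denoising_loss(eps_model, s, s_next, c_E, t, alphas_cumprod)
    loss_A = denoising_loss(eps_model, s, s_next, c_A, t, alphas_cumprod)
    return torch.sigmoid(loss_A - loss_E)

def difo_reward(eps_model, s, s_next, t, alphas_cumprod, eps=1e-6):
    """GAIL-style reward derived from log(1 - D); the sign convention is an assumption."""
    d = diffusion_discriminator(eps_model, s, s_next, t, alphas_cumprod)
    return -torch.log(1.0 - d + eps)
```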

Environment

Environment Description

We experiment in various continuous control domains, including navigation, locomotion, manipulation, and games.

(a) PointMaze: A navigation task for a 2-DoF agent in a medium maze.

(b) AntMaze: A locomotion and navigation task where a quadruped ant navigates from an initial position to a randomly sampled goal by controlling the torque of its legs.

(c) FetchPush: A 7-DoF Fetch robot arm is tasked with pushing a block to a randomly sampled target position on a table.

(d) AdroitDoor: A manipulation task to undo a latch and swing open a randomly placed door.

(e) Walker: A locomotion task involving a 6-DoF Walker2D in MuJoCo.

(f) OpenMicrowave: A manipulation task controlling a 9-DoF Franka robot arm to open a microwave door.

(g) CarRacing: An image-based control task where a car completes randomly generated tracks as quickly as possible.

(h) CloseDrawer: An image-based manipulation task controlling a Sawyer robot arm to close a drawer.

Learning performance and efficiency


Our proposed method, DIFO, consistently outperforms or matches the best-performing baselines across various tasks, demonstrating the effectiveness of integrating a conditional diffusion model into the adversarial imitation learning (AIL) framework. In environments such as AntMaze, AdroitDoor, and CarRacing, DIFO converges faster, modeling expert behavior efficiently in high-dimensional spaces while providing stable training with low variance. Compared to behavior cloning (BC), which struggles due to its reliance on a fixed expert dataset and covariate shift, DIFO benefits from online interactions to generate transition-level rewards. Unlike Optimal Transport (OT), which struggles with diverse trajectories, DIFO excels by capturing similarities at the transition level.

The DIFO-Uncond and DIFO-NA variants perform poorly except in limited cases such as CloseDrawer, emphasizing the necessity of agent-environment interactions to avoid policy exploitation and instability. Evaluating all methods on multiple tasks shows that DIFO delivers more stable and faster learning across the board.

Data efficiency


We vary the number of available expert demonstrations in AntMaze. DIFO consistently outperforms the other methods as the number of expert demonstrations decreases, highlighting its data efficiency.

Generating data using diffusion models


We take a trained diffusion discriminator of DIFO and autoregressively generate a sequence of next states, starting from an initial state sampled from the expert dataset. We visualize four pairs of expert trajectories and the corresponding generated trajectories above.

The results show that our diffusion model can accurately generate trajectories similar to those of the expert. Notably, the diffusion model can also generate trajectories that differ from the expert trajectories while still completing the task; in the example on the bottom right of the Figure above, the diffusion model produces even shorter trajectories than the scripted expert policy.
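Below is a minimal sketch of how such autoregressive rollouts could be produced with a state-conditioned DDPM sampler. The `eps_model` interface, the noise-schedule tensors (`betas`, `alphas`, `alphas_cumprod`), and conditioning on a fixed expert label during generation are assumptions for illustration, not the paper's exact sampling code.

```python
import torch

@torch.no_grad()
def sample_next_state(eps_model, s, label, betas, alphas, alphas_cumprod):
    """Reverse-diffuse a next state s' conditioned on the current state s (standard DDPM sampling)."""
    x = torch.randn_like(s)                                # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((s.shape[0],), t, dtype=torch.long)
        eps = eps_model(x, s, label, t_batch)              # predicted noise at step t
        a, a_bar, b = alphas[t], alphas_cumprod[t], betas[t]
        mean = (x - b / (1 - a_bar).sqrt() * eps) / a.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + b.sqrt() * noise
    return x

@torch.no_grad()
def rollout(eps_model, s0, horizon, label, betas, alphas, alphas_cumprod):
    """Autoregressively generate a state trajectory from an initial expert state s0."""
    states = [s0]
    for _ in range(horizon):
        s_next = sample_next_state(eps_model, states[-1], label, betas, alphas, alphas_cumprod)
        states.append(s_next)
    return torch.stack(states, dim=1)                      # (batch, horizon + 1, state_dim)
```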

Visualized learned reward functions


Reward function visualization and generated distribution on SINE. (a) The expert state transition distribution. (b) The state transition distribution generated by the DIFO diffusion model. (c-d) The reward functions learned by GAIfO and DIFO, respectively. DIFO produces smoother rewards outside of the expert distribution, facilitating policy learning.

Ablation study

\(\lambda_{MSE}\) and \(\lambda_{BCE}\)

Ablation study on \(\lambda_{MSE}\) and \(\lambda_{BCE}\)

We hypothesize that both \(\lambda_{MSE}\) and \(\lambda_{BCE}\) are important for efficient learning. To examine their effect and verify this hypothesis, we vary the ratio of \(\lambda_{MSE}\) to \(\lambda_{BCE}\) in PointMaze and Walker, including \(\lambda_{BCE}\) only and \(\lambda_{MSE}\) only, i.e., \(\lambda_{MSE} = 0\) and \(\lambda_{BCE} = 0\), respectively. As shown in the Figure above, the results emphasize the significance of including both terms, since they enable the model to simultaneously model expert behavior (\(\lambda_{MSE}\)) and perform binary classification (\(\lambda_{BCE}\)). Without \(\lambda_{MSE}\), performance slightly decreases because the model no longer explicitly models expert behavior. Without \(\lambda_{BCE}\), the model fails to learn because it does not utilize negative samples, i.e., agent data. Moreover, when we vary the ratio of \(\lambda_{MSE}\) to \(\lambda_{BCE}\), DIFO maintains stable performance, demonstrating that it is relatively insensitive to hyperparameter variations.
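For concreteness, here is a minimal sketch of how the two weighted terms could be combined in a single discriminator update, reusing the hypothetical `denoising_loss` and `diffusion_discriminator` helpers sketched in the framework section above; the batching interface, the per-batch timestep sampling, and the label convention (expert = 1, agent = 0) are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(eps_model, expert_batch, agent_batch, alphas_cumprod,
                       lambda_mse=1.0, lambda_bce=1.0):
    """Weighted objective: expert behavior modeling (MSE term) + binary classification (BCE term)."""
    s_E, s_next_E = expert_batch
    s_A, s_next_A = agent_batch
    T = alphas_cumprod.shape[0]
    t_E = torch.randint(0, T, (s_E.shape[0],))   # random denoising timesteps per transition
    t_A = torch.randint(0, T, (s_A.shape[0],))

    # Denoising MSE on expert transitions: model the expert's next-state distribution.
    c_E = torch.ones(s_E.shape[0], dtype=torch.long)
    mse_term = denoising_loss(eps_model, s_E, s_next_E, c_E, t_E, alphas_cumprod).mean()

    # BCE on discriminator outputs: expert transitions labeled 1, agent transitions labeled 0.
    d_E = diffusion_discriminator(eps_model, s_E, s_next_E, t_E, alphas_cumprod)
    d_A = diffusion_discriminator(eps_model, s_A, s_next_A, t_A, alphas_cumprod)
    bce_term = F.binary_cross_entropy(d_E, torch.ones_like(d_E)) + \
               F.binary_cross_entropy(d_A, torch.zeros_like(d_A))

    return lambda_mse * mse_term + lambda_bce * bce_term
```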

Number of samples for reward computation

Ablation study on the number of samples for reward computation

To investigate the robustness of our rewards, we conduct experiments with varying numbers of denoising timestep samples in PointMaze and Walker. To compute rewards, we take the mean of the losses computed from multiple samples, i.e., multiple denoising timesteps t. As presented in the Figure above, the performance of DIFO is stable under different numbers of samples. As a result, we use a single denoising timestep sample to compute the reward for the best efficiency.
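A small sketch of this averaging is given below, again reusing the hypothetical `diffusion_discriminator` helper from the framework section; the uniform timestep sampling and the reward transform are assumptions, and `num_samples=1` corresponds to the single-sample setting used in practice.

```python
import torch

def multi_sample_reward(eps_model, s, s_next, alphas_cumprod, num_samples=1, eps=1e-6):
    """Average the diffusion-discriminator reward over several randomly sampled timesteps t."""
    T = alphas_cumprod.shape[0]
    rewards = []
    for _ in range(num_samples):
        t = torch.randint(0, T, (s.shape[0],))    # one random denoising timestep per transition
        d = diffusion_discriminator(eps_model, s, s_next, t, alphas_cumprod)
        rewards.append(-torch.log(1.0 - d + eps))
    return torch.stack(rewards, dim=0).mean(dim=0)  # mean over the sampled timesteps
```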

BibTeX

@inproceedings{huang2024DIFO,
  author    = {Huang, Bo-Ruei and Yang, Chun-Kai and Lai, Chun-Mao and Wu, Dai-Jie and Sun, Shao-Hua},
  title     = {Diffusion Imitation from Observation},
  booktitle = {38th Conference on Neural Information Processing Systems (NeurIPS 2024)},
  year      = {2024},
}