Planning with Diffusion for Flexible Behavior Synthesis

Diffusion models are finding their way into more machine learning domains. This paper introduces a framework for applying them to model-based reinforcement learning.

De-noising diffusion generative models have recently gained a lot of media attention thanks to the release of powerful text-to-image models such as DALL-E 2, Midjourney and Stable Diffusion. In our series on diffusion models we have discussed their key mathematical details, analysed recent techniques for accelerating inference, and mentioned their use in other research areas such as molecular conformation generation.

What if we could use the effective information compression and generation of diffusion models for trajectory optimisation in model-based reinforcement learning?

Figure 1: Diffuser samples plans by iteratively denoising two-dimensional arrays consisting of a variable number of state-action pairs. A small receptive field constrains the model to only enforce local consistency during a single denoising step. By composing many denoising steps together, local consistency can drive global coherence of a sampled plan. An optional guide function J can be used to bias plans toward those optimizing a test-time objective or satisfying a set of constraints.

Model-based RL agents plan their actions according to a learned model of the environment, with the prediction of the best trajectory typically coming from classical optimisation routines. One major drawback is that trajectory optimisers tend to exploit inaccuracies in the learned model, which can result in plans that are suboptimal or that pass through unreachable states. On the positive side, model-based approaches are typically more robust to sparse rewards and allow for reward-agnostic training, which is important for multitasking [Doy02M].

[Jan22P] introduces the Diffuser, a technique that uses diffusion models for model-based RL. Before training, several sequences of states and actions are sampled from the environment (this can be done at random or according to some pre-determined criterion). Then, noise is progressively added to each state and action until a certain (arbitrary) level of noise is reached. Note that the addition of noise typically results in transitions that may not be permitted by environment dynamics.
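The noising step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard DDPM closed-form forward process with a linear noise schedule; the schedule, the array layout (one row per time step, states followed by actions) and all function names are our own assumptions and are not taken from [Jan22P].

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Per-step noise variances (assumed linear schedule, for illustration only)."""
    return np.linspace(beta_start, beta_end, num_steps)


def noise_trajectory(trajectory, step, betas, rng):
    """Sample the noised trajectory at diffusion step `step` in closed form.

    trajectory: array of shape (horizon, state_dim + action_dim).
    Returns the noised trajectory and the Gaussian noise that was mixed in.
    """
    alpha_bar = np.cumprod(1.0 - betas)[step]
    noise = rng.standard_normal(trajectory.shape)
    noisy = np.sqrt(alpha_bar) * trajectory + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise


rng = np.random.default_rng(0)
horizon, state_dim, action_dim = 32, 4, 2

# Stand-in for a trajectory collected from the environment.
trajectory = rng.uniform(-1.0, 1.0, size=(horizon, state_dim + action_dim))

betas = linear_beta_schedule(num_steps=100)
noisy, noise = noise_trajectory(trajectory, step=99, betas=betas, rng=rng)
```

Because the noise is added independently to every state and action, consecutive rows of `noisy` generally no longer form a transition that the environment could produce, which is exactly what the de-noising model has to repair.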

Given all these (noisy) trajectories, the Diffuser is trained to recover (i.e., progressively de-noise) the original (non-noisy) paths. In de-noising the trajectories, the agent learns the possible transitions state+action → next state that can be realised under environment dynamics. This is done in an iterative way by looking at local sequences of actions and states (the “local receptive field”, see Figure 1).
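For reference, the de-noising objective can be written in the simplified form that is standard for DDPM-style models. The notation here is ours: \tau^{0} is a clean trajectory, \tau^{i} its noised version at diffusion step i, \bar{\alpha}_i the cumulative noise-schedule coefficient, and \epsilon_\theta the noise-prediction network. As far as we can tell this matches the objective used in [Jan22P], but it should be read as a sketch and not as a quotation.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{i,\,\epsilon,\,\tau^{0}}
    \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(\tau^{i}, i\right) \right\rVert^{2} \right],
\qquad
\tau^{i} = \sqrt{\bar{\alpha}_i}\,\tau^{0} + \sqrt{1 - \bar{\alpha}_i}\,\epsilon,
\quad
\epsilon \sim \mathcal{N}(0, I)
```

Note that no reward term appears: the network is only asked to predict the noise that was added to the trajectory.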

During this phase no reward is specified. As is typical in model-based RL, model training focuses only on learning to de-noise the trajectories towards realisable sequences of states and actions, and it does not aim to maximise any reward.

(Learned long-horizon planning) Diffuser’s learned planning procedure does not suffer from the myopic failure modes common to shooting algorithms and is able to plan over long horizons with sparse reward.

So, what happens at inference? Given a (smooth) reward function (which maps states and actions to scalars), the Diffuser can iteratively de-noise a trajectory towards both higher rewards and environment-compatible dynamics. To show that this is possible, the authors rely on an important property of diffusion models, which is also the basis of classifier-guided sampling and text-conditional image generation [Dha21D].
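A single reward-guided de-noising step can be sketched as follows. The idea, borrowed from classifier guidance, is to shift the mean proposed by the de-noising model along the gradient of the reward before sampling. The function names, the scalar variance, the toy quadratic reward and the identity "denoiser" are all hypothetical placeholders chosen for illustration; skipping the noise at the final step follows the usual DDPM convention and is likewise an assumption.

```python
import numpy as np


def guided_denoising_step(trajectory, step, denoiser_mean, reward_gradient,
                          variance, guide_scale, rng):
    """One reverse-diffusion step whose mean is nudged towards higher reward.

    denoiser_mean(trajectory, step) -> mean of the de-noised trajectory
    reward_gradient(trajectory)     -> gradient of the reward J w.r.t. the trajectory
    """
    mean = denoiser_mean(trajectory, step)
    # Shift the mean along the reward gradient, scaled by the step variance.
    shifted_mean = mean + guide_scale * variance * reward_gradient(mean)
    if step == 0:
        # Assumed DDPM convention: no noise is added at the final step.
        return shifted_mean
    return shifted_mean + np.sqrt(variance) * rng.standard_normal(trajectory.shape)


# Toy setup: the reward J = -0.5 * ||trajectory - target||^2 prefers plans near `target`.
target = np.ones((8, 3))


def toy_denoiser_mean(trajectory, step):
    # Placeholder for a trained model: returns its input unchanged.
    return trajectory


def toy_reward_gradient(trajectory):
    # Analytic gradient of the toy quadratic reward.
    return target - trajectory


rng = np.random.default_rng(0)
plan = np.zeros((8, 3))
guided = guided_denoising_step(plan, 0, toy_denoiser_mean, toy_reward_gradient,
                               variance=0.1, guide_scale=2.0, rng=rng)
```

In this toy example the guided plan moves a fraction `guide_scale * variance` of the way towards `target`. With a trained model in place of the placeholder, the same step also pulls the plan towards transitions the environment allows.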

The biggest benefit of this process is that the entire trajectory is optimised at once, instead of one step at a time sequentially (as is typically done in model-free RL). This makes planning more robust to long horizons and sparse rewards.

Table 1: Comparison of the Diffuser to model-free RL agents in the task of navigating a maze given different end points (higher score is better).

The authors also give further comparisons to model-free and model-based agents in other environments, e.g., block stacking with a robotic arm and the D4RL locomotion benchmarks, where the Diffuser is shown to perform comparably to the current best model-based methods.

(Planning as inpainting) Plans are generated in the Maze2D environment by sampling trajectories consistent with a specified start and goal condition. The remaining states are “inpainted” by the denoising process.
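The inpainting idea can be illustrated with a small helper that overwrites the constrained states of a plan, which in a full sampler would be called after every de-noising step. This is a sketch under our own assumptions: the array layout (one row per time step, state entries first, then actions), the `conditions` dictionary and the function name are hypothetical and not taken from the authors' code.

```python
import numpy as np


def apply_conditions(trajectory, conditions, state_dim):
    """Return a copy of the plan with the constrained states clamped.

    trajectory: array of shape (horizon, state_dim + action_dim)
    conditions: dict mapping a time step to the state required at that step
    """
    conditioned = trajectory.copy()
    for timestep, state in conditions.items():
        # Only the state entries are fixed; actions remain free.
        conditioned[timestep, :state_dim] = state
    return conditioned


rng = np.random.default_rng(0)
horizon, state_dim, action_dim = 16, 4, 2

# Stand-in for a partially de-noised plan.
plan = rng.standard_normal((horizon, state_dim + action_dim))

start = np.array([0.0, 0.0, 0.0, 0.0])
goal = np.array([1.0, 1.0, 0.0, 0.0])
conditions = {0: start, horizon - 1: goal}

conditioned = apply_conditions(plan, conditions, state_dim)
```

Clamping the first and last states after each step leaves the de-noising model to fill in everything in between, much as an image model fills in masked pixels.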

What we find most interesting about this approach is that the exploration-exploitation dilemma of standard RL is handled in a non-standard way, by completely separating learning and optimisation. This gives a lot of flexibility in the choice of rewards at inference, but it likely also has some drawbacks. For example, sampling the training trajectories in a large and highly complex environment without the guidance of a reward may result in poor coverage of certain important areas. In particular, a Diffuser that is trained to navigate the environment for more generic tasks likely performs worse on a specific reward than an agent trained only for that reward, especially if the environment is complex.

More information and some videos are available on the dedicated blog post, where you can also find an introductory presentation. Finally, the code and pre-trained models are available in the Diffuser GitHub repository.