Normalizing Flows for Policy Representation in Reinforcement Learning

In part 3 of the series on normalizing flows, we discuss how reinforcement learning (RL) can benefit from this class of methods, namely by using normalizing flows to represent stochastic policies.

I assume a basic understanding of RL, so I will only briefly review the fundamentals (state, action, environment, …).
The objective of RL is to maximize some notion of return, usually the expected (discounted) sum of per-step rewards.
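For concreteness, one common way to write this objective is shown below. The symbols are notation assumed here, not taken from the talk: a trajectory $$\tau$$ generated by following the policy $$\pi$$, a reward function $$r$$, a horizon $$T$$ and a discount factor $$\gamma$$.

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t)\right], \qquad 0 \le \gamma \le 1$$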
At each time step, the action is sampled from the current policy
$$\pi(a_t | s_t)$$
We focus on stochastic policies. In this case, the policy is a (sometimes complicated) probability distribution over actions. The RL algorithm searches for a parametrized policy that comes as close as possible to an optimal policy with respect to return. There are many ways to optimize the policy, e.g. with policy gradients.
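As an illustration of the policy-gradient route, the score-function (REINFORCE) estimator can be written as follows. The notation is assumed here: $$\theta$$ are the policy parameters, and $$G_t = \sum_{k=t}^{T} \gamma^{k} \, r(s_k, a_k)$$ is the discounted reward collected from step $$t$$ onward, with discount factor $$\gamma$$ and horizon $$T$$.

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, G_t\right]$$

The estimator only needs two things from the policy: the ability to sample actions and the ability to evaluate the log-density of those actions.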
Most RL papers today represent the policy with a neural network that outputs the parameters of a fixed distribution family, e.g. the mean and standard deviation of a normal distribution.
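A minimal sketch of such a fixed-family policy is given below, using NumPy only. The linear maps `W_mu` and `W_log_sigma` are hypothetical stand-ins for a neural network, and the state and action dimensions are chosen arbitrarily for illustration.

```python
import numpy as np

def gaussian_policy(state, W_mu, W_log_sigma):
    """Hypothetical linear 'network': maps a state to the mean and std of a diagonal Gaussian."""
    mu = W_mu @ state
    sigma = np.exp(W_log_sigma @ state)  # exponentiate so the std is always positive
    return mu, sigma

def sample_action(mu, sigma, rng):
    """Draw an action via the reparametrization a = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def log_prob(action, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    z = (action - mu) / sigma
    return float(np.sum(-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

rng = np.random.default_rng(0)
state = np.array([0.5, -1.0, 2.0])           # assumed 3-dimensional state
W_mu = rng.normal(size=(2, 3))               # assumed 2-dimensional action
W_log_sigma = 0.1 * rng.normal(size=(2, 3))

mu, sigma = gaussian_policy(state, W_mu, W_log_sigma)
action = sample_action(mu, sigma, rng)
logp = log_prob(action, mu, sigma)
```

Whatever the network outputs, the resulting action distribution is unimodal with independent dimensions, which is the limitation a more expressive policy class aims to lift.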
I will present how normalizing flows can be used instead to represent RL policies.
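One way this could look is sketched below: a state-conditioned Gaussian serves as the base distribution, and a few planar flow layers transform its samples into actions, with the log-density tracked through the change-of-variables formula. Planar layers are just one common choice of bijection and are not necessarily what the referenced papers use. The parameter dictionary, the linear maps and all dimensions are assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def planar_layer(z, u, w, b):
    """One planar flow layer f(z) = z + u_hat * tanh(w.z + b); returns f(z) and log|det J|."""
    wu = w @ u
    # Reparametrize u so that w.u_hat > -1, which keeps the layer invertible.
    u_hat = u + (-1.0 + softplus(wu) - wu) * w / (w @ w)
    h = np.tanh(w @ z + b)
    z_new = z + u_hat * h
    # Matrix determinant lemma: det(I + u_hat psi^T) = 1 + psi.u_hat, psi = (1 - h^2) w
    log_det = np.log(np.abs(1.0 + (1.0 - h**2) * (w @ u_hat)))
    return z_new, log_det

def flow_policy_sample(state, params, rng):
    """Sample an action and its log-density from a flow policy conditioned on the state."""
    mu = params["W_mu"] @ state
    sigma = np.exp(params["W_log_sigma"] @ state)
    z = mu + sigma * rng.standard_normal(mu.shape)  # sample from the base Gaussian
    log_p = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    for u, w, b in params["layers"]:
        z, log_det = planar_layer(z, u, w, b)
        log_p -= log_det  # change of variables: subtract each layer's log|det J|
    return z, float(log_p)

rng = np.random.default_rng(0)
state = np.array([0.5, -1.0, 2.0])  # assumed 3-dimensional state, 2-dimensional action
params = {
    "W_mu": rng.normal(size=(2, 3)),
    "W_log_sigma": 0.1 * rng.normal(size=(2, 3)),
    "layers": [(rng.normal(size=2), rng.normal(size=2), rng.normal()) for _ in range(3)],
}
action, logp = flow_policy_sample(state, params, rng)
```

Because every layer is a bijection with a cheap Jacobian determinant, sampling and exact log-density evaluation both remain available, so the policy can still be plugged into objectives that require the log-probability of the sampled action.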

References

[Kob20N]

Normalizing Flows: An Introduction and Review of Current Methods, Ivan Kobyzev, Simon J. D. Prince, Marcus A. Brubaker.

Apr 2020

Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review current …

[Pap21N]

Normalizing flows for probabilistic modeling and inference, George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, Balaji Lakshminarayanan.

Jan 2021

Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of …

[Maz20L]

Leveraging exploration in off-policy algorithms via normalizing flows, Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, R. Devon Hjelm.

May 2020

The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural dens...

[War19I]

Improving Exploration in Soft-Actor-Critic with Normalizing Flows Policies, Patrick Nadeem Ward, Ariella Smofsky, Avishek Joey Bose.

Jun 2019

Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are known to be brittle toward hyperparameters as well as sample inefficient. Soft Actor Critic (SAC) proposes an off-policy deep actor critic algorithm within the maximum entropy RL framework which offers greater stability and empirical gains. The choice of policy distribution, a factored Gaussian, is motivated …
