In part 3 on Normalizing Flows, we will discuss how Reinforcement Learning could benefit from this class of methods for policy representation:

**Using Normalizing Flows to represent stochastic policies in reinforcement learning:**

I assume that everyone has a basic understanding of RL, so I will only briefly review the fundamentals (state, action, environment, …).

The objective of RL is to maximize some notion of return, usually the expected (discounted) sum of the rewards received at each time step.
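To make this concrete, one common way to write the objective is shown below. The symbols are assumptions introduced here for illustration: $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, $T$ is the (possibly infinite) horizon, and $\tau$ is a trajectory generated by following the policy $\pi$.

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right], \qquad \gamma \in (0, 1]$$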

Actions are chosen according to the current policy

$$\pi(a_t | s_t)$$

We focus on stochastic policies. In this case, the policy is a (sometimes complicated) probability distribution over actions given the current state. Our RL algorithm tries to find a parametrized policy that comes as close as possible to an optimal policy w.r.t. return. There are many ways to optimize that policy, e.g. with policy gradients.

Most RL papers nowadays represent the policy with a neural network (NN) that outputs the parameters of a fixed distribution family, e.g. the mean (mu) and standard deviation (sigma) of a normal distribution.
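As a rough illustration of this fixed-distribution approach, here is a minimal sketch of a one-dimensional Gaussian policy in plain Python. It is not taken from any particular paper: the "network" is replaced by a toy linear map, and the names `gaussian_policy_params`, `sample_action`, `log_prob`, `w_mu` and `w_log_sigma` are hypothetical, chosen only for this example.

```python
import math
import random


def gaussian_policy_params(state, w_mu, w_log_sigma):
    """Toy linear stand-in for the NN: maps a state to (mu, sigma)."""
    mu = sum(w * s for w, s in zip(w_mu, state))
    log_sigma = sum(w * s for w, s in zip(w_log_sigma, state))
    # Predicting log(sigma) and exponentiating keeps sigma strictly positive.
    return mu, math.exp(log_sigma)


def sample_action(state, w_mu, w_log_sigma, rng=random):
    """Draw a_t ~ pi(a_t | s_t) = Normal(mu(s_t), sigma(s_t))."""
    mu, sigma = gaussian_policy_params(state, w_mu, w_log_sigma)
    return rng.gauss(mu, sigma)


def log_prob(action, state, w_mu, w_log_sigma):
    """Log-density log pi(a_t | s_t) of the Gaussian policy."""
    mu, sigma = gaussian_policy_params(state, w_mu, w_log_sigma)
    z = (action - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


# Example usage with made-up weights (assumed values, not learned ones).
state = [1.0, 2.0]
w_mu = [0.5, 0.25]
w_log_sigma = [0.0, 0.0]
action = sample_action(state, w_mu, w_log_sigma, rng=random.Random(0))
print(action, log_prob(action, state, w_mu, w_log_sigma))
```

Note that, whatever the state, a policy of this form is always a single symmetric bell curve over actions. Only its location and width change, which is the kind of restriction a more flexible policy representation aims to lift.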

I now present the idea of using normalizing flows to represent RL policies.