Duality in RL and the [*]-DICE Family (part 1)

The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are i) policy gradient methods and ii) dynamic programming based methods. In this talk I will describe an alternative solution strategy based on LP duality, specifically the duality between Q-values and state-action visitations. Its generalisation, Fenchel-Rockafellar duality, is the backbone of a variety of recent methods for off-policy evaluation and off-policy optimisation. In particular, we will see that regularising the primal variables encourages better generalisation, whereas regularising the dual variables alleviates distribution shift.
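
To make the duality between Q-values and state-action visitations concrete, the following is a minimal sketch on a small random tabular MDP. It is an illustration under assumed notation, not code from the talk or the referenced paper: the names `P`, `r`, `pi`, `mu0`, `gamma` and the function `policy_value_both_ways` are hypothetical, and the value is normalised by (1 - gamma). For a fixed policy, the value computed from the Q-values under the initial distribution should match the expected reward under the discounted state-action visitation distribution.

```python
import numpy as np

def policy_value_both_ways(P, r, pi, mu0, gamma):
    """Evaluate a fixed policy via Q-values and via visitations.

    P:   transition probabilities, shape (S, A, S)   (assumed layout)
    r:   rewards, shape (S, A)
    pi:  policy probabilities, shape (S, A)
    mu0: initial state distribution, shape (S,)
    """
    S, A = pi.shape
    n = S * A
    # Transition matrix on state-action pairs: (s, a) -> (s', a').
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
    # Initial state-action distribution mu0(s) * pi(a | s).
    init = (mu0[:, None] * pi).reshape(n)
    r_flat = r.reshape(n)
    eye = np.eye(n)

    # Q-values solve the Bellman equation Q = r + gamma * P_pi Q.
    q = np.linalg.solve(eye - gamma * P_pi, r_flat)
    # Visitations solve the flow equation d = (1 - gamma) * init + gamma * P_pi^T d.
    d = (1.0 - gamma) * np.linalg.solve(eye - gamma * P_pi.T, init)

    value_from_q = (1.0 - gamma) * init @ q
    value_from_d = d @ r_flat
    return value_from_q, value_from_d, d

# Small random MDP, purely for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
pi = rng.random((S, A))
pi /= pi.sum(axis=-1, keepdims=True)
mu0 = np.full(S, 1.0 / S)

primal, dual, d = policy_value_both_ways(P, r, pi, mu0, gamma)
print(primal, dual)  # the two numbers should agree up to numerical error
```

Both linear systems are exact solves here only because the MDP is tiny and fully known; the methods discussed in the talk target the setting where these quantities must be estimated from off-policy data.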

References

[Nac20R]

Reinforcement Learning via Fenchel-Rockafellar Duality, Ofir Nachum, Bo Dai.

Jan 2020

We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including …