The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are i) policy gradient methods and ii) dynamic programming based methods. In this talk I will describe an alternative solution strategy based on LP duality, specifically the duality between Q-values and state-action visitations. The more general Fenchel-Rockafellar duality is the backbone of a variety of recent methods for off-policy evaluation and off-policy optimisation. We will see, in particular, that regularising the primal variables enforces better generalisation, whereas regularising the dual variables alleviates distribution shift.
Reinforcement Learning via Fenchel-Rockafellar Duality
We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including …
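As a concrete illustration of the Q-value/visitation duality in the policy-evaluation setting, the following sketch solves the standard Q-LP and its dual visitation LP for a small tabular MDP and checks that their optimal values coincide (strong duality). The MDP, policy, and all numbers are hypothetical, chosen only to make the example runnable; they are not taken from the talk or the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP with a fixed policy pi (policy evaluation).
gamma = 0.9
nS, nA = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[s, a] rewards
pi = np.array([[0.6, 0.4], [0.3, 0.7]])     # pi[s, a] action probabilities
mu0 = np.array([0.5, 0.5])                  # initial-state distribution

n = nS * nA
# State-action transition matrix: P_pi[(s,a), (s',a')] = P(s'|s,a) * pi(a'|s')
P_pi = (P.reshape(n, nS)[:, :, None] * pi[None, :, :]).reshape(n, n)
init = (mu0[:, None] * pi).reshape(n)       # initial state-action distribution

# Primal LP over Q:  min (1-gamma) <init, Q>  s.t.  (I - gamma P_pi) Q >= r
res_primal = linprog(c=(1 - gamma) * init,
                     A_ub=-(np.eye(n) - gamma * P_pi), b_ub=-r.reshape(n),
                     bounds=[(None, None)] * n)

# Dual LP over visitations d:  max <r, d>
#   s.t.  d = (1-gamma) init + gamma P_pi^T d,  d >= 0
res_dual = linprog(c=-r.reshape(n),
                   A_eq=np.eye(n) - gamma * P_pi.T,
                   b_eq=(1 - gamma) * init,
                   bounds=[(0, None)] * n)

# Both optima equal the normalised return rho(pi) = (1-gamma) E[sum_t gamma^t r_t].
rho = res_primal.fun
assert np.isclose(rho, -res_dual.fun, atol=1e-6)    # strong duality holds
print(rho)
```

The primal variables are Q-values and the dual variables are discounted state-action visitations; regularisers added to either side yield the off-policy evaluation and optimisation methods mentioned above.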