Duality in RL and the [*]-DICE Family (part 1)

The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are i) policy gradient and ii) dynamic-programming-based methods. In this talk I will describe an alternative solution strategy based on LP duality, namely the duality between Q-values and state-action visitations. The more general Fenchel-Rockafellar duality is the backbone of a variety of recent methods for off-policy evaluation and off-policy optimisation. We will see, in particular, that regularising the primal variables encourages better generalisation, whereas regularising the dual variables alleviates distribution shift.
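To make the Q-value/visitation duality concrete, here is a sketch of the standard linear-programming pair for policy evaluation (the notation is mine, not from the abstract: $\mu_0$ is the initial-state distribution, $P$ the transition kernel, $\pi$ a fixed policy, $\gamma \in (0,1)$ the discount factor):

```latex
% Primal "Q-LP": the smallest function satisfying the Bellman
% inequalities evaluates the policy at the start distribution.
\min_{Q}\;\; (1-\gamma)\,
  \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(\cdot \mid s_0)}
  \big[ Q(s_0, a_0) \big]
\quad \text{s.t.} \quad
Q(s,a) \;\ge\; r(s,a) + \gamma\,
  \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}
  \big[ Q(s', a') \big]
\quad \forall (s,a).

% Dual: maximise reward over nonnegative state-action visitations d
% subject to discounted flow-conservation constraints.
\max_{d \ge 0}\;\; \sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
d(s,a) \;=\; (1-\gamma)\, \mu_0(s)\, \pi(a \mid s)
  + \gamma \sum_{\tilde s, \tilde a}
    P(s \mid \tilde s, \tilde a)\, d(\tilde s, \tilde a)\, \pi(a \mid s)
\quad \forall (s,a).
```

Both optimal values coincide with the normalised return of $\pi$; applying Fenchel-Rockafellar duality to regularised variants of this pair is, as the abstract indicates, the route taken by the recent off-policy methods discussed in the talk.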