Duality in RL and the [*]-DICE Family (part 2)

The goal of reinforcement learning (RL) is to find an optimal policy. Two commonly used classes of methods are i) policy-gradient and ii) dynamic-programming-based methods. In this talk I will describe an alternative solution strategy based on linear-programming (LP) duality, namely the duality between Q-values and state-action visitations.
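To make the Q-value/visitation duality concrete, here is a sketch of the standard LP formulation of policy evaluation in a discounted MDP (notation assumed here: discount $\gamma$, initial distribution $\mu_0$, transitions $P$, policy $\pi$, reward $r$; this is the textbook pairing, not a formula taken from the talk itself):

```latex
% Primal LP over Q-values: the smallest "superharmonic" Q upper-bounds the true Q^\pi.
\min_{Q}\;(1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(\cdot\mid s_0)}\!\left[Q(s_0,a_0)\right]
\quad\text{s.t.}\quad
Q(s,a)\;\ge\; r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\!\left[Q(s',a')\right]
\;\;\forall (s,a).

% Dual LP over visitations: the dual variable d is exactly the discounted
% state-action visitation distribution of \pi, and the objective is its value.
\max_{d\ge 0}\;\sum_{s,a} d(s,a)\,r(s,a)
\quad\text{s.t.}\quad
d(s,a)\;=\;(1-\gamma)\,\mu_0(s)\,\pi(a\mid s)
+\gamma\,\pi(a\mid s)\sum_{s',a'} P(s\mid s',a')\,d(s',a')
\;\;\forall (s,a).
```

At the optimum the two objectives coincide with the policy value $(1-\gamma)^{-1}$-normalised, which is precisely the Q-value/visitation duality referred to above.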

This talk addresses the offline setting, in which the agent learns from a fixed dataset without interacting with the environment. After recapitulating LP duality in RL, I will introduce the Fenchel-Rockafellar duality that is the backbone of a variety of recent methods for offline policy evaluation and offline policy optimisation. Two core difficulties of offline RL are distribution shift in the data and the generalisation/extrapolation problem. We will see how regularising the primal variables enforces better generalisation, whereas regularising the dual variables alleviates distribution shift.
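For reference, the Fenchel-Rockafellar duality invoked here can be stated as follows (a standard form under the usual convexity and constraint-qualification assumptions; $f^*$, $g^*$ denote convex conjugates and $A^*$ the adjoint of the linear map $A$):

```latex
% Fenchel-Rockafellar duality: for convex f, g and a linear operator A,
%   min_x  f(x) + g(Ax)  =  max_y  -f^*(-A^* y) - g^*(y),
% provided a suitable constraint qualification holds (e.g. g continuous
% at some point of the range of A).
\min_{x}\; f(x)+g(Ax)
\;=\;
\max_{y}\; -f^{*}(-A^{*}y)-g^{*}(y).
```

Applied to the RL linear programs, choosing $f$ and $g$ as regularisers on the Q-values (primal) or on the visitation ratios (dual) is what yields the different members of the [*]-DICE family.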