The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are i) policy gradient methods and ii) dynamic programming based methods. In this talk I will describe an alternative solution strategy that rests on linear programming (LP) duality, namely the duality between Q-values and state-action visitations.
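To fix ideas, a sketch of this duality under assumed notation ($\mu_0$ for the initial state distribution, $P$ for the transition kernel, $\gamma$ for the discount factor, $\pi$ for a fixed target policy; none of these symbols are defined in the abstract itself): the policy-evaluation LP over Q-values and its dual over state-action visitations $d$ can be written as
\[
\min_{Q}\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(\cdot\mid s_0)}\big[Q(s_0,a_0)\big]
\quad\text{s.t.}\quad
Q(s,a)\;\ge\; r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\big[Q(s',a')\big]\;\;\forall (s,a),
\]
\[
\max_{d\ge 0}\; \mathbb{E}_{(s,a)\sim d}\big[r(s,a)\big]
\quad\text{s.t.}\quad
d(s,a)\;=\;(1-\gamma)\,\mu_0(s)\,\pi(a\mid s)+\gamma\,\pi(a\mid s)\sum_{s',a'}P(s\mid s',a')\,d(s',a')\;\;\forall (s,a),
\]
and the optimal values of both programs equal the normalised expected return of $\pi$.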
This talk addresses the offline setting, in which the agent learns from a fixed dataset without interacting with the environment. After recapitulating LP duality in RL, I will introduce Fenchel-Rockafellar duality, which is the backbone of a variety of recent methods for offline policy evaluation and offline policy optimisation. Two core difficulties of offline RL are the distribution shift in the data and the generalisation/extrapolation problem. We will see how regularising the primal variables encourages better generalisation, whereas regularising the dual variables alleviates distribution shift.
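For reference, a minimal statement of the Fenchel-Rockafellar duality these methods build on (generic form; the particular convex functions $f$, $g$ and linear operator $A$ used in the offline estimators are left to the talk): for proper, convex, lower semi-continuous $f$ and $g$ and a linear operator $A$ with adjoint $A^{\top}$,
\[
\min_{x}\; f(x)+g(Ax)\;=\;\max_{y}\; -f^{*}(-A^{\top}y)-g^{*}(y),
\]
where $f^{*}$ and $g^{*}$ denote the convex (Fenchel) conjugates and equality holds under the usual constraint qualifications. Applied to regularised versions of the LP above, this identity yields objectives that can be estimated from off-policy data, which is what makes it attractive in the offline setting.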