The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are (i) policy-gradient methods and (ii) dynamic-programming-based methods. In this talk I will describe an alternative solution strategy built on LP duality, namely the duality between Q-values and state-action visitations. The more general Fenchel-Rockafellar duality [Nac20R] is the backbone of a variety of recent methods for off-policy evaluation and off-policy optimisation. We will see, in particular, that regularising the primal variables encourages better generalisation, whereas regularising the dual variables alleviates distribution shift.
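To make this duality concrete, the following is a minimal sketch of the policy-evaluation LP that pairs Q-values with state-action visitations. The notation ($\mu_0$ for the initial-state distribution, $T$ for the transition kernel, $d$ for the dual variable) is assumed here and follows the standard Q-LP formulation; it is not necessarily stated exactly as in the talk.

\begin{align*}
\text{(primal)}\quad & \min_{Q}\; (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(\cdot\mid s_0)}\big[Q(s_0,a_0)\big] \\
& \text{s.t.}\quad Q(s,a)\;\ge\; r(s,a) + \gamma\,\mathbb{E}_{s'\sim T(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\big[Q(s',a')\big] \quad \forall\,(s,a), \\[4pt]
\text{(dual)}\quad & \max_{d\ge 0}\; \sum_{s,a} d(s,a)\,r(s,a) \\
& \text{s.t.}\quad d(s,a) \;=\; (1-\gamma)\,\mu_0(s)\,\pi(a\mid s) + \gamma \sum_{\bar s,\bar a} T(s\mid\bar s,\bar a)\,\pi(a\mid s)\,d(\bar s,\bar a) \quad \forall\,(s,a).
\end{align*}

At the optimum the primal variable is $Q^\pi$, the dual variable is the discounted state-action visitation $d^\pi(s,a) = (1-\gamma)\sum_{t\ge 0}\gamma^t \Pr(s_t=s,\,a_t=a)$, and both objectives equal the normalised value $(1-\gamma)\,\mathbb{E}_\pi\big[\sum_{t\ge 0}\gamma^t r_t\big]$. Regularising this pair of problems and applying Fenchel-Rockafellar duality is the route taken by the DICE-style off-policy estimators discussed in the talk.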
Duality in RL and the [*]-DICE Family (part 1)
References
[Nac20R] Ofir Nachum and Bo Dai. Reinforcement Learning via Fenchel-Rockafellar Duality. arXiv:2001.01866, 2020.