Dense Rewards and Continual RL for Task-Oriented Dialogue Policies

Christian will present a proposal for dense rewards in task-oriented dialogue systems to enhance sample efficiency and discuss continual reinforcement learning of dialogue policies. Key topics include an architecture for continual learning, an extended learning environment, lifetime return optimization, and meta-reinforcement learning for hyperparameter adaptation.

Abstract

Task-oriented dialogue systems help users accomplish a specific task, such as finding and booking hotels or restaurants, helping with travel planning, or organizing calendars. Traditionally, task-oriented dialogue systems have at least two components: 1) a dialogue state tracker that keeps track of the dialogue history, and 2) a dialogue policy that decides on the next action to take in order to steer the conversation towards task success. The dialogue policy is typically optimized using reinforcement learning, where the reward signal is mainly defined by the success or failure of the dialogue, leading to a sparse reward problem. Moreover, dialogue policies operate within a fixed scope defined by an ontology and have so far not been equipped with the continual learning abilities that are required for learning new tasks over time. In this talk, which is an extended version of my PhD defence presentation, I will present a proposal for dense rewards in dialogue to increase sample efficiency. We will also discuss continual reinforcement learning of dialogue policies, including an architecture that enables continual dialogue policy learning, an extended continual learning environment for dialogue policies, lifetime return optimization, and meta-reinforcement learning for continual hyperparameter adaptation.
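
To make the sparse-reward issue concrete, the sketch below contrasts the conventional success-based reward with a hypothetical dense, per-turn reward that credits progress on the user's goal as soon as it happens. This is an illustration only, under assumed names such as Turn and filled_slots; it does not reproduce the reward design proposed in the talk.

```python
# Illustrative sketch (not the speaker's actual reward design): a conventional
# sparse, success-based reward vs. a hypothetical dense, per-turn reward based
# on newly satisfied goal slots. All names here are assumptions.

from dataclasses import dataclass


@dataclass
class Turn:
    filled_slots: set[str]   # goal slots confirmed so far, e.g. {"area", "price"}
    is_final: bool           # True on the last turn of the dialogue
    success: bool            # task success, only meaningful on the final turn


def sparse_reward(turn: Turn, turn_penalty: float = -1.0,
                  success_bonus: float = 20.0) -> float:
    """Conventional reward: small per-turn penalty, large bonus only at the end."""
    if turn.is_final:
        return success_bonus if turn.success else 0.0
    return turn_penalty


def dense_reward(prev: Turn, curr: Turn, per_slot: float = 2.0,
                 turn_penalty: float = -1.0) -> float:
    """Hypothetical dense shaping: reward each newly filled goal slot immediately."""
    newly_filled = curr.filled_slots - prev.filled_slots
    return turn_penalty + per_slot * len(newly_filled)


if __name__ == "__main__":
    prev = Turn(filled_slots={"area"}, is_final=False, success=False)
    curr = Turn(filled_slots={"area", "price"}, is_final=False, success=False)
    print("sparse:", sparse_reward(curr))       # -1.0: no signal until the dialogue ends
    print("dense: ", dense_reward(prev, curr))  # +1.0: immediate credit for progress
```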
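
For the lifetime return mentioned above, a common formulation in continual RL optimizes the discounted return accumulated over the agent's entire lifetime, spanning a sequence of tasks, rather than resetting the objective per episode. The notation below is a generic sketch of that objective and may differ from the exact objective used in the thesis.

```latex
% Generic lifetime-return objective (assumed standard formulation):
% the policy \pi is evaluated over its whole lifetime of T steps, which spans
% a sequence of tasks/ontologies, instead of per-episode returns.
J_{\text{lifetime}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]
```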
