Dense Rewards and Continual RL for Task-Oriented Dialogue Policies

Christian will present a proposal for dense rewards in task-oriented dialogue systems to enhance sample efficiency and discuss continual reinforcement learning of dialogue policies. Key topics include an architecture for continual learning, an extended learning environment, lifetime return optimization, and meta-reinforcement learning for hyperparameter adaptation.

Abstract

Task-oriented dialogue systems help users accomplish a specific task, such as finding and booking hotels or restaurants, helping with travel planning, or organizing calendars. Traditionally, task-oriented dialogue systems have at least two components: 1) a dialogue state tracker that keeps track of the dialogue history, and 2) a dialogue policy that decides on the next action to take in order to steer the conversation towards task success. The dialogue policy is typically optimized using reinforcement learning, where the reward signal is mainly defined by the success or failure of the dialogue, leading to a sparse reward problem. Moreover, dialogue policies operate within a fixed scope defined by an ontology and have so far not been equipped with the continual learning abilities that are required for learning new tasks over time. In this talk, which is an extended version of my PhD defence presentation, I will present a proposal for dense rewards in dialogue to increase sample efficiency. We will also discuss continual reinforcement learning of dialogue policies, including an architecture that enables continual dialogue policy learning, an extended continual learning environment for dialogue policies, lifetime return optimization, and meta-reinforcement learning for continual hyperparameter adaptation.
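
To make the sparse-reward issue concrete, the sketch below contrasts the conventional success-based reward with a hypothetical dense, per-turn reward that credits progress on the user's goal as soon as it happens. This is an illustration only, under assumed names such as Turn and filled_slots; it does not reproduce the reward design proposed in the talk.

```python
# Illustrative sketch (not the speaker's actual reward design): a conventional
# sparse, success-based reward vs. a hypothetical dense, per-turn reward based
# on newly satisfied goal slots. All names here are assumptions.

from dataclasses import dataclass


@dataclass
class Turn:
    filled_slots: set[str]   # goal slots confirmed so far, e.g. {"area", "price"}
    is_final: bool           # True on the last turn of the dialogue
    success: bool            # task success, only meaningful on the final turn


def sparse_reward(turn: Turn, turn_penalty: float = -1.0,
                  success_bonus: float = 20.0) -> float:
    """Conventional reward: small per-turn penalty, large bonus only at the end."""
    if turn.is_final:
        return success_bonus if turn.success else 0.0
    return turn_penalty


def dense_reward(prev: Turn, curr: Turn, per_slot: float = 2.0,
                 turn_penalty: float = -1.0) -> float:
    """Hypothetical dense shaping: reward each newly filled goal slot immediately."""
    newly_filled = curr.filled_slots - prev.filled_slots
    return turn_penalty + per_slot * len(newly_filled)


if __name__ == "__main__":
    prev = Turn(filled_slots={"area"}, is_final=False, success=False)
    curr = Turn(filled_slots={"area", "price"}, is_final=False, success=False)
    print("sparse:", sparse_reward(curr))       # -1.0: no signal until the dialogue ends
    print("dense: ", dense_reward(prev, curr))  # +1.0: immediate credit for progress
```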
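
For the lifetime return mentioned above, a common formulation in continual RL optimizes the discounted return accumulated over the agent's entire lifetime, spanning a sequence of tasks, rather than resetting the objective per episode. The notation below is a generic sketch of that objective and may differ from the exact objective used in the thesis.

```latex
% Generic lifetime-return objective (assumed standard formulation):
% the policy \pi is evaluated over its whole lifetime of T steps, which spans
% a sequence of tasks/ontologies, instead of per-episode returns.
J_{\text{lifetime}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]
```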
