Problems and solutions in offline learning

Current reinforcement learning algorithms require large amounts of data to solve new tasks. This “hunger for data” is especially onerous in high-stakes scenarios like healthcare or education, where on-policy data collection is infeasible.

This motivates the batch RL setting, in which the learner is given only a fixed batch of transition tuples and is allowed no further interaction with the environment. Because this data was not necessarily collected by an expert, imitation learning is not an option. While Q-learning and derived methods like DQN or SAC are in theory capable of off-policy learning, in practice they fail dramatically in this setting.

In this talk we will examine the cause of this surprising failure of powerful off-policy algorithms, along with proposals to fix it. Recent batch RL papers also tell a tale of disagreement about evaluation scenarios.
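
As a rough illustration of the failure mode, the abstract of [Fuj19O] below attributes it to errors introduced by extrapolation. The following is a minimal tabular sketch of that idea, not the method of any of the referenced papers. The two-state toy problem, the function name `batch_q_learning`, and the `restrict_to_seen` flag are all hypothetical and chosen only for illustration. The batch never contains action 1, so its Q-value is never updated, yet the bootstrapped target still takes a maximum over it.

```python
import numpy as np


def batch_q_learning(batch, n_states, n_actions, gamma=0.9, lr=0.5,
                     sweeps=200, q_init=0.0, restrict_to_seen=False):
    """Tabular Q-learning on a fixed batch of (s, a, r, s_next, done) tuples.

    No new transitions are ever collected: the same batch is swept repeatedly.
    """
    q = np.full((n_states, n_actions), q_init, dtype=float)

    # Record which (state, action) pairs actually occur in the batch.
    seen = np.zeros((n_states, n_actions), dtype=bool)
    for s, a, _, _, _ in batch:
        seen[s, a] = True

    for _ in range(sweeps):
        for s, a, r, s_next, done in batch:
            if done:
                target = r
            else:
                next_q = q[s_next]
                # Assumption for illustration: optionally bootstrap only from
                # actions that appear in the batch for the next state.
                if restrict_to_seen and seen[s_next].any():
                    next_q = next_q[seen[s_next]]
                target = r + gamma * next_q.max()
            q[s, a] += lr * (target - q[s, a])
    return q


# Hypothetical toy data: state 0 leads to state 1 with reward 0, and state 1
# terminates with reward 1. Only action 0 is ever taken in the batch.
batch = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 1, True)]

# An arbitrary initial value (here 10) stands in for an extrapolated estimate.
naive = batch_q_learning(batch, 2, 2, q_init=10.0)
constrained = batch_q_learning(batch, 2, 2, q_init=10.0, restrict_to_seen=True)

print("naive Q(0, 0):      ", naive[0, 0])
print("constrained Q(0, 0):", constrained[0, 0])
```

In this toy setting the return actually achievable from state 0 with action 0 is 0.9. The unconstrained update instead inherits the arbitrary value of the unseen action through the max, while the variant that bootstraps only from actions present in the batch recovers the achievable value.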

References

[Fuj19O]

Off-Policy Deep Reinforcement Learning without Exploration, Scott Fujimoto, David Meger, Doina Precup.

Aug 2019

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with data …

[Kum19S]

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction, Aviral Kumar, Justin Fu, George Tucker, Sergey Levine.

Nov 2019

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards …

[Wu19B]

Behavior Regularized Offline Reinforcement Learning, Yifan Wu, George Tucker, Ofir Nachum.

Nov 2019

In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a …
