Current reinforcement learning algorithms require vast amounts of data to solve new tasks. This “hunger for data” is especially onerous in high-stakes scenarios like healthcare or education, where on-policy data collection is infeasible.
This motivates the Batch RL setting, where the learner is given only a batch of transition tuples and allowed no further interaction with the environment. As this data was not necessarily collected by an expert, imitation learning is not an option. And while Q-learning and derived methods like DQN or SAC are in theory capable of off-policy learning, in practice they fail dramatically in this setting.
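As a rough illustration of the setting, the sketch below runs tabular Q-learning over a fixed batch of transitions and never queries the environment; the toy dataset, state/action counts, and hyperparameters are made-up assumptions, not taken from the talk. The max over all actions in the update, including actions never observed in the batch, is exactly where extrapolation error creeps in once function approximation is involved.

```python
# Minimal sketch of Batch RL: tabular Q-learning on a fixed dataset of
# transitions, with no further environment interaction.
# Toy dataset and hyperparameters are illustrative assumptions only.
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha, n_epochs = 0.99, 0.1, 50

# Fixed batch of (state, action, reward, next_state, done) tuples,
# e.g. logged by some unknown (not necessarily expert) behavior policy.
batch = [
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (2, 1, 1.0, 3, True),
    (3, 0, 0.0, 4, True),
]

Q = np.zeros((n_states, n_actions))
for _ in range(n_epochs):
    for s, a, r, s_next, done in batch:
        # Off-policy Q-learning target: max over *all* actions at s_next,
        # even ones the batch never shows -- the source of trouble with
        # function approximation.
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

greedy_policy = Q.argmax(axis=1)
print(greedy_policy)
```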
In this talk we will examine the causes of this surprising failure of powerful off-policy algorithms and the proposals to fix it. Recent Batch RL papers also provide a tale of disagreement on evaluation scenarios.