The field of reinforcement learning is extraordinarily active, with papers claiming to beat the state of the art appearing almost weekly. At the same time, practitioners experience high variability in the performance of RL agents, with results on their specific tasks often not reflecting the claims made in the literature.
In the important paper [Aga21D], a rigorous and lucid statistical analysis of the performance of various algorithms reveals several reasons behind this discrepancy:
- Most modern deep RL algorithms have a high degree of randomness in their performance scores.
- The way that scores are computed, aggregated and reported is often not statistically sound, resulting in unreliable statements about and comparisons of algorithms.
- Different evaluation protocols may be used in different papers, which complicates making a fair comparison of these works.
The authors perform a costly analysis with hundreds of runs per task on a benchmark. The per-task performance of an RL algorithm is treated as a random variable $X$. If a statistic, for example the mean, is computed from $N$ runs $x_1, \dots, x_N$ on some task, then this sample statistic is itself a realization of the random variable $\bar{X} := \frac{1}{N} \sum_{i=1}^N X_i$, with each $X_i$ being an independent copy of $X$. Similarly, a task-aggregated performance metric should be treated as a random variable. By performing many runs and subsampling different sets of $N$ runs from them, the authors study the distributions of these aggregated metrics for various modern algorithms.
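To make the subsampling idea concrete, here is a minimal sketch in the spirit of that experiment. The scores are synthetic stand-ins, and the number of runs, tasks and the aggregation by a plain mean are all illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 recorded runs for each of 5 tasks; the scores are
# synthetic stand-ins for real, human-normalized benchmark results.
all_scores = rng.normal(loc=1.0, scale=0.5, size=(100, 5))

N = 3              # number of runs a typical paper would report
n_subsamples = 10_000

# Repeatedly subsample N runs per task and aggregate across tasks, mimicking
# what a paper that only ran N seeds would have reported.
estimates = []
for _ in range(n_subsamples):
    idx = rng.choice(all_scores.shape[0], size=N, replace=False)
    per_task_mean = all_scores[idx].mean(axis=0)  # mean over the N runs, per task
    estimates.append(per_task_mean.mean())        # aggregate over tasks

estimates = np.asarray(estimates)
print(f"aggregate metric: {estimates.mean():.3f} +/- {estimates.std():.3f} across subsamples")
```

The spread of `estimates` is exactly the kind of variability that a single reported number hides.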
The results are rather shocking. The aggregated metrics have very high variability for many RL algorithms, and on top of that the reported aggregated results are often far from the means of these distributions, grossly over- or underestimating performance (see Figure 2.1 and Figure 2.2).
Also, different ways of aggregating performance across tasks lead to different rankings of algorithms. For example, while Dreamer reaches by far the highest median score over all tasks in Atari200M, it is IQN that performs best in terms of being close to or better than human performance on as many tasks as possible (measured by a newly introduced metric called the optimality gap, which ignores by how far an algorithm performs above human level); see below. As a consequence of the sound statistical analysis performed here, several conclusions drawn in the literature about the superiority of one algorithm over another have to be reconsidered or even discarded.
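For concreteness, the optimality gap can be understood as the average shortfall of human-normalized scores below a target of $\gamma = 1$, where any score above the target contributes nothing. A minimal numpy sketch of this idea follows; the function name and the example scores are made up for illustration:

```python
import numpy as np

def optimality_gap(normalized_scores: np.ndarray, gamma: float = 1.0) -> float:
    """Average amount by which runs fall short of the target score gamma.

    Scores above gamma contribute nothing, so "superhuman" margins are ignored.
    Lower is better; 0 means every run reached at least human-level performance.
    """
    return float(np.mean(np.maximum(gamma - normalized_scores, 0.0)))

# Hypothetical human-normalized scores from a handful of runs/tasks:
scores = np.array([0.4, 0.9, 1.7, 2.5, 0.1])
print(optimality_gap(scores))  # 0.32 -- only the shortfalls below 1.0 count
```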
Unfortunately, the obvious solution to this problem, which is simply to use more runs for evaluation, is often not feasible due to the excessive amount of compute required. The authors go on to propose task-aggregated statistics that are better suited to estimating quantities of high variability from even a few runs, such as the interquartile mean (IQM), which can be seen as an intermediate between mean and median. They also recommend reporting performance profiles as in Figure 7 instead of single scalars, as well as publishing the results of all runs, to enable future statistical analysis. The techniques used in this analysis were published as an open-source library: rliable.
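As a rough illustration of the recommended workflow, the sketch below computes the IQM with a bootstrap confidence interval using rliable, and a simple performance profile by hand. It is modeled on the library's README; the calls `metrics.aggregate_iqm` and `library.get_interval_estimates` are taken from there, while the scores, the algorithm name and the parameter values are placeholders:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Placeholder human-normalized scores: rows = runs, columns = tasks.
score_dict = {"my_algorithm": np.random.rand(5, 26)}

# IQM = mean over the middle 50% of all (run, task) scores:
# more robust to outliers than the mean, less noisy than the median.
aggregate_func = lambda scores: np.array([metrics.aggregate_iqm(scores)])

# Stratified-bootstrap point estimates and confidence intervals.
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)
print(point_estimates["my_algorithm"], interval_estimates["my_algorithm"])

# A performance profile reports, for each threshold tau, the fraction of
# runs (pooled over tasks) whose normalized score exceeds tau.
scores = score_dict["my_algorithm"]
taus = np.linspace(0.0, 2.0, 81)
profile = np.array([(scores > tau).mean() for tau in taus])
```

Reporting the whole profile, or at least the IQM together with its confidence interval, conveys far more information than a single aggregate number.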
In a sense, this paper is “simply” a call for the application of well-established, sound statistical methods for reporting results in RL. The authors propose to include their recommendations as acceptance requirements for popular ML conferences, thereby holding the community to a certain level of rigor. The fact that this work received an Outstanding Paper Award (Top 0.07%) at NeurIPS 2021 suggests that the community has recognized these issues as important, and makes it likely that future reporting of RL results will incorporate some of the recommendations and thereby improve significantly.
Personal comment: I believe this paper to be an important contribution and a call to action - without sound statistical analysis, reported results may be overly optimistic and cause frustration and disappointment in attempts at reproducing them, thereby undermining trust in deep RL in general. One point is missing though, I believe: the research community is (rightly) focused on general algorithms that perform well across many tasks, so nearly all comparisons are reported for a multitude of tasks simultaneously. However, in practice one is often interested in the performance on a single task. Analyzing the variability of different algorithms on a single task should therefore be of high interest for practitioners, but this kind of analysis is neither performed nor recommended here.