The field of reinforcement learning is extraordinarily active, with papers claiming to beat the state of the art appearing almost weekly. At the same time, practitioners experience high variability in the performance of RL agents, with results on their specific tasks often not reflecting the claims made in the literature.
In the important paper [Aga21D], a rigorous and lucid statistical analysis of the performance of various algorithms reveals several reasons behind this discrepancy:
- Most modern deep RL algorithms have a high degree of randomness in their performance scores.
- The way that scores are computed, aggregated and reported is often not statistically sound, resulting in unreliable statements about and comparisons of algorithms.
- Different evaluation protocols may be used in different papers, which complicates making a fair comparison of these works.
The authors perform a costly analysis with hundreds of runs per task on a benchmark. The per-task performance of an RL algorithm is treated as a random variable $X$. If a statistic, for example the mean, is computed from $N$ runs $x_1, \dots, x_N$ on some task, then this sample statistic is itself a realization of the random variable $\bar{X} := \frac{1}{N} \sum_{i=1}^N X_i$, with each $X_i$ being an independent copy of $X$. Similarly, a task-aggregated performance metric should be treated as a random variable. By performing many runs and subsampling different sets of $N$ runs from them, the authors study the distributions of these aggregated metrics for various modern algorithms.
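To make the subsampling idea concrete, here is a minimal sketch in the spirit of that experiment. The scores are synthetic stand-ins, and the number of runs, tasks and the aggregation by a plain mean are all illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 recorded runs for each of 5 tasks; the scores are
# synthetic stand-ins for real, human-normalized benchmark results.
all_scores = rng.normal(loc=1.0, scale=0.5, size=(100, 5))

N = 3              # number of runs a typical paper would report
n_subsamples = 10_000

# Repeatedly subsample N runs per task and aggregate across tasks, mimicking
# what a paper that only ran N seeds would have reported.
estimates = []
for _ in range(n_subsamples):
    idx = rng.choice(all_scores.shape[0], size=N, replace=False)
    per_task_mean = all_scores[idx].mean(axis=0)  # mean over the N runs, per task
    estimates.append(per_task_mean.mean())        # aggregate over tasks

estimates = np.asarray(estimates)
print(f"aggregate metric: {estimates.mean():.3f} +/- {estimates.std():.3f} across subsamples")
```

The spread of `estimates` is exactly the kind of variability that a single reported number hides.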
The results are rather shocking. The aggregated metrics have very high variability for many RL algorithms, and on top of that the reported aggregated results are often far from the means of these distributions, grossly over- or underestimating performance (see Figure 2.1 and Figure 2.2).
Also, different ways of aggregating performance across tasks lead to different rankings of algorithms. For example, while Dreamer reaches by far the highest median score over all tasks in Atari200M, it is IQN that performs best in terms of being close to or better than human performance on as many tasks as possible (measured by a newly introduced metric called the optimality gap, which ignores by how far an algorithm performs above human level); see below. As a consequence of the sound statistical analysis performed here, several conclusions drawn in the literature about the superiority of one algorithm over another have to be reconsidered or even discarded.
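For concreteness, the optimality gap can be understood as the average shortfall of human-normalized scores below a target of $\gamma = 1$, where any score above the target contributes nothing. A minimal numpy sketch of this idea follows; the function name and the example scores are made up for illustration:

```python
import numpy as np

def optimality_gap(normalized_scores: np.ndarray, gamma: float = 1.0) -> float:
    """Average amount by which runs fall short of the target score gamma.

    Scores above gamma contribute nothing, so "superhuman" margins are ignored.
    Lower is better; 0 means every run reached at least human-level performance.
    """
    return float(np.mean(np.maximum(gamma - normalized_scores, 0.0)))

# Hypothetical human-normalized scores from a handful of runs/tasks:
scores = np.array([0.4, 0.9, 1.7, 2.5, 0.1])
print(optimality_gap(scores))  # 0.32 -- only the shortfalls below 1.0 count
```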
Unfortunately, the obvious solution to this problem, which is simply to use more runs for evaluation, is often not feasible due to the excessive amount of compute required. The authors go on to propose task-aggregated statistics that are better suited to estimating quantities of high variability from even a few runs, such as the interquartile mean (IQM), which can be seen as an intermediate between mean and median. They also recommend reporting performance profiles as in Figure 7 instead of single scalars, as well as publishing the results of all runs, to enable future statistical analysis. The techniques used in this analysis were published as an open-source library: rliable.
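As a rough illustration of the recommended workflow, the sketch below computes the IQM with a bootstrap confidence interval using rliable, and a simple performance profile by hand. It is modeled on the library's README; the calls `metrics.aggregate_iqm` and `library.get_interval_estimates` are taken from there, while the scores, the algorithm name and the parameter values are placeholders:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Placeholder human-normalized scores: rows = runs, columns = tasks.
score_dict = {"my_algorithm": np.random.rand(5, 26)}

# IQM = mean over the middle 50% of all (run, task) scores:
# more robust to outliers than the mean, less noisy than the median.
aggregate_func = lambda scores: np.array([metrics.aggregate_iqm(scores)])

# Stratified-bootstrap point estimates and confidence intervals.
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)
print(point_estimates["my_algorithm"], interval_estimates["my_algorithm"])

# A performance profile reports, for each threshold tau, the fraction of
# runs (pooled over tasks) whose normalized score exceeds tau.
scores = score_dict["my_algorithm"]
taus = np.linspace(0.0, 2.0, 81)
profile = np.array([(scores > tau).mean() for tau in taus])
```

Reporting the whole profile, or at least the IQM together with its confidence interval, conveys far more information than a single aggregate number.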
In a sense, this paper is “simply” a call for the application of well-established, sound statistical methods for reporting results in RL. The authors propose to include their recommendations as acceptance requirements for popular ML conferences, thereby holding the community to a certain level of rigor. The fact that this work received an Outstanding Paper Award (Top 0.07%) at NeurIPS 2021 suggests that the community has recognized these issues as important, and makes it likely that future reporting of RL results will incorporate some of the recommendations and thereby improve significantly.
Personal comment: I believe this paper to be an important contribution and a call to action - without sound statistical analysis, reported results may be overly optimistic and cause frustration and disappointment in attempts at reproducing them, thereby undermining trust in deep RL in general. One point is missing though, I believe: the research community is (rightly) focused on general algorithms that perform well across many tasks, so nearly all comparisons are reported for a multitude of tasks simultaneously. However, in practice one is often interested in the performance on a single task. Analyzing the variability of different algorithms on a single task should therefore be of high interest for practitioners, but this kind of analysis is neither performed nor recommended here.