The field of simulation-based inference (SBI) has recently seen several new algorithms, in particular ones based on neural density estimators. However, a public, standardized benchmark spanning different kinds of tasks has been missing. Lueckmann et al. [Lue21B] introduce a benchmarking framework that enables comparison across tasks with different dimensionalities, posterior characteristics, and other relevant variables. The benchmark can also be extended with new algorithms, datasets, and metrics via pull requests on GitHub.
As a baseline, the authors provide a total of eight algorithms: classic Monte Carlo Approximate Bayesian Computation (ABC), posterior, likelihood, and likelihood-ratio estimation, and their sequential counterparts. The algorithms are classified according to [Cra20F] and depicted in Figure 1. The selection was intentionally kept small so that implementation details and hyperparameter tuning could be treated with care.
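To make the classic baseline concrete, here is a minimal, self-contained sketch of Monte Carlo rejection ABC on a toy problem. The simulator, the uniform prior, and the tolerance `eps` are illustrative assumptions, not part of the benchmark itself:

```python
import random

random.seed(0)

def simulator(theta: float, n: int = 20) -> float:
    """Toy simulator (an assumption for illustration):
    the mean of n draws from N(theta, 1)."""
    return sum(random.gauss(theta, 1.0) for _ in range(n)) / n

def rejection_abc(x_obs: float, num_simulations: int = 20_000,
                  eps: float = 0.05) -> list[float]:
    """Classic Monte Carlo rejection ABC: draw theta from the prior,
    simulate, and keep theta whenever the simulated summary falls
    within eps of the observed one."""
    accepted = []
    for _ in range(num_simulations):
        theta = random.uniform(-5.0, 5.0)   # uniform prior on [-5, 5]
        x_sim = simulator(theta)
        if abs(x_sim - x_obs) < eps:        # distance-based acceptance
            accepted.append(theta)
    return accepted

samples = rejection_abc(x_obs=1.0)
posterior_mean = sum(samples) / len(samples)
```

The accepted `samples` approximate the posterior over `theta`; shrinking `eps` improves the approximation at the cost of more rejected simulations, which is exactly the sample-inefficiency that the neural methods in the benchmark aim to overcome.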
The authors use ten public datasets which, in contrast to most real-world problems, allow sampling from the true posterior. This choice is motivated by the shortcomings of single-sample metrics: an algorithm that merely obtains a good MAP point estimate could pass a posterior-predictive check even if the rest of the estimated posterior is a poor fit. This observation led to the inclusion of metrics that compare whole sample sets, such as Maximum Mean Discrepancy (MMD) and Classifier 2-Sample Tests (C2ST). Each dataset targets specific traits of an algorithm, e.g. its behaviour with increasing dimensionality or spurious variables, or how it handles multimodal distributions.
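The intuition behind such sample-set metrics can be sketched with a minimal, self-contained estimator of squared MMD; the RBF kernel bandwidth and the Gaussian test distributions are illustrative assumptions, not the benchmark's configuration:

```python
import math
import random

random.seed(0)

def rbf_kernel(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two scalar samples."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs: list[float], ys: list[float]) -> float:
    """Biased estimator of squared Maximum Mean Discrepancy:
    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    It compares entire sample sets, not a single point estimate."""
    k_xx = sum(rbf_kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

reference = [random.gauss(0.0, 1.0) for _ in range(200)]  # "true" posterior
good_fit  = [random.gauss(0.0, 1.0) for _ in range(200)]  # matching samples
bad_fit   = [random.gauss(3.0, 1.0) for _ in range(200)]  # shifted samples

mmd_good = mmd_squared(reference, good_fit)
mmd_bad  = mmd_squared(reference, bad_fit)
```

A poor posterior approximation (`bad_fit`) yields a clearly larger MMD than a matching one (`mmd_good` is near zero), even when both could contain an equally good point estimate.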
The authors obtained the following key results:
- Algorithms using neural networks for density estimation outperform ABC-based methods.
- The choice of comparison metric affects the ranking of algorithms.
- Sequential methods are more sample-efficient than their non-sequential counterparts.
- Some tasks show substantial room for improvement.
- No single algorithm dominates across all tasks, i.e. “no free lunch”.
As the authors note, this benchmarking framework will make it easier to compare methods and to identify their respective strengths and weaknesses. Its value will only increase as researchers and practitioners contribute further algorithms, tasks, and metrics.