[Her23C] provides empirical evidence that several popular algorithms for Simulation-based Inference yield overconfident posterior approximations that exclude unlikely but plausible parameters. One should therefore estimate reliability alongside performance; however, the authors argue that popular metrics for the quality of posterior approximations are insufficient to detect overconfidence.

As an example, a classifier is trained to solve a two-sample test [Lop17R], i.e. to distinguish samples from the true posterior from samples from the approximated posterior. The discriminative power of the classifier is measured by the Area Under the Receiver Operating Characteristic curve (AUROC). In Figure 1, two different approximations obtain the same power of $AUROC=0.7$. Although the metric rates them as equally good, the overconfident one is more likely to produce unfaithful results by wrongly excluding plausible parameters.
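The classifier two-sample test can be sketched with plain NumPy. The toy distributions below (a standard normal as the "true" posterior and a narrower normal as an overconfident approximation) are illustrative assumptions, not from the paper, and a tiny logistic regression stands in for a neural classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy distributions (not from the paper): a "true"
# posterior N(0, 1) vs. an overconfident approximation N(0, 0.5).
true_samples = rng.normal(0.0, 1.0, size=(1000, 1))
approx_samples = rng.normal(0.0, 0.5, size=(1000, 1))

def c2st_auroc(p_samples, q_samples, epochs=2000, lr=0.5):
    """Classifier two-sample test: fit a tiny logistic regression to
    tell p-samples (label 1) from q-samples (label 0) and report its
    AUROC on held-out data. AUROC ~ 0.5 means indistinguishable."""
    X = np.vstack([p_samples, q_samples])
    X = np.hstack([X, X**2])  # quadratic feature captures scale differences
    y = np.r_[np.ones(len(p_samples)), np.zeros(len(q_samples))]
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    split = len(X) // 2
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):  # plain gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
        w -= lr * Xtr.T @ (p - ytr) / len(Xtr)
        b -= lr * (p - ytr).mean()
    scores = Xte @ w + b
    pos, neg = scores[yte == 1], scores[yte == 0]
    # AUROC via the Mann-Whitney statistic on the held-out scores
    return (pos[:, None] > neg[None, :]).mean()

auroc = c2st_auroc(true_samples, approx_samples)
print(f"AUROC = {auroc:.2f}")  # clearly above 0.5: the classifier detects the mismatch
```

For these two particular Gaussians the Bayes-optimal classifier reaches an AUROC of about 0.7, which is exactly the kind of score that, by itself, cannot tell an overconfident approximation from an underdispersed one.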

The authors ran extensive experiments on seven datasets to investigate
posterior estimation with neural density estimators and ABC algorithms.
Instead of estimating the quality of the approximation to the target
distribution with a hypothesis test, they assessed its degree of
conservativeness using the *expected coverage of credible regions*,
defined as

$$ \mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{p(\theta \mid \mathbf{x})}\left[\mathbb{1}(\theta \in \Theta_{\hat{p}(\theta \mid \mathbf{x})} (1-\alpha)) \right] \approx \frac{1}{n}\sum^n_{i=1}\mathbb{1}(\theta_i^{\ast} \in \Theta_{\hat{p}(\theta \mid \mathbf{x}_i)} (1-\alpha)) $$

where $(\theta^{\ast}_i, \mathbf{x}_i) \sim p(\theta, \mathbf{x})$ are i.i.d. samples from the joint distribution and $\Theta_{\hat{p}(\theta \mid \mathbf{x})}(1-\alpha)$ denotes the $1-\alpha$ highest posterior density region of the estimated posterior.
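The Monte Carlo estimate on the right-hand side can be sketched for a toy conjugate-Gaussian model (prior $\theta \sim \mathcal{N}(0,1)$, likelihood $\mathbf{x} \mid \theta \sim \mathcal{N}(\theta,1)$, hence true posterior $\mathcal{N}(x/2, 1/2)$); the model and the scale factors are assumptions chosen for illustration. Whether $\theta^{\ast}$ falls inside the highest-density region is decided by ranking its density among samples from the approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration): prior N(0, 1),
# likelihood x | theta ~ N(theta, 1) => true posterior N(x/2, 1/2).
n_pairs, n_post = 2000, 1000
theta_star = rng.normal(size=n_pairs)        # theta* ~ p(theta)
x = theta_star + rng.normal(size=n_pairs)    # x ~ p(x | theta*)

def expected_coverage(scale, level):
    """Expected coverage of the `level` highest-density region of an
    approximate posterior N(x/2, scale**2 * 1/2).  theta* lies inside
    iff fewer than `level` of the posterior samples are denser, i.e.
    closer to the mode (the density is symmetric and unimodal)."""
    mu = x / 2
    sd = scale * np.sqrt(0.5)
    samples = mu[:, None] + sd * rng.normal(size=(n_pairs, n_post))
    rank = (np.abs(samples - mu[:, None])
            < np.abs(theta_star - mu)[:, None]).mean(axis=1)
    return (rank < level).mean()

print(expected_coverage(1.0, 0.9))  # calibrated approximation: close to 0.9
print(expected_coverage(0.5, 0.9))  # overconfident: clearly below 0.9
print(expected_coverage(2.0, 0.9))  # conservative: above 0.9
```

The density-rank trick avoids having to construct the highest-density region explicitly, which is what makes the diagnostic practical for samplers and normalizing flows alike.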

The authors define a posterior estimator as conservative if its expected coverage probability is equal to or larger than the credibility level $1-\alpha$; this is the property one wants to guarantee. However, expected coverage alone says little about how informative a posterior is: the prior itself attains perfect coverage while carrying no information about $\theta$ given $\mathbf{x}$. The authors therefore propose the expected information gain as a second metric:

$$ \mathbb{E}_{p(\theta, \mathbf{x})}\left[ \log p(\theta \mid \mathbf{x}) - \log p(\theta) \right] $$
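For a toy conjugate-Gaussian setup (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(\theta,1)$, posterior $\mathcal{N}(x/2,1/2)$; an assumption for illustration), this expectation equals the mutual information between $\theta$ and $\mathbf{x}$ and can be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration): prior N(0, 1),
# likelihood x | theta ~ N(theta, 1) => posterior N(x/2, 1/2).
n = 200_000
theta = rng.normal(size=n)          # theta ~ p(theta)
x = theta + rng.normal(size=n)      # x ~ p(x | theta)

def log_normal_pdf(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# Monte Carlo estimate of E[log p(theta | x) - log p(theta)] over the
# joint; for this model the exact value is 0.5 * log(2) ≈ 0.347.
eig = np.mean(log_normal_pdf(theta, x / 2, 0.5)
              - log_normal_pdf(theta, 0.0, 1.0))
print(f"estimated EIG = {eig:.3f} (exact: {0.5 * np.log(2):.3f})")
```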

To investigate the conservativeness of the algorithms, the expected coverage probabilities are computed on a set of unseen samples from $p(\theta, \mathbf{x})$, for all credibility levels under consideration. A well-calibrated estimator has an expected coverage probability of exactly $1-\alpha$ and produces a diagonal line when the coverage probability is plotted against the credibility level. Conservative estimators produce a curve above the diagonal, overconfident estimators a curve below it.
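The shape of these curves can be checked in closed form for a 1-D Gaussian toy case in which the approximation has the correct mean but a standard deviation $s$ times that of the true posterior (an illustrative assumption; $s<1$ is overconfident, $s>1$ conservative):

```python
import math

def coverage_curve(s, levels):
    """Closed-form expected coverage when the approximation's std is s
    times the true posterior std (same mean): theta* falls inside the
    HPD region iff |z| < s * z_level for a standard normal z."""
    Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
    def Phi_inv(p, lo=-10.0, hi=10.0):  # bisection keeps this dependency-free
        for _ in range(80):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
        return (lo + hi) / 2
    return [2 * Phi(s * Phi_inv((1 + l) / 2)) - 1 for l in levels]

levels = [0.1, 0.3, 0.5, 0.7, 0.9]
print([round(c, 2) for c in coverage_curve(1.0, levels)])  # calibrated: on the diagonal
print([round(c, 2) for c in coverage_curve(0.5, levels)])  # overconfident: below it
print([round(c, 2) for c in coverage_curve(2.0, levels)])  # conservative: above it
```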

The expected coverage curves for the different algorithms and datasets show that no method produces conservative results on all tasks. The sequential variants tend to be overconfident while requiring fewer simulations, i.e. a smaller simulation budget, as seen in Figure 2. The authors also investigate ensembles, which turn out to be more conservative.

The authors obtained the following **key findings**:

- All considered algorithms produced non-conservative posterior approximations for at least one task. Small simulation budgets amplify this behavior. However, a large budget is no guarantee for a conservative posterior approximation.
- Amortized approaches are more conservative than non-amortized ones. This might be due to the latter strengthening their overconfidence each round.
- The expected coverage probability of an ensemble is larger than that of the average individual model, and increasing the ensemble size improves the expected coverage probability further.
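The ensemble effect can be illustrated in a toy conjugate-Gaussian setting (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(\theta,1)$, true posterior $\mathcal{N}(x/2,1/2)$; all scales, biases, and the ensemble size below are assumptions for illustration): each hypothetical member is overconfident and slightly biased, and the equal-weight mixture covers more of the plausible region than a single member does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration): prior N(0, 1),
# likelihood x | theta ~ N(theta, 1) => true posterior N(x/2, 1/2).
n_pairs, n_post, n_members = 1000, 500, 5
theta_star = rng.normal(size=n_pairs)
x = theta_star + rng.normal(size=n_pairs)
mu, sd = x / 2, np.sqrt(0.5)

# Each hypothetical member is overconfident (0.6 * true std) and has
# its own random bias, mimicking independently trained networks.
member_std = 0.6 * sd
means = mu[:, None] + 0.5 * sd * rng.normal(size=(n_pairs, n_members))

def mixture_logpdf(v, means, std):
    # v: (n_pairs, k) evaluation points; equal-weight Gaussian mixture
    z = (v[:, :, None] - means[:, None, :]) / std
    return np.log(np.mean(np.exp(-0.5 * z**2), axis=2))  # constants dropped

def coverage(means, std, level):
    """theta* lies in the `level` HPD region iff fewer than `level` of
    the samples drawn from the approximation have a higher density."""
    comp = rng.integers(means.shape[1], size=(n_pairs, n_post))
    samples = (np.take_along_axis(means, comp, axis=1)
               + std * rng.normal(size=(n_pairs, n_post)))
    dens_star = mixture_logpdf(theta_star[:, None], means, std)
    rank = (mixture_logpdf(samples, means, std) > dens_star).mean(axis=1)
    return (rank < level).mean()

single = coverage(means[:, [0]], member_std, 0.9)   # one member alone
ensemble = coverage(means, member_std, 0.9)         # 5-member mixture
print(f"single: {single:.2f}  ensemble: {ensemble:.2f}")  # ensemble covers more
```

The mixture is wider than any single member because the members disagree, which is exactly the mechanism by which deep ensembles improve coverage without explicitly targeting it.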