Local calibration: metrics and recalibration

A calibration method that takes sample similarity into account, automatically providing group calibration even when groups are unknown.

To compute the calibration of a classifier, one typically looks only at the confidences it outputs, independently of the properties of the samples considered: calibration is, so to speak, “global in feature space” (for an introduction to calibration, consider reading our review of the topic or attending our training). However, when one is interested in fair outcomes, calibration matters within groups of interest such as ethnicity or gender: one wants to avoid being consistently underconfident on one subgroup (e.g. patients with a rare disease) and overconfident on another (e.g. patients with the flu). Under some conditions, essentially very low prediction error, group calibration can be expected to hold automatically, as investigated in [Liu19I]. Conversely, group calibration is by itself insufficient for fairness: a predictor that simply outputs the class frequency within each subgroup is calibrated but useless.

Nevertheless, in practice it often makes sense to consider calibration conditioned on subgroups (e.g. when the model is not accurate enough to auto-calibrate). One then wishes to minimise the Expected Calibration Error (ECE), the average discrepancy between confidence and accuracy, within subsets of the population. [Heb18M] put forward an algorithm for so-called multi-calibration, i.e. calibration with respect to all possible groups in a given concept class, not only those explicitly encoded as categorical features, but its computational cost is high.
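To make the ECE definition concrete, here is a minimal sketch using equal-width confidence bins; the function name and the toy data are ours, not from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over confidence bins,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # bin weight times calibration gap
    return ece

# A perfectly calibrated toy example: 80% confidence, 80% accuracy.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(expected_calibration_error(conf, corr))  # 0.0
```

Restricting `confidences` and `correct` to a subgroup before calling the function gives the group-wise ECE discussed above.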

In order to achieve calibration for intrinsically defined groups, as opposed to groups explicitly defined by some categorical input feature, the authors of [Luo22L] (presented this August at UAI 2022) propose to use soft clusters and a modified calibration error. First, they define a similarity measure between samples using a Laplacian kernel over learned features. In their experiments the features are intermediate layers of one of several networks, the rationale being that Euclidean distances are semantically meaningful in these feature spaces. Second, ECE is extended to include a weighting by this kernel.
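The kernel-weighted construction can be sketched as follows. This is our own simplified reading, not the paper's exact definition: we assume a Laplacian kernel $\exp(-\lVert z - z' \rVert_1 / \gamma)$ on embeddings $z$, and compute a per-sample local calibration error as the kernel-weighted discrepancy between confidence and correctness in each point's neighbourhood:

```python
import numpy as np

def laplacian_kernel(z, z_ref, gamma):
    """Laplacian kernel on feature embeddings: exp(-||z - z'||_1 / gamma)."""
    dists = np.abs(z[:, None, :] - z_ref[None, :, :]).sum(axis=-1)
    return np.exp(-dists / gamma)

def local_calibration_error(z, confidences, correct, gamma):
    """Per-sample LCE: kernel-weighted average of the signed gap between
    confidence and correctness over each point's soft neighbourhood."""
    w = laplacian_kernel(z, z, gamma)      # (n, n) similarity weights
    w /= w.sum(axis=1, keepdims=True)      # normalise each row to sum to 1
    gap = confidences - correct            # signed miscalibration per sample
    return np.abs(w @ gap)                 # weighted local discrepancy

# Toy data standing in for learned embeddings and model outputs.
rng = np.random.default_rng(0)
z = rng.normal(size=(50, 8))
conf = rng.uniform(0.5, 1.0, size=50)
corr = (rng.uniform(size=50) < conf).astype(float)
lce = local_calibration_error(z, conf, corr, gamma=1.0)
print(lce.max())  # the maximum over samples plays the role of MLCE
```

With uniform weights this reduces to a single global gap; the kernel is what localises the error to feature-space neighbourhoods.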

The resulting definition of Local Calibration Error is stricter than the conventional Maximum Calibration Error (MCE, itself stricter than ECE): optimising for Maximum Local Calibration Error (MLCE) leads to significant improvements in group-wise MCE, despite the fact that the method proposed for this optimisation (LoRe, Local Recalibration) is unaware of the actual groups.

LoRe is a generalised form of histogram binning in which samples in the same confidence bin are weighted by the kernel: predicted confidences are corrected to match a similarity-weighted average of correctness over similar samples. This yields large improvements in group-wise calibration at the cost of some global calibration error.
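A rough sketch of such kernel-weighted histogram binning, under the same assumptions as above (Laplacian kernel on embeddings; function and variable names are ours, and details of the actual LoRe algorithm may differ):

```python
import numpy as np

def lore_recalibrate(conf_test, z_test, conf_cal, z_cal, correct_cal,
                     gamma=1.0, n_bins=10):
    """Kernel-weighted histogram binning: each test confidence is replaced
    by the similarity-weighted accuracy of calibration samples that fall
    into the same confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    test_bin = np.clip(np.digitize(conf_test, edges) - 1, 0, n_bins - 1)
    cal_bin = np.clip(np.digitize(conf_cal, edges) - 1, 0, n_bins - 1)
    # Laplacian similarity between every test and calibration embedding.
    w = np.exp(-np.abs(z_test[:, None, :] - z_cal[None, :, :]).sum(-1) / gamma)
    out = np.asarray(conf_test, dtype=float).copy()
    for i in range(len(out)):
        same_bin = cal_bin == test_bin[i]
        if same_bin.any():
            wi = w[i, same_bin]
            out[i] = wi @ correct_cal[same_bin] / wi.sum()
    return out

# Usage with synthetic embeddings and a held-out calibration set.
rng = np.random.default_rng(1)
z_cal = rng.normal(size=(200, 4))
conf_cal = rng.uniform(0.0, 1.0, 200)
correct_cal = (rng.uniform(size=200) < conf_cal).astype(float)
z_test = rng.normal(size=(20, 4))
conf_test = rng.uniform(0.0, 1.0, 20)
recal = lore_recalibrate(conf_test, z_test, conf_cal, z_cal, correct_cal)
```

With uniform kernel weights this collapses to plain histogram binning; the kernel shifts each corrected confidence towards the accuracy observed among feature-space neighbours.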

Table 1. Performance on downstream fairness, as measured by maximum group-wise MCE (lower is better). Experimental settings as described in Section 5.3. Mean and standard deviations are computed over 60 random seeds for settings 1 and 4, and 20 for settings 2 and 3. Best results are bold.

It is interesting to consider what the bandwidth of the kernel means. As it grows towards infinity, MLCE converges to MCE. As it shrinks towards 0, the effective neighbourhood of a point reduces to the point itself, and MLCE becomes the discrepancy between confidence and “accuracy” for a single point.
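These two limits are easy to see numerically. Assuming the kernel form $\exp(-d/\gamma)$ used in the sketches above (the distances below are made up for illustration), a tiny bandwidth puts all weight on the point itself, while a huge one spreads weight uniformly over the dataset:

```python
import numpy as np

# L1 distances from a query point to itself and three neighbours.
d = np.array([0.0, 0.5, 1.0, 2.0])

for gamma in (1e-3, 1.0, 1e3):
    w = np.exp(-d / gamma)
    w /= w.sum()
    print(f"gamma={gamma:g}  weights={np.round(w, 3)}")
# Small gamma -> weight ~[1, 0, 0, 0] (a point is its own neighbourhood);
# large gamma -> near-uniform weights (LCE tends to the global gap).
```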

Figure 4. MLCE vs. kernel bandwidth $\gamma$ for ImageNet. LoRe (with t-SNE and $\gamma = 0.2$) achieves the lowest MLCE for a wide range of $\gamma$. This suggests that LoRe leads to lower LCE values across the whole dataset.

