Decisions based on the output of classifiers are sensitive not only to accuracy or type I/II error rates, but also to how faithfully the classifiers' confidence estimates represent the true class distribution. Failing to account for this calibration can have grave consequences, as this game tries to demonstrate. You can play the game here.
Concepts illustrated
- The top predicted class is not enough in many decision contexts. Instead, all class (disease) probabilities provided to the decision maker must be correct, i.e. the classifier providing them must be calibrated. This is because the decision that maximizes total expected lifespan depends directly on all probabilities, not just on the predicted class (see the first sketch after this list).
- Optimal decision-making becomes increasingly difficult as calibration worsens, even if (top-class) accuracy is maintained.
- Recalibration of models is possible and can happen intuitively: given an uncalibrated classifier, the player can outperform the optimal algorithm that relies purely on the reported disease probabilities. By observing the outcomes of each round, the player can in principle “recalibrate” their own decision-making process, especially if the classifier is consistently overconfident or consistently underconfident (see the second sketch after this list).
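To make the first point concrete, here is a minimal sketch of expected-lifespan maximization. The treatments, diseases, and payoff numbers are made up for illustration and are not the game's actual values; TypeScript is used throughout since the app itself is a React application.

```typescript
// expectedLifespan[treatment][d]: hypothetical lifespan if the treatment
// is given and disease d turns out to be the true one.
const expectedLifespan: Record<string, number[]> = {
  treatmentA: [80, 70, 40],
  treatmentB: [75, 78, 60],
};

// Choose the treatment maximizing sum_d p(d) * lifespan(treatment, d).
// The result depends on the full probability vector, so miscalibrated
// probabilities can flip the choice even when the top class is right.
function bestTreatment(probs: number[]): string {
  let best = "";
  let bestValue = -Infinity;
  for (const [treatment, lifespans] of Object.entries(expectedLifespan)) {
    const value = lifespans.reduce((acc, l, d) => acc + probs[d] * l, 0);
    if (value > bestValue) {
      bestValue = value;
      best = treatment;
    }
  }
  return best;
}

// "treatmentB": even though disease 0 is the most likely and treatmentA
// is best for it, the expected-value-optimal choice differs.
console.log(bestTreatment([0.5, 0.3, 0.2]));
```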
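And a sketch of the kind of intuitive recalibration the third point describes, again with hypothetical names and a deliberately crude adjustment rule:

```typescript
// If the classifier's average reported confidence exceeds its observed
// accuracy over past rounds, it is overconfident and its probabilities
// should be flattened (and vice versa).
function confidenceGap(reportedTopProbs: number[], wasCorrect: boolean[]): number {
  const avgConfidence =
    reportedTopProbs.reduce((a, b) => a + b, 0) / reportedTopProbs.length;
  const accuracy = wasCorrect.filter(Boolean).length / wasCorrect.length;
  return avgConfidence - accuracy; // > 0 means overconfident
}

// Blend reported probabilities toward the uniform distribution by an
// amount tied to the observed gap.
function adjust(probs: number[], gap: number): number[] {
  const k = probs.length;
  const shrink = Math.max(0, Math.min(1, gap));
  return probs.map((p) => (1 - shrink) * p + shrink / k);
}
```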
About the implementation
Kale is a single-page React application that runs fully in the browser and uses our calibration library kyle. To achieve this, we embed Pyodide, a fully functional port of the CPython interpreter, in the browser.
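A minimal sketch of how such an embedding can look, assuming the npm pyodide package and that kyle is installable via micropip under that name (both are assumptions; the app's actual wiring may differ):

```typescript
import { loadPyodide } from "pyodide";

// Download and initialize the WebAssembly build of CPython in the browser.
async function initKyle(): Promise<void> {
  const pyodide = await loadPyodide();

  // micropip installs pure-Python wheels at runtime; whether kyle is
  // published under this exact name is an assumption.
  await pyodide.loadPackage("micropip");
  const micropip = pyodide.pyimport("micropip");
  await micropip.install("kyle");

  // Python code can now be executed directly from the page.
  const name = await pyodide.runPythonAsync(`
import kyle
kyle.__name__
`);
  console.log(name);
}

initKyle();
```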