The difficulty of understanding the decisions of complex neural networks is a major limiting factor for their adoption in practice. While a huge number of techniques has been proposed, many of them provide rather technical explanations that are not easily understood by laypeople. A good example is saliency maps. In audio prediction tasks, saliency maps are typically overlaid on audiograms or spectrograms. Such saliency maps offer only limited transparency, however, because spectrograms are too technical to be interpreted by non-domain experts. Another drawback is that the most important regions for a decision might coincide across all possible classes. For instance, the saliency maps of an emotion prediction model trained on face images might always highlight the eyes and the mouth of the face. Hence, saliency maps are insufficient to assess why a certain emotion was predicted by the model.
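To make the first point concrete, here is a minimal sketch of a gradient-based saliency map for a spectrogram classifier. The toy model, input shape, and number of emotion classes are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for an emotion classifier on mel spectrograms.
# Architecture and shapes are made up for this sketch.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 4),                          # 4 hypothetical emotion classes
)

# One spectrogram: (batch, channels, mel bins, time frames)
spectrogram = torch.randn(1, 1, 128, 64, requires_grad=True)

logits = model(spectrogram)
logits[0, logits.argmax()].backward()         # gradient w.r.t. the predicted class

# Saliency = magnitude of the input gradient: high values mark the
# time-frequency regions the prediction is most sensitive to. For a layperson,
# such a heatmap over a spectrogram remains hard to read, which is exactly
# the critique above.
saliency = spectrogram.grad.abs().squeeze()   # shape: (128, 64)
```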
The authors of [Zha22R] (best-paper award at CHI 2022) argue further that
the way in which an AI system makes its decision should be human-like
in order to earn people’s trust. They draw inspiration from theories
in cognitive psychology and design their architecture after the perceptual
process [Car78P], which states that people
select, organize, and interpret information to make decisions: 1) First, a
subset of the sensory information is selected. 2) The selected regions are organized into meaningful cues, e.g. ears, mouth, and nose for a face. 3) Finally, the low-level cues are interpreted in terms of high-level concepts, e.g. the characteristics of ears, mouth, and nose are used to distinguish humans from animals.
In the proposed architecture, the first step of the perceptual process corresponds to saliency maps, where the high-saliency regions represent the selected subset of the sensory information. The cues are domain-specific and predefined. Depending on the context, they can either be computed directly from the input or learned from annotated examples. Most theories assume that the brain generates or remembers counterfactual examples and compares the current perception against them on the cues in order to attribute the perception to high-level concepts. The authors propose to use generative adversarial networks and style transfer to generate examples from all classes that are otherwise similar to the input. The differences on the cues between the input and these generated examples yield contrastive cues, which serve as an additional layer of explanation. The final classification is made from the contrastive cues together with an embedding of the input obtained from the original model, which makes the decision process more relatable. The entire architecture is depicted below. The system provides the saliency maps, the cues, and the contrastive cues to explain the decision.
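To make the decision flow more tangible, below is a rough sketch of how such an architecture could be wired together. Every component is a stand-in: the names PerceptualClassifier, cue_extractor, and counterfactual_gen, as well as all shapes and cue definitions, are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerceptualClassifier(nn.Module):
    def __init__(self, backbone, cue_extractor, counterfactual_gen,
                 n_classes, emb_dim, n_cues):
        super().__init__()
        self.backbone = backbone                      # original model -> input embedding
        self.cue_extractor = cue_extractor            # input -> predefined low-level cues
        self.counterfactual_gen = counterfactual_gen  # e.g. a GAN / style-transfer model
        self.n_classes = n_classes
        # The final decision uses the embedding *and* the contrastive cues.
        self.head = nn.Linear(emb_dim + n_classes * n_cues, n_classes)

    def forward(self, x):
        embedding = self.backbone(x)      # selected and organized representation
        cues = self.cue_extractor(x)      # shape: (batch, n_cues)

        # For every class, generate a counterfactual that is otherwise similar
        # to the input and compare it to the input on the cues.
        contrastive = torch.cat(
            [cues - self.cue_extractor(self.counterfactual_gen(x, c))
             for c in range(self.n_classes)],
            dim=1,
        )                                 # shape: (batch, n_classes * n_cues)

        logits = self.head(torch.cat([embedding, contrastive], dim=1))
        return logits, cues, contrastive  # cues + contrastive cues serve as explanations


# Toy stand-ins, only to show that the pieces fit together.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(128 * 64, 32))
cue_extractor = nn.Sequential(nn.Flatten(), nn.Linear(128 * 64, 5))
counterfactual_gen = lambda x, target: x + 0.1 * target   # placeholder "generator"

model = PerceptualClassifier(backbone, cue_extractor, counterfactual_gen,
                             n_classes=4, emb_dim=32, n_cues=5)
logits, cues, contrastive = model(torch.randn(2, 1, 128, 64))
```

The saliency maps of the first step would come from the original model (the backbone here) and are omitted from this sketch for brevity.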
The paper concludes with extensive empirical studies on a voice emotion prediction task. The authors show that the model not only provides better explanations in think-aloud and controlled user studies, but also outperforms the vanilla convolutional neural network in accuracy. If you would like to try it, there is a nice demo and some short video presentations on the authors' lab page.