This series is dedicated to understanding the complexities of AI models and making them more transparent and comprehensible to a wide range of audiences. Explainable AI (XAI) is a field focused on making the decisions and processes of AI systems understandable to humans. This is increasingly important due to the growing use of AI in critical domains and the need for accountability and trust in AI systems.
Overview of Explainable AI
Explainable AI refers to techniques and methods in the field of artificial intelligence that make the outputs of AI systems understandable and interpretable by humans. This is in contrast to the ‘black box’ nature of many AI models, where the decision-making process is opaque and difficult to interpret.
Government Regulations
The rise of AI applications in critical sectors has led to increased government interest and regulation. Many regions are now implementing guidelines and laws requiring AI systems to be explainable and transparent, especially when decisions impact human lives. This series will explore how XAI is shaping up to meet these regulatory requirements, offering both challenges and opportunities for AI developers and users.
Applications of Explainable AI
XAI plays a crucial role in enhancing trust, reliability, fairness, robustness, and resilience in various domains. By providing clear and understandable explanations of AI model decisions, XAI addresses several critical application areas [Kam21E, Lei23E]:
- Building Trust: XAI fosters trust in AI systems among users and stakeholders by making the decision-making process transparent. In sectors like healthcare and finance, where decisions have significant impacts, understanding the rationale behind AI predictions is essential for user acceptance. 
- Ensuring Reliability: In critical applications such as autonomous driving or aerospace, XAI helps in verifying that AI models function as intended. By providing insights into model decisions, XAI allows engineers and developers to validate and improve the reliability of these systems. 
- Promoting Fairness: XAI is instrumental in identifying and mitigating biases in AI models. In areas like recruitment or loan approval, explainability ensures that decisions are made fairly, without unjust discrimination, by revealing how different factors contribute to the outcome. 
- Enhancing Robustness: XAI aids in detecting vulnerabilities or weaknesses in AI models. By understanding how different inputs affect predictions, developers can fortify models against adversarial attacks or unexpected input variations, enhancing their robustness. 
- Improving Resilience: In dynamic environments, XAI contributes to the resilience of AI systems by facilitating rapid adaptation and troubleshooting. For instance, in changing market conditions, XAI can help financial models adjust to new data patterns. 
- Regulatory Compliance: With increasing regulation around AI, XAI assists in meeting legal and ethical standards by providing auditable explanations of model decisions, essential for compliance in regulated industries. 
- Personalization in Services: In consumer-facing industries like retail or entertainment, XAI enables personalized services by explaining recommendations or choices, thereby enhancing user experience and engagement. 
- Research and Development: In scientific research, XAI helps in hypothesis generation and validation by uncovering new patterns or relationships within data, accelerating innovation and discovery. 
By addressing these application areas, XAI not only improves the functionality and acceptance of AI systems but also ensures they align with ethical standards and societal values, making them more beneficial and acceptable to a wider audience.
Qualities of Explanations
In Explainable AI, the effectiveness of an explanation is gauged by several key qualities, each contributing to how well it meets audience needs [Mol22I]:
- Accuracy: Measures how well an explanation predicts unseen data. High accuracy is crucial when the explanation is used for predictions in place of the model. However, lower accuracy may be acceptable if the model itself is inaccurate and the goal is to clarify the model’s behavior; in that case, fidelity is what matters.
- Fidelity: Indicates how closely an explanation reflects the actual behavior of the model it represents. High fidelity is vital for truly understanding and trusting the model, particularly in critical applications. 
- Comprehensibility: Concerns the ease with which the target audience can grasp the explanation. Influenced by the complexity of the explanation and the audience’s background knowledge, comprehensibility is crucial for user acceptance, trust, and collaborative interactions. 
- Certainty: Assesses how well the explanation conveys the model’s confidence in its predictions. This aspect is key for informed decision-making and risk assessment in dynamic or uncertain situations. 
Balancing these qualities is essential for crafting explanations that are accurate, trustworthy, clear, and practically valuable in XAI.
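The difference between accuracy and fidelity can be made concrete with a small sketch. The labels below are hypothetical, and measuring both qualities as simple label agreement is an assumption made for illustration; other measures are possible.

```python
def agreement(preds_a, preds_b):
    """Fraction of positions where two label sequences agree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Hypothetical labels for six held-out instances.
y_true = [1, 0, 1, 1, 0, 0]        # ground truth
model_preds = [1, 0, 1, 0, 0, 1]   # predictions of the black-box model
expl_preds = [1, 0, 1, 0, 0, 1]    # predictions of the explanation (e.g. a surrogate)

accuracy = agreement(expl_preds, y_true)       # explanation vs. ground truth
fidelity = agreement(expl_preds, model_preds)  # explanation vs. model
print(f"accuracy={accuracy:.2f}, fidelity={fidelity:.2f}")
```

Here the explanation reproduces the model perfectly (fidelity 1.0) while inheriting the model's mistakes (accuracy 4/6), which is the situation in which a lower accuracy may be acceptable.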
Types of Explanations
In Explainable AI, various explanation types address different aspects of a model’s decision-making process [Mol22I]:
- Feature Importance Explanations: Identify and rank features by their impact on the model’s predictions, crucial in areas like finance and healthcare for understanding feature relevance. 
- Example-Based Explanations: Use specific instances to demonstrate the model’s behavior, effective in fields where concrete examples are more illustrative, such as image or speech recognition. 
- Counterfactual Explanations: Explain what changes in inputs could lead to different outcomes, offering actionable insights for scenarios requiring understanding of outcome alterations. 
- Local Explanations: Focus on individual predictions to explain why the model made a certain decision, using tools like LIME and SHAP. They are key in applications needing in-depth insights into singular decisions, like patient diagnosis in healthcare. 
- Global Explanations: Provide an overarching view of the model’s behavior, elucidating general patterns and rules, essential for contexts requiring comprehensive understanding and transparency, such as policy-making. 
- Causal Explanations: Delve into cause-and-effect relationships within the decision process, vital for fields where understanding these dynamics is crucial, like scientific research and economics. 
Each explanation type offers a distinct lens on the AI model’s decision-making, chosen based on the specific needs of the context, audience, and task.
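As an illustration of the counterfactual type, the following minimal sketch searches for the smallest change to a single feature that flips a decision. The loan model, its coefficients, and the helper names are all hypothetical; practical counterfactual methods typically search over several features and add plausibility constraints.

```python
def approve(applicant):
    """Hypothetical loan model: approve when a linear score reaches a threshold."""
    score = 5 * applicant["income"] - 10 * applicant["debt"]
    return score >= 200

def counterfactual(applicant, feature, step, max_steps=1000):
    """Smallest multiple of `step` added to `feature` that flips the decision."""
    original = approve(applicant)
    for k in range(1, max_steps + 1):
        candidate = dict(applicant)
        candidate[feature] = applicant[feature] + k * step
        if approve(candidate) != original:
            return candidate
    return None  # no counterfactual found within the search budget

applicant = {"income": 50, "debt": 10}  # both in thousands, currently rejected
cf_income = counterfactual(applicant, "income", step=1)
cf_debt = counterfactual(applicant, "debt", step=-1)
print(cf_income, cf_debt)
```

The two results read as actionable statements: the application would have been approved with an income of 60 instead of 50, or with a debt of 5 instead of 10.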
Interpretability by Design vs. Post-Hoc Methods
Within the world of explainable AI, two broad categories of methods are used to interpret AI models: intrinsically interpretable models and post-hoc blackbox methods. Intrinsically interpretable models derive their predictions via a transparent process that is naturally understandable to humans, while post-hoc blackbox methods interpret the predictions of opaque models after they have been trained. Blackbox predictions are made without revealing the decision-making process, so post-hoc explanations are often themselves learned from the predictions of the blackbox model and might not be accurate [Rud19S]. From the point of view of explainability, intrinsically interpretable models therefore have a conceptual advantage, since their explanations are guaranteed to reflect the true decision-making process of the model. However, one might need to design the model more specifically for the task at hand [Bel22I].
Intrinsically Interpretable Models
Intrinsically interpretable models are designed to be naturally understandable, sometimes sacrificing some level of expressivity for transparency (although whether this trade-off is truly necessary is a contentious point, and [Rud19S] argues that it often is not). These models include decision trees, linear models, and rule-based systems, which provide insights into the decision-making process directly through their structure and the way they process data.
Key Features
- Transparency: The model’s internal workings are understandable by human intuition.
- Direct Interpretability: The decision-making process is clear without additional analysis or tools.
Classical Interpretable Models
Classical interpretable models in machine learning, known for their simplicity and transparency, include decision trees, linear models, and rule-based systems [Mol22I, Kam21E]:
- Decision Trees: Tree-structured for classification and regression, splitting data based on criteria to form a prediction path. Advantages include ease of understanding, visualization, and handling diverse data types without needing data scaling. However, they can overfit and be unstable with data changes. 
- Linear Models: These models (like linear and logistic regression) predict outcomes based on linear combinations of input features, characterized by simplicity and ease of interpretation. Effective for linear relationships and computationally efficient, but limited in handling complex, non-linear relationships, outliers, or multicollinearity. 
- Rule-Based Systems: Employ human-readable ‘if-then’ rules for decision-making, with each rule specifying a condition leading to a conclusion. Highly interpretable and easy to update, these systems depend on the quality of the rules and can become complex with many rules, potentially struggling with generalization. 
These models are vital in areas requiring clear insight into decision processes, such as healthcare and finance, offering a balance between predictive accuracy and interpretability.
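To illustrate what direct interpretability means for a linear model, the sketch below splits a prediction into one additive contribution per feature. The coefficients and the feature names are hypothetical stand-ins for an already fitted regression.

```python
# Hypothetical coefficients of an already fitted linear regression.
intercept = 10.0
weights = {"rooms": 3.0, "age": -0.5, "distance": -2.0}

def predict_with_contributions(x):
    """Return the prediction and each feature's additive contribution w_i * x_i."""
    contributions = {name: weights[name] * x[name] for name in weights}
    prediction = intercept + sum(contributions.values())
    return prediction, contributions

house = {"rooms": 4, "age": 20, "distance": 1.5}
prediction, contributions = predict_with_contributions(house)
print(prediction, contributions)
```

No additional tool is needed to explain the result: the prediction is the intercept plus the listed contributions, so each feature's share can be read off directly.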
Probabilistic Models
Probabilistic models in machine learning, known for their transparent approach and inherent interpretability, include graphical models like Bayesian networks and Markov models, as well as time series models such as SARIMA and Prophet:
- Graphical Models [Win20M]: Utilize graph-based representations to depict conditional dependencies between variables, aiding in understanding complex relationships and data structures. Their visual nature enhances comprehensibility, and they effectively handle uncertainty and incomplete data. However, they can become complex with more variables and require solid domain knowledge for correct setup. 
- Time Series Models: These models clarify temporal dynamics and forecast future events based on historical data. However, they demand substantial domain knowledge and may struggle with noisy or non-stationary data. Examples include:
  - SARIMA: Excels in forecasting time series data, accommodating both non-seasonal and seasonal components.
  - Prophet [Tay17F]: Optimized for daily observations with patterns across different time scales, effective for data with strong seasonal effects.
 
These probabilistic models are valuable for their deep insights into data structures and decision-making processes, especially in tasks requiring high interpretability.
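The following sketch shows how a small graphical model makes its reasoning inspectable. The network (Rain -> WetGrass <- Sprinkler) and all probabilities are hypothetical, and inference is done by brute-force enumeration, which only works for tiny networks.

```python
from itertools import product

# Hypothetical conditional probability tables.
P_RAIN = 0.2
P_SPRINKLER = 0.3
P_WET = {  # P(WetGrass=True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.00,
}

def joint(rain, sprinkler, wet):
    """Joint probability, factorised along the graph structure."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)

def prob_rain(**evidence):
    """P(Rain=True | evidence), summing the joint over all consistent worlds."""
    numerator = denominator = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        denominator += p
        if rain:
            numerator += p
    return numerator / denominator

p_given_wet = prob_rain(wet=True)
p_given_wet_and_sprinkler = prob_rain(wet=True, sprinkler=True)
print(round(p_given_wet, 3), round(p_given_wet_and_sprinkler, 3))
```

Observing wet grass raises the probability of rain above its prior of 0.2, while additionally learning that the sprinkler was on lowers it again. Every step of this reasoning follows from the tables and the graph, which is where the interpretability of such models comes from.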
Interpretable Deep Learning Models
Interpretable deep learning models effectively merge the predictive capabilities of neural networks with transparency, vital for applications where understanding decision-making is key. Notable models include:
- Symbolic-Based Models: These models incorporate symbolic expressions within neural networks. This process entails developing a model aligned with the data, training it, and then integrating symbolic expressions to replace internal functions. They are highly interpretable, offering analytical parallels to the model’s predictions, but require accurate symbolic fitting and can be complex. 
- Interpretable Attention Mechanisms [Lim21T]: Attention mechanisms in models like Temporal Fusion Transformers (TFT) enhance interpretability in time series forecasting and other applications. They focus on important features or time steps in the data, providing insights into how different elements influence the model’s predictions. While offering clearer understanding of model decisions, they may introduce higher computational demands during training. 
- Prototype-Based Models (ProtoPNet and ProtoTree) [Nau21T, Nau21N]: These models use representative features or ‘prototypes’ for decision-making, as seen in ProtoPNet for image classification, and organize these prototypes in a decision tree structure in ProtoTree. They offer transparency by allowing comparisons between inputs and learned prototypes, combining the interpretability of decision trees with deep learning’s power. However, the complexity of these models can sometimes obscure understanding, particularly in more intricate tree structures. Additionally, the presented prototypes might be misleading, as they are further processed by non-interpretable mechanisms downstream [Hof21T, Nau21T]. 
Overall, these models demonstrate significant progress in making deep learning more interpretable and user-friendly, crucial for areas requiring clear and transparent decision-making processes.
Post-Hoc Blackbox Methods
Post-hoc methods are used to interpret models that are inherently complex and opaque (‘blackbox’), like deep neural networks or ensemble methods. These techniques are applied after the model has been trained and include feature importance scores, partial dependence plots, and LIME (Local Interpretable Model-agnostic Explanations).
Key Features
- Model Agnosticism: Applicable to any machine learning model.
- Insightful Analysis: Provides a deeper understanding of model behavior, often through visualizations.
- Reliance on proxies: Interpretations are often based on approximations of the model’s decision-making process, simpler representations, templates, or other proxies, making many of their claims to interpretability questionable [Rud19S].
Statistical Methods
Statistical post-hoc methods offer valuable insights into the relationship between features and a model’s output in AI, encompassing various techniques [Mol22I, Kam21E]:
- Partial Dependence (PD) Plots: Show the average effect of a feature on the model’s prediction, giving a global view of feature importance and its impact on the model’s output. They are easy to understand and widely applicable, but assume feature independence, which may not reflect true model behavior in the presence of strong feature interactions. 
- Individual Conditional Expectation (ICE) Plots: Extend PD plots by detailing the relationship between a feature and the outcome for each instance. ICE plots offer a nuanced, instance-level view of model behavior, highlighting variations missed by PD plots. However, they can become cluttered with many instances, making interpretation challenging. 
- Accumulated Local Effects (ALE) Plots: Concentrate on local prediction changes, aggregating feature effects over small intervals. ALE plots are less biased than PD plots when features are correlated and are less computationally demanding. Yet, their interpretation remains difficult when features are very strongly correlated, and they can be complex for non-experts to interpret. 
These methods collectively enhance the understanding of complex AI models by illuminating how changes in features influence predictions, each with its unique advantages and limitations.
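The relationship between ICE and PD plots can be sketched in a few lines. The model, the data, and the grid below are hypothetical; the only point is that a PD curve is the pointwise average of the ICE curves.

```python
def model(x):
    """Hypothetical black-box model with an interaction between x[0] and x[1]."""
    return 2.0 * x[0] + x[0] * x[1]

data = [(0.0, -1.0), (1.0, 1.0), (2.0, -1.0), (3.0, 1.0)]
grid = [0.0, 1.0, 2.0]

def ice_curves(model, data, feature, grid):
    """One curve per instance: vary `feature` over `grid`, keep the rest fixed."""
    curves = []
    for row in data:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value
            curve.append(model(modified))
        curves.append(curve)
    return curves

def partial_dependence(model, data, feature, grid):
    """PD curve: the pointwise average of the ICE curves."""
    curves = ice_curves(model, data, feature, grid)
    return [sum(column) / len(curves) for column in zip(*curves)]

ice = ice_curves(model, data, 0, grid)
pd_curve = partial_dependence(model, data, 0, grid)
print(ice)       # slopes of 1 and 3, depending on the value of x[1]
print(pd_curve)  # [0.0, 2.0, 4.0], an average slope of 2
```

The PD curve suggests a single slope of 2, while the ICE curves reveal that the effect of the first feature differs between instances because of the interaction. This is the kind of variation that averaging hides.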
Concept Based Methods
These methods build on the idea that deep learning models recognize high-level concepts while processing an input. Concept-based methods try to identify the recognized concepts and their influence on the model’s output.
- Concept Activation Vectors (CAVs) [Kim18I]: A CAV is obtained by fitting a linear classifier on the activations of one of the model’s hidden layers to separate examples of a concept from other examples; the vector normal to the decision boundary then represents the concept in activation space. Their effectiveness depends on the number and quality of the concept-labeled examples available. Since curating concepts is a manual process, it can be time-consuming and subjective. Therefore, CAVs are often trained on a small subset of the data, which can lead to inaccurate explanations.
- Concept Bottleneck Models [Koh20C]: Explicitly structure the model to first predict human-understandable concepts from the input data, and then use these concepts to make the final prediction. This approach divides the prediction process into two stages, with the first stage focusing on identifying interpretable concepts that are directly used in the second stage to make predictions. The interpretability comes from the model’s architecture, which is designed to map inputs to concepts and then concepts to outputs.
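A minimal sketch of the CAV idea follows. The two-dimensional activations and the gradient of the class logit are hypothetical placeholders; in practice both would be extracted from a trained network, and the linear classifier would usually come from a library rather than from hand-written gradient descent.

```python
import math

# Hypothetical 2-D hidden-layer activations.
concept_acts = [(2.0, 0.1), (2.2, -0.1), (1.8, 0.0), (2.1, 0.2)]  # concept present
random_acts = [(0.0, 0.1), (0.2, -0.2), (-0.1, 0.0), (0.1, 0.1)]  # random examples

def fit_logistic(pos, neg, lr=0.5, epochs=500):
    """Gradient descent for a linear classifier separating pos from neg."""
    w, b = [0.0, 0.0], 0.0
    samples = [(a, 1.0) for a in pos] + [(a, 0.0) for a in neg]
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for a, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w[0] * a[0] + w[1] * a[1] + b)))
            gw[0] += (p - y) * a[0]
            gw[1] += (p - y) * a[1]
            gb += p - y
        w = [w[i] - lr * gw[i] / len(samples) for i in range(2)]
        b -= lr * gb / len(samples)
    return w

w = fit_logistic(concept_acts, random_acts)
norm = math.hypot(w[0], w[1])
cav = (w[0] / norm, w[1] / norm)  # unit normal of the decision boundary

# Hypothetical gradient of a class logit with respect to the activations.
logit_gradient = (1.5, -0.5)
# Directional derivative of the logit along the concept direction.
sensitivity = logit_gradient[0] * cav[0] + logit_gradient[1] * cav[1]
print(cav, sensitivity)
```

A positive sensitivity indicates that moving the activations in the direction of the concept increases the class logit. With only four examples per group, as here, the resulting direction depends heavily on which examples were chosen, which mirrors the data-quality caveat mentioned above.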
Game Theoretic Methods
Game theoretic approaches in AI interpretability treat input features as players in a cooperative game, offering unique insights:
- Shapley Values [Lun17U, Cov20U, Mer19E]: Derived from cooperative game theory, Shapley values distribute a model’s output among its features based on their contribution. This allocation provides a fair understanding of each feature’s impact on the model’s decision. However, they can be computationally heavy for models with many features and often assume feature independence, which may not always be accurate in complex real-world data, leading to possible misinterpretations [Kum20P, Tau23M]. 
- Least Core [Yan21I]: This concept, also from cooperative game theory, focuses on the stability of the coalition of features. It looks for an allocation of the model’s output among the features such that every subset of features receives at least the value it would achieve on its own, up to the smallest possible slack. The application of the least core can be complex and computationally intensive in high-dimensional models, since the number of coalitions to consider grows exponentially with the number of features. Like Shapley values, the least core may not fully account for feature interactions, potentially oversimplifying the picture in models with interconnected features. 
These game theoretic methods provide a framework for understanding and interpreting the contributions and sensitivities of features within AI models, each with its own set of challenges and computational considerations.
Our library for data valuation implements these and many other techniques for the valuation of training points, but it can also be used for feature attribution.
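For a handful of features, Shapley values can be computed exactly by enumerating all coalitions. The sketch below does this for a hypothetical model; replacing absent features by a fixed baseline is one common choice among several for defining the value of a coalition, and it is an assumption of this example. It is not meant to reflect the API of any particular library.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Hypothetical model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + x[1] + x[0] * x[1] + 0.0 * x[2]

def value(coalition, x, baseline):
    """Model output when features outside the coalition are set to the baseline."""
    mixed = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return model(mixed)

def shapley_values(x, baseline):
    """Exact Shapley values by enumerating all coalitions (exponential in n)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = value(s | {i}, x, baseline) - value(s, x, baseline)
                phi[i] += weight * gain
    return phi

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(x, baseline)
print(phi)  # approximately [3.0, 3.0, 0.0]
```

The attributions sum to the difference between the model output at `x` and at the baseline (6.0 here), the interaction term is split evenly between the two features involved, and the unused third feature receives zero. The nested loops also make the computational cost visible: the number of coalitions doubles with every additional feature.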
Gradient Based Methods
Saliency maps and Integrated Gradients are vital tools in AI for visualizing and understanding influential features in predictions, especially in image processing and deep learning:
- Saliency Maps [Sim13D]: These are visual tools highlighting influential areas in inputs like images, indicating each pixel’s contribution to the final decision by computing the gradient of the output relative to the input. While saliency maps offer intuitive and direct visual interpretation of complex models such as deep neural networks, they can produce noisy or challenging-to-interpret results, particularly in highly abstract models or with complex inputs. 
- Integrated Gradients [Sun17A]: This technique enhances saliency maps by accumulating gradients from a baseline input to the actual input, providing a more comprehensive and detailed view of feature importance. Integrated Gradients offer consistent and detailed feature attribution, especially suited for deep learning models with non-linear behaviors. However, they are computationally intensive, require careful baseline selection for meaningful interpretations, and are most applicable to models with definable and informative gradients. 
Both methods significantly contribute to the interpretability of complex AI models by providing a clearer understanding of feature influence in predictions.
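The difference between a plain gradient and Integrated Gradients can be sketched on a small differentiable function. The function, its hand-written gradient, the all-zero baseline, and the number of integration steps are assumptions made for illustration; with a real network the gradient would come from an automatic differentiation framework.

```python
import math

def f(x):
    """Hypothetical differentiable model: f(x) = tanh(x0) + x0 * x1."""
    return math.tanh(x[0]) + x[0] * x[1]

def grad_f(x):
    """Analytic gradient of f; an autodiff framework would normally supply this."""
    return [1.0 - math.tanh(x[0]) ** 2 + x[1], x[0]]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann sum of the gradient along the straight path to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0]
baseline = [0.0, 0.0]
saliency = grad_f(x)                              # plain gradient at the input
attributions = integrated_gradients(x, baseline)  # accumulated along the path
print(saliency, attributions)
print(sum(attributions), f(x) - f(baseline))      # nearly equal
```

The attributions add up, to within the integration error, to the difference between the output at the input and at the baseline, a property the plain gradient does not have. The loop over `steps` is also where the extra computational cost comes from, and a different baseline would lead to different attributions.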
Interpretable Surrogate Models
(Local) surrogate models, particularly LIME (Local Interpretable Model-agnostic Explanations [Rib16W]), are key in machine learning for interpreting complex models by simplifying their predictions:
- LIME: Generates individual prediction explanations by forming a simpler model for each instance. It alters input data, observes prediction changes, and fits a simple model (like linear regression) to approximate the complex model’s specific behavior. Advantages include its model-agnostic nature, applicable to any machine learning model, and providing intuitive, feature-focused explanations. However, its local explanations may not always align with the complex model’s global behavior, and its effectiveness depends on the choice of the simpler model and perturbation strategy.
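The perturb, weight, and fit steps described above can be sketched as follows. This is a simplified illustration of the idea and not the implementation of the LIME package: the black-box function, the Gaussian perturbations, the kernel width, and all helper names are assumptions made for this example.

```python
import math
import random

def black_box(x):
    """Hypothetical non-linear model to be explained."""
    return x[0] ** 2 + 3.0 * x[1]

def solve(a, b):
    """Solve the linear system a @ z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def local_surrogate(model, x, n_samples=500, scale=0.1, width=0.25, seed=0):
    """Fit a proximity-weighted linear model around x: [intercept, w0, w1, ...]."""
    rng = random.Random(seed)
    d = len(x) + 1
    xtwx = [[0.0] * d for _ in range(d)]
    xtwy = [0.0] * d
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]   # perturbed input
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        w = math.exp(-dist2 / width ** 2)              # proximity kernel
        row = [1.0] + z
        y = model(z)
        for i in range(d):
            xtwy[i] += w * row[i] * y
            for j in range(d):
                xtwx[i][j] += w * row[i] * row[j]
    return solve(xtwx, xtwy)                           # weighted least squares

coefficients = local_surrogate(black_box, [1.0, 2.0])
print(coefficients)
```

Around the point (1, 2) the fitted slopes should be close to 2 and 3, the local rates of change of the black box. Explaining a different point, or using a much larger `scale`, would give different slopes for the first feature, which illustrates why local explanations need not agree with the global behavior and why the perturbation strategy matters.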
Training Data Attribution
Training Data Attribution in ML is the analysis of the impact of individual data points on model predictions. It can be useful for model debugging, fairness assessment, evaluating copyright protection claims, or anomaly detection. You can check our library for data valuation for efficient implementations of many of these methods.
- Influence Functions [Koh17U, Bas20I, Fel20W, Fis23I, Fis23S]: These assess how changes in training data, such as removing or altering a data point, affect a model’s output through an approximation process. This avoids the need for constant model retraining, thereby saving computational resources. The advantages include efficiency in estimating training data impact, insights into model behavior by identifying influential data points, enhancement of model robustness and fairness, and increased transparency by linking training data to model output. However, they also present challenges such as complexity in mathematical understanding, interpretation difficulties requiring a strong grasp of the model and data, and approximation limitations that may not fully capture the influence of a data point in complex models. Despite these challenges, influence functions are valuable for assessing and improving the reliability and fairness of machine learning models.
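The idea of approximating the effect of removing a training point without retraining can be sketched on a one-parameter least-squares model, where the exact answer is cheap to compute for comparison. The data and function names are hypothetical, and the per-point loss is assumed to be half the squared error; for this loss the first-order estimate is the gradient of the point's loss divided by n times the Hessian of the mean loss.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.3]  # hypothetical training data, roughly y = x

def fit(xs, ys):
    """Least-squares slope of the one-parameter model y = theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

theta = fit(xs, ys)
n = len(xs)
hessian = sum(x * x for x in xs) / n  # second derivative of the mean loss

def influence_on_theta(i):
    """First-order estimate of the change in theta when point i is removed."""
    grad_i = (theta * xs[i] - ys[i]) * xs[i]  # gradient of point i's loss
    return grad_i / (n * hessian)

def exact_change(i):
    """Ground truth obtained by actually retraining without point i."""
    rest_x = xs[:i] + xs[i + 1:]
    rest_y = ys[:i] + ys[i + 1:]
    return fit(rest_x, rest_y) - theta

estimates = [influence_on_theta(i) for i in range(n)]
exact = [exact_change(i) for i in range(n)]
for i in range(n):
    print(i, round(estimates[i], 5), round(exact[i], 5))
```

The estimates have the same sign as the retrained values and are close for points with small `x`, but they underestimate the effect of the last point, which carries a large share of the curvature. This is a small-scale instance of the approximation limitations mentioned above.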

