Explainability#

Despite remarkable recent gains in the predictive performance of artificial intelligence (AI) models, these models are often regarded as “black boxes”, i.e., models whose prediction mechanisms cannot be understood simply by inspecting their parameters. Explainability in machine learning refers to the ability to understand and articulate how models arrive at their predictions. This is crucial for promoting transparency, trust, and accountability in AI systems. It helps in verifying model behavior, refining models, debugging unexpected behavior, and communicating model decisions to stakeholders.

Feature Importance #

Feature importance is a key approach to analyzing explainability. It assesses the contribution of each feature to the model’s predictions. There are two main types of feature importance: global and local.

Global Feature Importance#

Global feature importance provides insights into the overall model by indicating how much each feature contributes to the model’s predictions across the entire dataset. Two common methods for global feature importance are permutation feature importance and surrogate models.

Permutation Feature Importance: This method measures the importance of a feature as the increase in the model’s prediction error after the feature’s values have been perturbed. Here, the perturbation consists of shuffling the feature’s values across the dataset, which breaks the relationship between the feature and the target. The approach is model-agnostic and intuitive.

A feature is considered more important if shuffling its values leads to a significant increase in the model’s prediction error. Conversely, a feature is considered less important if shuffling its values results in little to no change in the model’s prediction error.

The basic algorithm for permutation feature importance is:

Input: trained model $\hat{f}$, feature matrix $X$, target $y$, error measure $\mathcal{L}(y, \hat{f})$

1. Estimate the original model error $\epsilon_{0}=\mathcal{L}(y, \hat{f}(X))$.

2. For each feature $j$ in $1, \dots, p$:
   - generate a permuted feature matrix $\bar{X}$ by shuffling the values of feature $j$ in $X$
   - estimate the error on the permuted data, $\bar{\epsilon}=\mathcal{L}(y, \hat{f}(\bar{X}))$
   - compute the permutation feature importance as the ratio $\mathcal{F}_{j}=\frac{\bar{\epsilon}}{\epsilon_{0}}$ or the difference $\mathcal{F}_{j}=\bar{\epsilon}-\epsilon_{0}$

3. Sort the features by descending $\mathcal{F}$.

This algorithm follows the implementation proposed by Fisher et al. (2018).

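
The permutation procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: `toy_model`, `mean_squared_error`, the random seeds, and the `n_repeats` averaging over several shuffles are all assumptions added for the example. It uses the difference form of $\mathcal{F}_{j}$, because the ratio form is undefined when the baseline error $\epsilon_{0}$ is zero, as it is for this toy data.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Error measure L(y, f_hat(X)); MSE is an assumed choice for this sketch."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(model, X, y, loss=mean_squared_error, n_repeats=10, seed=0):
    """Return {feature index: mean increase in loss after shuffling that feature}."""
    rng = random.Random(seed)
    baseline = loss(y, [model(row) for row in X])  # epsilon_0
    importances = {}
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only; every other column is left untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = loss(y, [model(row) for row in X_perm])  # epsilon_bar
            increases.append(permuted - baseline)  # difference form of F_j
        importances[j] = sum(increases) / n_repeats
    # Sort features by descending importance.
    return dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))


def toy_model(row):
    """Hypothetical 'trained model': strong on feature 0, weak on 1, ignores 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
for feature, score in scores.items():
    print(f"feature {feature}: {score:.4f}")
```

In this toy setup, feature 2 receives an importance of exactly zero because the model never reads it, while feature 0 dominates the ranking.
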
Surrogate Models: A surrogate model is a simpler, interpretable model (such as a decision tree) trained to approximate the predictions of a more complex machine learning model. By fitting the surrogate to the predictions of the complex model, we can gain insights into the decision-making process of the original model. The goal is a surrogate that balances good accuracy in reproducing the original predictions with interpretability. To obtain a surrogate model we can employ the following steps:
Input: dataset $X$, a black-box model $g$, an interpretable model $f$

1. Select a dataset $X$.

2. Get the predictions of $g$ on $X$.

3. Train $f$ on $X$, using the predictions of $g$ as the target, and get its predictions.

4. Measure how well $f$ replicates the predictions of $g$ (e.g., with R-squared).

5. Interpret the surrogate model.
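
The surrogate steps above can be sketched as follows, using only the Python standard library. Everything named here is an assumption made for illustration: `black_box` is a hypothetical stand-in for $g$, and the interpretable model $f$ is a depth-1 regression tree (a decision stump) chosen because its single rule is easy to read. The R-squared is computed against the predictions of $g$, not against ground-truth labels.

```python
import math
import random


def fit_stump(X, targets):
    """Fit a depth-1 regression tree: one feature, one threshold, two leaf means."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, targets) if row[j] <= threshold]
            right = [t for row, t in zip(X, targets) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    _, j, threshold, left_mean, right_mean = best
    return {"feature": j, "threshold": threshold, "left": left_mean, "right": right_mean}


def predict_stump(stump, row):
    return stump["left"] if row[stump["feature"]] <= stump["threshold"] else stump["right"]


def r_squared(reference, approximation):
    """Share of the variance in `reference` that `approximation` reproduces."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - a) ** 2 for r, a in zip(reference, approximation))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


def black_box(row):
    """Hypothetical complex model g: a steep sigmoid driven mostly by feature 0."""
    return 1.0 / (1.0 + math.exp(-(8.0 * row[0] + 0.3 * row[1])))


rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]  # step 1
g_predictions = [black_box(row) for row in X]                       # step 2
stump = fit_stump(X, g_predictions)                                 # step 3
f_predictions = [predict_stump(stump, row) for row in X]
fidelity = r_squared(g_predictions, f_predictions)                  # step 4

# Step 5: the surrogate is a single readable rule.
print(f"if x[{stump['feature']}] <= {stump['threshold']:.3f}: "
      f"predict {stump['left']:.3f} else {stump['right']:.3f}")
print(f"R-squared against the black-box predictions: {fidelity:.3f}")
```

A high R-squared here only says that the surrogate mimics the black box well on $X$; it says nothing about how accurate either model is on the real target.
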

Local Feature Importance#

Local feature importance focuses on understanding the contribution of features to individual predictions. Two popular methods for local feature importance are SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations).

SHAP: provides a unified measure of feature importance for individual predictions by computing the contribution of each feature to the prediction. This method is based on cooperative game theory and ensures fair attribution of feature importance.

SHAP presents a unified framework for interpreting a model’s predictions. It assigns each feature an importance value for a particular prediction. The paper that introduced SHAP also shows that game-theoretic results guaranteeing a unique solution apply to additive feature attribution methods, and that SHAP is a solution with some desired properties: (1) local accuracy, (2) missingness, and (3) consistency.

Additive feature attribution methods have an explanation model $g$ that is a linear function of binary variables:

\[g(z') = \phi_{0} + \sum\limits_{i=1}^{M}\phi_{i}z'_{i}\]

where $z'\in \{0,1\}^{M}$, $M$ is the number of simplified input features, and $\phi_{i}\in\mathbb{R}$.

Methods that follow the previous equation attribute an effect $\phi_{i}$ to each feature, and summing the effects of all feature attributions approximates the output $f(x)$ of the original model. Several methods match this definition, including SHAP and LIME. However, the paper argues that SHAP is the only method that follows the equation and satisfies desired properties 1, 2, and 3.
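
As a small worked illustration of the additive form, take made-up values (chosen only for the arithmetic, not taken from any model): $M=3$, $\phi_{0}=0.40$, $\phi_{1}=0.25$, $\phi_{2}=-0.10$, $\phi_{3}=0.05$, and a simplified input $z'=(1,1,0)$ in which the third feature is absent.

\[g(z') = 0.40 + 0.25\cdot 1 + (-0.10)\cdot 1 + 0.05\cdot 0 = 0.55\]

The absent feature contributes nothing, and the present features shift the base value $\phi_{0}$ up or down by their attributions.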

To achieve this, SHAP uses Shapley values, a concept from cooperative game theory. We can define Shapley values as

\[\phi_{i}(f, x) = \sum_{z'\subseteq x'} \frac{|z'|!(M-|z'|-1)!}{M!}\left[f_{x}(z')-f_{x}(z'\setminus i)\right]\]

where $|z'|$ is the number of non-zero entries in $z'$, $z'\subseteq x'$ represents all $z'$ vectors whose non-zero entries are a subset of the non-zero entries in $x'$, and $z'\setminus i$ denotes $z'$ with $z'_{i}=0$.

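
For a handful of features, Shapley values can be computed exactly by enumerating every subset. The sketch below uses the standard subset form of the Shapley value, summing over subsets $S$ that exclude feature $i$ and comparing the model with and without $i$. It rests on two assumptions that are not part of the text above: `toy_model` is a hypothetical model, and a "missing" feature is simulated by replacing it with a fixed `baseline` value, which is only one of several ways to define $f_{x}(z')$. The cost grows as $2^{M}$, so this is for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets (cost grows as 2^M)."""
    M = len(x)

    def value(subset):
        # f_x(z'): features in `subset` keep their value from x,
        # all other features are set to the baseline (an assumed convention).
        return model([x[k] if k in subset else baseline[k] for k in range(M)])

    phi = []
    for i in range(M):
        others = [k for k in range(M) if k != i]
        total = 0.0
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to subset s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(row):
    """Hypothetical model with two linear terms and one interaction."""
    return 2.0 * row[0] - 1.0 * row[1] + row[0] * row[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(toy_model, x, baseline)
phi_0 = toy_model(baseline)  # base value: the output with every feature missing

print("phi:", phi)
print("phi_0 + sum(phi):", phi_0 + sum(phi), " f(x):", toy_model(x))
```

The interaction term `row[0] * row[2]` is split equally between features 0 and 2, and the attributions plus the base value add up to the model output, which is the local accuracy property.
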
LIME: explains individual predictions by approximating the complex model with an interpretable model locally around the prediction. By perturbing the input data and observing the changes in predictions, LIME identifies the most influential features for that specific instance.

The main goal of LIME is to provide an explanation method that can be applied to any classifier or regression model. In this context, explaining means presenting visual or textual artifacts that provide a qualitative understanding of the relationship between the instance’s components and the model’s predictions.

We can define the explanations produced by LIME as

\[\mathcal{E}(x) = \arg\min\limits_{g\in G} ~\mathcal{L}(f, g, \pi_{x})+\Omega (g)\]

So, the explanation $\mathcal{E}$ of a given instance $x$ is the model $g$, drawn from a class of interpretable models $G$, that minimizes the loss $\mathcal{L}$ while keeping the complexity $\Omega (g)$ low enough to be interpretable by humans. Here, $\mathcal{L}(f,g, \pi_{x})$ measures how unfaithful the model $g$ is in approximating the model $f$ being explained in the locality defined by $\pi_{x}$.
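
The objective above can be illustrated with a LIME-style sketch for numeric features, written with the Python standard library only. It is not the LIME library’s implementation. The assumptions are: `black_box` is a hypothetical model $f$, perturbations are drawn from a Gaussian around $x$, $\pi_{x}$ is an exponential kernel on squared Euclidean distance, and $G$ is restricted to linear models, so $\Omega(g)$ is controlled by that restriction and not by an explicit penalty term.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (aug[r][n] - sum(aug[r][c] * w[c] for c in range(r + 1, n))) / aug[r][r]
    return w


def lime_explain(model, x, n_samples=1000, scale=0.5, kernel_width=0.75, seed=0):
    """Fit a locally weighted linear model g around x; return (intercept, coefficients)."""
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]  # perturbed copy of x
        dist_sq = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)                        # leading 1.0 is the intercept column
        targets.append(model(z))                      # black-box prediction f(z)
        weights.append(math.exp(-dist_sq / kernel_width ** 2))  # pi_x(z)
    # Weighted least squares via the normal equations (Z^T W Z) w = Z^T W y,
    # which minimizes sum_z pi_x(z) * (f(z) - g(z))^2.
    A = [[sum(wt * r[a] * r[b] for wt, r in zip(weights, rows)) for b in range(d + 1)]
         for a in range(d + 1)]
    rhs = [sum(wt * r[a] * t for wt, r, t in zip(weights, rows, targets))
           for a in range(d + 1)]
    coef = solve_linear_system(A, rhs)
    return coef[0], coef[1:]


def black_box(row):
    """Hypothetical model f: nonlinear in feature 0, linear in 1, ignores 2."""
    return row[0] ** 2 + 3.0 * row[1]


x = [1.0, 0.0, 0.0]
intercept, coefs = lime_explain(black_box, x)
for j, c in enumerate(coefs):
    print(f"feature {j}: local weight {c:+.3f}")
```

Around this instance the local weights should land close to the local slopes of the black box (about 2 for feature 0, 3 for feature 1, and 0 for feature 2), even though the model is not linear globally.
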

The results reported in the paper indicate that LIME is useful for increasing trust in black-box models and for model selection (avoiding models with good accuracy but with wrong motivations, i.e., models that rely on a priori “nonsensical” features to make predictions).

Measuring and Mitigation #

Metrics
