Article

Open and Extensible Benchmark for Explainable Artificial Intelligence Methods

Faculty of Digital Transformations, ITMO University, Saint Petersburg 197101, Russia
*
Authors to whom correspondence should be addressed.
Algorithms 2025, 18(2), 85; https://doi.org/10.3390/a18020085
Submission received: 29 October 2024 / Revised: 26 January 2025 / Accepted: 28 January 2025 / Published: 5 February 2025
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

Abstract

The interpretability requirement is one of the largest obstacles when deploying machine learning models in various practical fields. Methods of eXplainable Artificial Intelligence (XAI) address this issue. However, the growing number of different solutions in this field creates a demand to assess the quality of explanations and compare them. In recent years, several attempts have been made to consolidate scattered XAI quality assessment methods into a single benchmark. Those attempts usually suffered from a focus on feature importance only, a lack of customization, and the absence of an evaluation framework. In this work, the eXplainable Artificial Intelligence Benchmark (XAIB) is proposed. Compared to existing benchmarks, XAIB is more universal and extensible and has a complete evaluation ontology in the form of the Co-12 Framework. Due to its special modular design, it is easy to add new datasets, models, explainers, and quality metrics. Furthermore, an additional abstraction layer built with an inversion of control principle makes them easier to use. The benchmark will contribute to artificial intelligence research by providing a platform for evaluation experiments and, at the same time, will contribute to engineering by providing a way to compare explainers using custom datasets and machine learning models, which brings evaluation closer to practice.

1. Introduction

Approaches based on machine learning (ML) models are incorporated into an increasing number of fields, replacing or augmenting traditional approaches. As the size of the models increases over time to tackle more complex tasks, it becomes harder to understand the outcomes. In addition, there is an increasing interest in systems that satisfy not only the accuracy criterion but also a set of additional criteria such as fairness, safety, or providing the right to explanation [1]. For some areas, satisfying those criteria is a dealbreaker when it comes to the adoption of an ML system [2,3,4].
To address these issues, eXplainable Artificial Intelligence (XAI) has emerged [5]. It encompasses a variety of approaches, from highlighting relevant parts of the input data to showing similar examples or even providing verbal explanations.
While actively expanding the variety of explanation algorithms, the field of XAI is often criticized for its lack of rigor and evaluation standards [5,6]. Given the uncertainty of explainability definitions noted in many recent works [7], it seems necessary to explicitly state which one is used here. This work adopts the definition of Doshi-Velez: interpretability is the ability to explain or present information in understandable terms to a human. To further clarify, for inclusion purposes, no distinction is made between interpretability and explainability, similar to the work of Nauta et al. [8].
The growing number of explanation solutions with different approaches demands a way to compare them. Isolated evaluation-centered papers exist, but they are very limited compared to benchmarks.
In recent years, several attempts have been made to build a benchmark for XAI methods. However, the existing benchmarks have a number of drawbacks: for example, they focus only on feature importance, offer limited or no openness and extensibility, measure a limited set of interpretability properties that is usually not justified explicitly, rely on ground-truth-based metrics alone, and neglect important documentation, versioning, and software distribution aspects.
In this work, a new XAI Benchmark (XAIB) is proposed (https://oxid15.github.io/xai-benchmark/index.html (accessed on 29 October 2024)). It was designed with the diversity of the XAI landscape in mind: for generality, it was important to be able to include different explanation types, models, and datasets. At the start, it already features two different explanation types, as well as various datasets, models, and explainers that can be used in many combinations. Special efforts were made to make it easy to use and extend. It features thin, high-level interfaces for easier setup while not obfuscating the internals, opening itself for customization and extension. Finally, its XAI evaluation system is based on a comprehensive framework of 12 properties, ensuring the completeness of the evaluation for every explainer type.

2. Related Work

Before describing the proposed benchmark, the context of the current state of XAI, XAI evaluation, and benchmarks is needed. First, explanation types will be briefly introduced and evaluation works will be considered in general to provide an overview of the XAI evaluation history. In the end, existing XAI benchmarks will be described.

2.1. Explanation Types

This section aims to give the reader the context for different types of XAI methods and the terminology that will be used further. Various taxonomies of XAI methods exist. In this work, the system that allows highlighting properties important for the evaluation of different explanations is used. A similar system was employed in the work of Bodria et al. [9]. The terminology that is important for this work is “feature importance methods” and “example selection methods”.
Relative to the work mentioned, the term “feature importance” is used more generally here and describes all methods that highlight important information in the input in order to explain the output. In most cases, however, the usage coincides with that of the referenced work, since tabular feature importance seems to be the most popular explanation method. Methods of this type can also be called “saliency methods” or “feature attribution”.
The term “example selection” is the same as “prototypes”. It was specifically chosen to contrast with “example generation”, since prototypes can also be artificial, and a direct comparison between the two seems unreasonable due to the different nature of the examples themselves.
There are also counterfactual explanations that involve examples, concept attribution, and the set of intrinsically interpretable methods. They are worth mentioning to demonstrate the variety of approaches that the XAI field incorporates.

2.2. XAI Evaluation

The earliest XAI evaluation works were mostly focused on user studies [10]. The first definitions of explanation properties appeared, such as soundness and completeness [11]. The overall shift in this field was towards finding more ways to measure the quality of the explanation numerically. This was done, for example, using sanity checks [12] or measuring how well humans can learn prediction patterns using explanations [13]. More properties appeared, such as bias, commonness, and robustness [14], or faithfulness, sensitivity, and complexity [15]. The lack of common terms and frameworks can be seen even in the use of different names for properties or slightly different mathematical formulations of them. This inconsistency has led to attempts to generalize, as, for example, in the work of Sokol and Flach, where Explainability Fact Sheets were proposed [16]. Those serve as a way of creating a common taxonomy around XAI methods, structuring what is important about each method in particular. Another example is the work of Nauta et al. [8], which proposes a complete evaluation scheme for XAI methods, considering interpretability as a multifaceted concept. They propose a rather complete set of 12 properties as a guide that future researchers in this field can follow when categorizing their metrics.
Although XAI cannot be characterized as a novel subject, there is still a lack of a common language and an evaluation basis. The XAI evaluation landscape is important for the context, but it is not as relevant as attempts to build a benchmark for XAI methods. The main difference seems to stem from the end user’s viewpoint. Usually, metrics in XAI evaluation papers stay on paper, or when they are open source, they are scattered in different repositories, making it difficult to try different metrics on the same method. In this case, benchmarks serve as the middle ground between various research papers and practitioners or researchers trying to evaluate their XAI method against already existing ones.

2.3. Benchmarks

For this overview, the projects were selected on the basis of several criteria. For one, each of them explicitly articulates that it constitutes a benchmark for explainable AI methods. The second important criterion was open source code. Openness is considered an important aspect of a benchmark. Being open source should help reach end users and become widely employed. The ensuing overview will follow a chronological order, as in previous sections, despite the fact that benchmarks cannot claim a long history. All projects in this overview were published between 2020 and 2022.
The authors of “A Diagnostic Study of Explainability Techniques for Text Classification” propose several diagnostic properties that measure different aspects of interpretability [17]. Mainly, they focus on text data and the saliency methods used for it. Their metrics are partially application-grounded, using the agreement with human annotations of token importance; however, four out of five of them are functionally grounded. The methods used are not model-agnostic because they explicitly extract and compare the weights of a neural network model. As metrics or diagnostic properties, they use agreement with human rationales in importance scores for the sentiment classification task.
This work lies somewhere on the boundary between benchmark papers and regular XAI evaluation papers. Regarding explainer types, this work is focused exclusively on feature importance in text classification. Due to the highly specific nature of experiments, they were not intended to be extended or customized, which impedes further development. Although the proposed metrics are valid, the particular set and its completeness are not justified in the paper. It features a table with the evaluation results and visualizations, but since it all resides inside a paper, this information cannot be updated.
Liu et al. propose an approach that is based on synthetic datasets, since real-world data require human annotation when ground-truth importance is needed [18]. When creating synthetic datasets, researchers can manipulate them so as to know which features are important for the outcome and to what extent. In their work, they mainly use the order of the features when sorted by their importance scores, since it is difficult to compare different scores and set the exact ground-truth value. The authors use their generated datasets to compute several popular metrics such as correctness, faithfulness, monotonicity, and remove-and-retrain (ROAR). XAI-Bench is the name of the project published along with the paper; it contains the source code and instructions on how to reproduce the results of the paper.
The project is focused on classification and feature importance methods and seemingly does not show any intention of growing beyond that. The metrics employed there are known, although the choice of metrics was not explicitly discussed. This project is very similar to the evaluation papers in the aspect of further development; documentation on how to contribute seems missing, as do the instructions on running the user’s own setup. It also seems to lack versioning, making it difficult to track how changes in the implementation of evaluation procedures influence the results in the final table.
Agarwal et al. propose OpenXAI, a benchmark for functionally grounded evaluation of XAI methods with a leaderboard that is available and updatable online [19]. The authors propose 22 metrics measuring the different properties of explainers and argue that their solution can be applied to several modalities and extended by adding new models, metrics, and datasets.
The main focus of the project is classification and feature importance, which are built into the benchmark architecture. The results of the evaluation are presented and available online, along with the documentation. The documentation itself is clear, with instructions on how to use dataloaders, models, and explainers. However, it does not provide any details on how to use or include one's own entity in the evaluation process. Although the benchmark features a considerable set of metrics divided into three major property groups (faithfulness, stability, and fairness), the authors do not explicitly address the question of completeness or justify their particular choice. The project provides clear versioning, which enables the meaningful tracking of changes in the evaluation process.
Belaid et al. in the work “Do We Need Another Explainable AI Method?...” propose a benchmarking solution based on the principles of software testing [20]. The authors argue that it is hard to compare different methods against a large number of metrics. Although they conduct a lot of tests, they also compile all the results into one metric using hierarchical scoring. The set of tests is diverse; all of them constitute six properties as follows: fidelity, fragility, stability, simplicity, stress, and other. In the end, each method gets an overall score, which allows for plotting those scores against the speed of inference. The authors argue that it provides ease of comprehension for the evaluation results.
The benchmark works only with feature importance methods and only evaluates classification tasks. All tests are hard-coded; therefore, the user is provided with only abstract information on the quality and cannot be sure if some method suits their own setup. The evaluation exhibits a variety of properties; however, the authors seemingly have not provided a rationale for their selection or a statement regarding the completeness of the set.
The documentation is provided in conjunction with the contributing instructions that lay the foundation for future development. However, the code does not provide any versioning information.
In the work of Li et al., another benchmarking solution was proposed [21]. The project, named M4, provides faithfulness evaluation for image and text modalities.
As mentioned in the title of the paper, the only property measured is faithfulness. The authors implemented five metrics which were combined into a single averaged score of faithfulness. The solution they propose features two tasks of different modalities: image classification on ImageNet and sentiment analysis on the MovieReview dataset. The solution was implemented using InterpretDL, which is built on PaddlePaddle, a deep learning framework with a modular design. This decision allows for potentially expanding the benchmark by adding more models and explainers. However, the expansion seems to be limited to models based on neural networks and feature attribution methods. The project itself does not appear to be built like a traditional benchmark. The results are not available outside the paper; the leaderboard of methods does not exist online. The documentation is accessible since M4 is available as part of the InterpretDL package. It covers the usage of metrics but lacks documentation on using custom models or datasets with it. Although InterpretDL is not a benchmarking solution, it provides the metrics described in the work, which means that the metrics can be distributed through it. It is available for installation on PyPI and provides semantic versioning.
In summarizing the overview, some common issues across modern benchmarks can be identified. The field is developing and the latest active projects are very recent.
Every benchmark described is focused solely on feature importance and mostly on classification tasks. Most of the benchmarks have very limited extensibility; their results were either recorded on some hardcoded tests or on some specific setups. The set of metrics and their completeness are usually not discussed explicitly, and they are not designed to be extended. Most of the projects also lack versioning and comprehensive documentation.

3. Benchmark Structure

The main idea of the XAIB design is to push the trade-off between extensibility and ease of use. If the system is very extensible and customizable, it tends to overwhelm new users. In the opposite scenario, when it is very easy to use, it usually hides a lot of complexity under the hood, which can hinder customization for users with in-depth knowledge.
Since simplicity and customization create a trade-off, they cannot be significantly increased at the same time. Taking into account the current state of the XAI field, the XAIB prioritizes simplicity for the user first. The popularity of XAI and interpretability awareness in general are still not very high, so configuration complexity may overwhelm many users who are not familiar with XAI or with programming in general [22]. This is the motivation for making the XAIB easier to use while allowing for deep enough extensibility to include various explanation methods in one benchmark.
This section provides an overview of high-level entities, the XAIB modules, and its dependencies. Then, it focuses on how extensibility and consistency are ensured. The final subsections offer the categorization of potential users, their needs, and the use cases of the benchmark.

3.1. High-Level Entities

In order to name entities that may be required for XAI evaluation, a typical workflow for this task should be summarized. It starts with the data, which are required to train the model, if training is part of the evaluation process, and also to evaluate it. A machine learning model and a dataset are then passed to the explainer to extract explanations. Those are, in turn, used to evaluate the explainer using a set of quality metrics. After the computation, the metric values should be stored for further analysis or visualization.
This is the general outline of an XAI evaluation experiment. Figure 1 shows how it is handled in the proposed benchmark. According to this description, each experiment requires the handling of data, models, explainers, metrics, and experiments.

3.1.1. Data

The dataset is the first object that is required for the evaluation of any explainer, whether that is a training set used to fit a model or an evaluation set used to obtain explanations. This is why data handling is the first thing that is necessary when building a benchmark. In the XAIB, data are represented by a special dataset class. Datasets in the XAIB were built with the diversity of the XAI landscape in mind. They represent an access point for any data source and encapsulate data loading and access while providing a unified interface. Each dataset is a sequence of data points, and each point is represented by a Python dictionary. The reason for this is that datasets in machine learning are usually complex, with different layers of labeling, and they can also be multimodal. Dictionaries enable handling different data channels, for example, the features and labels, without the risk of confusion because each channel has a unique dictionary key. This configuration also enables one to build automatic data validation on top.
As an example, some datasets from the sklearn library were included in the benchmark. Since toy datasets are small, they are loaded into memory at the initialization of a dataset object and can then be accessed by index. For the classification task, the output object is a dictionary with the keys item and label.
The interface does not place strong requirements on how data should be stored or loaded, enabling us to include a variety of datasets into the benchmark and to access them in the same way, which is an important aspect of the extensibility of the benchmark.
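To make the interface description concrete, the following is a minimal sketch of a dictionary-based dataset in this spirit. The class name and fields are illustrative simplifications and not the exact XAIB implementation.
    import numpy as np

    class InMemoryDataset:
        # Loads a small table into memory and serves points as dictionaries
        def __init__(self, features, labels):
            self.features = features
            self.labels = labels

        def __len__(self):
            return len(self.features)

        def __getitem__(self, index):
            # Each data channel gets its own key, which avoids confusion
            # between features and labels and allows validation on top
            return {"item": self.features[index], "label": self.labels[index]}

    ds = InMemoryDataset(np.random.rand(10, 4), np.random.randint(0, 2, size=10))
    print(ds[0]["item"], ds[0]["label"])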

3.1.2. Models

To explain the model’s predictions, the model itself is required. Models need special care since they have a multitude of requirements. In the evaluation workflow, they may need to be trained, evaluated, their state saved for later, and loaded. All of this should be handled in the same fashion, regardless of the model type. For handling models, the XAIB features the model class. The model encapsulates all that is needed for inference and training, if that is required. It is a wrapper around an inference method built to be adapted to any backend. Models can be trained, inferred, evaluated, saved, and loaded in the same fashion. The interface is similar to the widely known sklearn library, which should simplify the work with models for users who are already familiar with that package. As with the data, the interface is general enough to allow various types of models to be included. Datasets require a special format of output, such as a dictionary. Models do not have such requirements. However, explainers can have input requirements and not every model is compatible with every explainer, at least by input type.
As with the datasets, the default wrapper for the sklearn models was implemented. It manages the training and evaluation of models while maintaining a common interface.
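As an illustration, such a wrapper could look like the following minimal sketch; the class name, backend choice, and method set here are assumptions made for the example rather than the exact default wrapper.
    from sklearn.svm import SVC

    class SklearnModel:
        # Wraps an sklearn estimator behind a common train/predict/evaluate interface
        def __init__(self, **params):
            self._clf = SVC(**params)

        def fit(self, x, y):
            self._clf.fit(x, y)
            return self

        def predict(self, x):
            return self._clf.predict(x)

        def evaluate(self, x, y):
            # Accuracy is used here only as an example of a common evaluation call
            return self._clf.score(x, y)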

3.1.3. Explainers

Explainers are built similarly to the models. They include similar methods since their evaluation workflow is usually the same. Only one assumption was made as follows: the model is required for inference by default. This is not required for the descendants of the explainer class but was made for consistency since, usually, explainers require the model object at the time of obtaining explanations.
The general idea on which the XAIB is built is the self-sufficiency of entities. In the context of an explainer, this means that it should accept all configuration parameters on initialization and then be able to explain model predictions when given only the model and the data. This abstraction enables the unified handling of different entities within the benchmark. All models or explainers that comply with this can be treated the same within a single category that sets the input and output requirements. The outputs of feature importance and example-based approaches will certainly differ, for example.

3.1.4. Metrics

The evaluation is handled using a set of metrics. Unlike other similar benchmarks, the XAIB treats metrics as separate objects with their own metadata and not just as functions. Metrics are handled like this because, in the XAIB, they are not just functions but have their own states. They store a creation time, name, value, direction (whether higher or lower values are better, denoted by “up” and “down” in the library), and references to the dataset and model they were computed on.
A metric is a complex object that has three major roles in the XAIB. Its primary role is a value storage. Metrics store a single scalar value that is considered their main value. Metrics are also functions. They encapsulate the way they compute their values, similar to explainers and models. After configuration, they should be able to be computed. Furthermore, the last role is metadata management. The metrics record a dataset and a model that were used for the calculation. This information is then used to trace values back to their setups and assess the quality of explanations in different cases.
Since a metric's main value can only be a scalar, the XAIB also features fields like interval and extra to make metric design more flexible. The interval field can be used if a confidence interval was computed, so it can store the upper and lower boundaries of a value. The extra field is made to be adjustable for more complex scenarios; it is a dictionary for additional metadata or supplementary values without format requirements. These measures should allow the metric class to flexibly wrap more complex metrics from the field.
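As an illustration, a metric could be sketched as a stateful object like the one below; the field names mirror the description above, while the class itself is a simplified assumption rather than the exact XAIB base class.
    import datetime

    class Metric:
        def __init__(self, name, direction, ds_name, model_name):
            self.name = name
            self.direction = direction        # "up" or "down"
            self.created_at = datetime.datetime.now()
            self.dataset = ds_name            # reference for traceability
            self.model = model_name
            self.value = None                 # the main scalar value
            self.interval = None              # optional confidence interval bounds
            self.extra = {}                   # additional metadata, no format requirements

        def compute(self, explainer, batch):
            # Concrete metrics encapsulate their computation here
            raise NotImplementedError()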

3.1.5. Cases

The XAIB evaluation framework is based on the Co-12 properties proposed in the work of Nauta et al. [8]. Their work is an attempt to gather and systematize the aspects on which XAI methods could be evaluated. They argue that interpretability is not a binary property and should not be validated with anecdotal cases alone. Thus, they propose the framework of 12 properties, where each property represents some desired quality of the method. Namely, these are as follows: Correctness, Completeness, Consistency, Continuity, Contrastivity, Covariate Complexity, Compactness, Composition, Confidence, Context, Coherence, and Controllability. This framework of properties was chosen mainly for its completeness; it is likely to cover most of the ways in which one can measure the quality of explanations.
To implement the idea of a property that can also be measured in one or more ways, the benchmark features cases. Each case is a representation of one of the properties. If a metric is implemented in the benchmark, it should be attached to a case to signify that it measures a property. Cases function not only as containers for metrics but also as evaluation units. They are the place where the created metrics are evaluated. The benchmark has a collection of predefined cases for the Co-12 properties that already feature corresponding metrics, but cases are not limited to that. Using the add_metric methods, users can create custom evaluation units or extend existing ones.
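A case can then be pictured as in the following sketch: a container of metrics for one property that also acts as an evaluation unit. Apart from the add_metric method named above, the class is a simplified illustration rather than the actual implementation.
    class Case:
        # Groups metrics that measure one Co-12 property and evaluates them together
        def __init__(self, name, ds, model):
            self.name = name
            self.ds = ds
            self.model = model
            self.metrics = {}

        def add_metric(self, name, metric):
            self.metrics[name] = metric

        def evaluate(self, explainer, **kwargs):
            # Compute every attached metric for the given explainer
            results = {}
            for metric_name, metric in self.metrics.items():
                results[metric_name] = metric.compute(explainer, **kwargs)
            return results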

3.1.6. Factories

There are also special interfaces that help users perform basic operations without having to dive deep into the documentation or implementation details of what they are using. Those interfaces are factories, setups and experiments. They are the tools that can shift the trade-off between complexity and ease of use.
Factories, as the name suggests, provide a uniform interface for creating XAIB entities. They are used to create datasets, models, explainers, and cases without having to input any parameters except their names.
The main use case for this entity is the use of default factories. They can be found in the benchmark evaluation module, which is filled with already parameterized constructors of datasets and models. Since every configuration is already made, it is easy to create a default instance using a factory.
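The idea can be summarized with the following simplified sketch, where names are mapped to already parameterized constructors; the classes here are hypothetical placeholders rather than the actual XAIB factories.
    class WineDataset:
        # Hypothetical placeholder for an already implemented dataset wrapper
        def __init__(self, split="train"):
            self.split = split

    class DatasetFactory:
        def __init__(self):
            # Every entry is a fully parameterized default constructor
            self._constructors = {"wine": lambda: WineDataset(split="train")}

        def get(self, name):
            return self._constructors[name]()

    ds = DatasetFactory().get("wine")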

3.1.7. Experiments

There are no special objects to handle the experiments themselves, but there are default procedures that take a case, explainers, and other parameters, run evaluations, and save the results to the disk. Similarly to factories, they represent a default experiment run. The XAIB entities can be used without such utility; however, it is deemed helpful for users to easily compute and save evaluation results.
In its internal workflow, it is a decorator that wraps a case. It accepts a folder to store experiment results, a list of explainers to evaluate, and default arguments for a case's evaluation method, such as batch size. After creating a case, it starts iterating over the given explainers, providing them with the case to evaluate. All metadata for the case are then saved to the provided folder in a structured manner using JSON files. The metadata cover every metric inside the case, and each metric in turn records its dataset, model, and parameters. This traceability feature spans the whole benchmark system and allows entities and their parameters to be tracked, helping demonstrate how they influence the metrics.
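Schematically, this default procedure can be thought of as in the sketch below; the function and the get_meta call are illustrative assumptions, not the exact XAIB code.
    import json
    import os

    def run_experiment(case, explainers, folder, **eval_kwargs):
        # Evaluate every explainer on the given case and store metadata as JSON
        os.makedirs(folder, exist_ok=True)
        for name, explainer in explainers.items():
            case.evaluate(explainer, **eval_kwargs)
        with open(os.path.join(folder, "meta.json"), "w") as f:
            json.dump(case.get_meta(), f, indent=4)  # hypothetical metadata accessor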

3.1.8. Setups

Evaluation is a very high-dimensional problem where for every entity there are dozens of options, and the need to test all of those combinations creates a combinatorial explosion.
Some cases, in turn, require testing on only a subset of those options. This is where the XAIB setups can be used. This is a convenience class for creating enumerations of all objects included in the benchmark. Factories can contain a large number of defaults, and sometimes it is easier to specify what models, for example, should not be included in the overall evaluation. The setup accepts the factories for datasets, models, explainers, and cases upon creation, as well as the specifications of what entities should be created later. It is not responsible for initialization; rather, it is just a formal way of setting up an experiment. At the time of use, the setup merely returns the names of entities without creating them.
Specifications are set using the keywords datasets, models, explainers, and cases. They can be set using a list of names, enumerating all the things that users need to include. For the default behavior and ease of use, there is also a special value all that frees the user from the need to input all available entities manually. If “all” is used, the setup can also exclude some of the options with the keywords <entity>_except that allow listing the options for each entity.
This interface allows for the flexible customization of the experimentation process and a readable way to specify what is inside each evaluation. Listing 1 is a code excerpt showing how setups can be used in practice.
Listing 1. Setup example.
    factories = (
        DatasetFactory(), ModelFactory(), ExplainerFactory(), CaseFactory()
    )
    setups = [
        Setup(
            *factories,
            datasets=["iris", "digits"],
            models_except=["knn"]
        )
    ]

    for setup in setups:
        for dataset in setup.datasets:
            ...
It is important to note that abstractions around evaluation properties do not take away the user's ability to use lower-level entities. They provide an easier way to work with defaults, and when users want to change default values, they can create entities manually. They were built on the premise that running benchmark evaluations should not be hard for a user who is not familiar with the benchmark and its interfaces for each entity, if that user is only interested in obtaining metric values.

3.2. Modules and Dependencies

Working with different providers without relying on each of them specifically requires a way to deal with dependencies. By default, the XAIB does not require any of the explainer’s or model’s packages. This is purposefully included in the design to make the setup easier for those who do not need to use every explainer or every model. Dependencies, unlike in any other XAI benchmark, are handled per module separately, and if the explainer or model requires some Python packages, they should be installed separately. For complete installation, the XAIB provides a file with all required dependencies that can be installed using one pip install command.
The XAIB Python module is divided into a number of submodules that serve different purposes.
The design of the modules was considered an important part for several reasons. They are mainly used for handling external dependencies when importing and for ensuring clear imports from the user side. The submodules allow defining when an external dependency is imported. The main idea was to isolate submodules so users are not required to install dependencies they do not need, as long as they do not use certain functionalities. There is also import convenience to consider. When using code, users should easily understand the project's structure. This drives choices on module names and their hierarchy. There is a certain trade-off between the two aspects mentioned, and in the case of XAIB, the choice was made in favor of dependency isolation while still trying to build a comprehensible structure.
Most of the modules collect implementations of different entities and group them into one submodule.
There is base, which is a module for every base class. It has no internal dependencies, but almost every submodule depends on it.
There are entity submodules, namely datasets, models, explainers, metrics, and cases. Modules named datasets and models include the default implementations of the corresponding entities.
Modules with the names explainers, metrics, and cases do not include their entities directly, but are divided into submodules by the type of explanation. Currently, there are feature_importance and example_selection sections inside these modules. Other similar modules that collect entities do not need such division since they do not depend on the explanation type.
There is also a special section called evaluation which is not a module in a strict sense but a container for evaluation workflows and everything that helps them. It was also split into submodules for each type of explanation, but it has common tools for each of them. That module contains factories for datasets, models, explainers, and experiments. This is also where setups are implemented.
The modules mentioned create the structure of the benchmark. There are also additional modules in the project for common code, documentation, and tests.

3.3. Extensibility

Extensibility is a crucial aspect of the XAIB design that dictates most decisions.
The main idea is a benchmark that does not act as an arbiter of quality but rather facilitates research in quality estimation and provides a platform for experiments. Many traditional benchmarks are implemented in the following fashion: there are train and test datasets; the first is published, while the second is kept secret. In this scenario, a benchmark accepts a solution in a predefined format, evaluates it on its own using the test dataset, and then places the solution on a leaderboard. All of these measures prevent competitors from overfitting or cheating by using test data.
Most existing explainability benchmarks are designed in a similar way. The data may be open, but this centralized design remains unchanged. Since the XAIB features a multitude of metrics and evaluation setups in general, it is difficult to “overfit” on every setup when designing an explainer. This is why the XAIB does not work like a regular benchmark, and this is why extensibility and unified interfaces are important for it. It allows for the creation of various ways to measure the quality of explainers without rewriting any existing entities or evaluation code.
One of the main features that facilitates this modular design is the way the interfaces work. XAIB entities are designed using the inversion of control principle: they receive as much configuration information as possible at the initialization stage rather than at the time of actual usage. For example, explainers accept both a test dataset and a model at initialization time rather than receiving them at the time of the actual explanation. Cases obtain a dataset, a model, and an explainer, and then they can be evaluated without specifying this information. This design allows the whole evaluation experiment to be transparently created and configured in one place and then passed to the evaluation pipeline. Creation can be handled by the benchmark itself (this is what factories are for) or by the user for full control over parameters. The created evaluation pipeline can then abstract all implementation details; it will just run an evaluation and record the results.
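The effect of this principle can be illustrated with a small sketch: once an object is fully configured at creation, a generic loop can run any number of such objects uniformly. The class and function names below are hypothetical simplifications, not the benchmark's actual interfaces.
    # Sketch of inversion of control: the object receives its configuration
    # at creation time, and the evaluation call needs nothing else.
    class ConfiguredCase:
        def __init__(self, ds, model, explainer, metric_fn):
            self.ds, self.model = ds, model
            self.explainer, self.metric_fn = explainer, metric_fn

        def evaluate(self):
            # Everything required was supplied on initialization
            explanations = self.explainer.predict(self.ds, self.model)
            return self.metric_fn(explanations)

    # A generic pipeline can then run any number of such cases uniformly
    def run(cases):
        return [case.evaluate() for case in cases]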
In this case, further development is expected, not only by project maintainers but also with the help of the XAI community. Thus, the extensible design of the XAIB becomes not only an engineering solution for handling complexity but also a choice to be open and community-driven.

3.4. Versioning

It is hard to overestimate the importance of the thorough tracking of changes and versioning of evaluation software. Missing versioning can cause reproducibility issues. For example, a researcher installs a copy of evaluation software and obtains metric values for their method. Later, when the evaluation procedure is updated, another researcher can evaluate their method using a newer copy. This situation will lead to the two researchers reporting inconsistent results. Without an indication of the versions, these inconsistencies may remain unnoticed for a long time when different methods are compared.
Semantic versions provide a meaningful reference, indicating not only what copy was used but also what kind of changes were made. This is why the XAIB employs semantic versioning to enable reproducibility and prevent hidden results inconsistencies. Researchers are encouraged to report the full version number with evaluation results when using the benchmark.

3.5. Users and Use Cases

Since users are considered the main focus of the XAIB, it is important to understand the interests of possible users, which is why this question is considered separately. It was decided to divide users into several groups based on their interests. The groups will be named Researchers, Engineers, and Developers. The names do not define the groups themselves but serve as convenient associations for better explanations. All of these groups have different needs that should be addressed with proper tools within the benchmark. Their particular roles are not as important as their goals when using the benchmark in this regard. A wider audience interested in XAI will also be taken into consideration, but it will not be considered as a group. All of the results of the user analysis are summarized in Figure 2. For the sake of simplicity, not every use case is presented in the diagram; instead, some of them are grouped together.

3.5.1. Users

The Researchers group consists of people who are working on their own XAI solutions. The development of an explainer may be at different stages, and at each stage, they can benefit from a benchmark. In the early stages of development, prototypes can be quickly assessed and compared, and different ideas can be tested against several metrics. Low entry requirements benefit users at this stage, allowing the benchmark to be quickly integrated into project operations. Users in those stages of development need a way to effortlessly test their own solution without rewriting it or going through a complicated process of getting measurement results.
In the latest stages of development, as well as for XAI solution maintainers, there is a need for comprehensive evaluation and comparison with other existing methods. To satisfy those needs, a benchmark should feature the ability to pass full evaluation without the need to manually recreate complex setups, as well as the ability to obtain the results of other methods to compare.
In the context of evaluation, the ability to plug in one's own data or model may also be important for people who develop their methods, as they may already have some datasets and models for debugging purposes. If the same data could be used for evaluation, it would help researchers better understand their method in a specific context.
People who intend to reproduce the evaluation results will also be attributed to this category. Public benchmark results should be reproducible to address these requirements.
To meet the requirements of Researchers, a benchmark should have the ability to quickly evaluate a method without the need to rewrite existing code, possibly while also using custom data and models, making a full evaluation and comparison with other methods and reproducing published results. For quick start-up, it should feature clear installation instructions, explicitly defined dependencies, and an easy way to run experiments.
Engineers in our categorization scheme are the people who use XAI methods in their own machine learning tasks. Their most important goal with a benchmark could be choosing the most suitable method for their specific task. In various fields where machine learning is employed, the explainability criteria may be different. A benchmark should feature different measurements covering different facets of explainability that may be less or more important depending on the specific application.
Engineers choosing the most suitable solution may want to see the results of all measured methods on a benchmark’s data, but the ability to test several methods on their own setup is what sets the XAIB apart from existing solutions. Given the multitude of measurements a benchmark could have, Engineers may also need a guide for the properties and metrics. They need the ability to identify key properties for their applications and see how different measures can express those properties.
To meet the requirements of Engineers, a benchmark should have a customizable evaluation, published results, and information about properties and metrics it features. For trying out different methods, it should also provide a common, unified interface for each of them.
The group named Developers is made up of people who seek to broaden the picture of XAI research. Their interaction with a benchmark revolves more around development than usage. They may want to propose new ways to measure the quality of explainers or extend a benchmark in some other way, for example, by adding new types of methods to be measured and the corresponding metrics for them. This group of users should have easy access to source code, documentation, and other materials that will help them effectively grasp the structure of a benchmark and how to contribute to its development. A benchmark should be documented and built to be extensible enough to meet the needs of Developers.
A broader audience in the XAI or general ML field may be interested in finding out what comparison criteria are used, what methods are currently performing better, and other details. Their interests should also be considered, since some people can eventually become one of the other three groups.
In this section, users and their possible interests were discussed. Based on that premise, the main requirements were identified. In the next section, this information will be used to show how the XAIB satisfies these requirements.

3.5.2. Use Cases

The use cases presented in this section are based on the potential users identified in the previous one. The use cases are divided into groups depending on how users can interact with the benchmark and its main functions.
The first group of use cases is utilization of the benchmark. This includes reproducing experiments to verify published metric values. There are also different kinds of evaluations: a single method on a specific setup, a method against the full set of metrics, several methods on some setup, etc.
To reproduce the experiment results on the full setup using the XAIB, one needs to install it and run an evaluation. The benchmark provides installation instructions on its documentation website and GitHub page. The instructions themselves are not so complex, since the package is available on PyPI and can be installed with a single command. The list of additional dependencies is also available and can be installed in the same fashion for users wanting a full setup. After installation, users can run the default evaluation scripts. The evaluation itself is separated into modules by explanation type. For each type of explanation, the results are written in files along with the visualizations in the same directory as the evaluation script.
Reproducing methods is important, but for the users who want to test some specific existing method or their own method, something less general is needed. For this purpose, users can use special interfaces that were mentioned in the Structure section.
Listing 2 offers an example of how to use an existing method. The creation and configuration of entities that are not within the scope of this experiment are encapsulated within the benchmark itself so as not to overwhelm the user with additional complexity. Since the particular explainer is the focus of this experiment, the user is responsible for creating and configuring it.
If the user wants to evaluate this method on all of the cases, the previous example can be continued as shown in Listing 3.
The second group is result inspection. The main function of a benchmark is to provide users with a ranking of competing methods. To be able to compare methods in general, users should be able to see the aggregated results for each method. For a more in-depth view of each method, in particular, metric values for each method should be presented separately. Since the XAI evaluation is very high-dimensional, it requires special care when visualizing results. For a benchmark, the representation of results is one of the most important aspects.
Listing 2. Existing method evaluation example.
    from xaib.explainers.feature_importance.lime_explainer import LimeExplainer
    from xaib.evaluation import DatasetFactory, ModelFactory

    train_ds, test_ds = DatasetFactory().get("synthetic")
    model = ModelFactory(train_ds, test_ds).get("svm")
    explainer = LimeExplainer(train_ds, labels=[0, 1])
    sample = [test_ds[i]["item"] for i in range(10)]
    explanations = explainer.predict(sample, model)
Listing 3. Evaluation of the method on all cases.
    from xaib.evaluation.feature_importance import ExperimentFactory
    from xaib.evaluation.utils import visualize_results

    experiment_factory = ExperimentFactory(
        repo_path="results",
        explainers={"lime": explainer},
        test_ds=test_ds,
        model=model,
        labels=[0, 1],
        batch_size=10
    )
    experiments = experiment_factory.get("all")
    for name in experiments:
        experiments[name]()
    visualize_results("results", "results/results.png")
The XAIB can help those wanting to inspect evaluation results by using the information published on the documentation website. For a general comparison, there is an aggregated bar plot with all the methods averaged over all setups. The website also provides results for each method per dataset for comparing the influence of datasets on the method’s performance. In addition to that, a full table with all metric values for each combination tested is also published. All visualizations are automatically produced using saved evaluation results.
Aside from the results, the documentation itself provides information on every entity of a benchmark as follows: datasets with general descriptions of features and links to the sources, models with categorization based on task and interpretability, explainers along with baseline explainers, brief descriptions, and source links. Explainers are divided according to their type. Metrics are divided in the same fashion; their descriptions contain information on the case they belong to and other information such as direction (whether higher or lower values are better) and a link to the source code.
The third group of cases is extending the benchmark. For a benchmark to stay up-to-date, it needs to incorporate the latest methods. To grow and extend with the field itself, a benchmark should be easily extensible, even for an external developer. This requires a set of interfaces for each entity and detailed documentation on how to contribute.
Writing their own implementation of an explainer is required not only for users who want to extend the benchmark but also for those wanting to use their own solution that is not featured in the defaults. To satisfy the needs of those users, the XAIB provides documentation written specifically on the topic of implementing new explainers.
Listing 4 demonstrates the process of adding a new explainer. The user is required to set a name and implement a method that is used to obtain explanations.
Listing 4. Example of adding an explainer.
    import numpy as np
    from xaib import Explainer


    class NewExplainer(Explainer):
        def __init__(self, *args, **kwargs):
            self.name = "new_explainer"
            super().__init__(*args, **kwargs)

        def predict(self, x, model):
            return np.random.rand(len(x), len(x[0]))

4. Experimental Evaluation

This section provides a description of how experiments are conducted within the benchmark. First, it describes the setup needed for the experiments and then briefly describes the metrics used and the results obtained. Since the focus of this work is not the set of metrics, the description is very concise. In the final part, the novel property of the benchmark is described.
Experiments serve as a central evaluation workflow that is embedded in the benchmark structure as a separate module. It is implemented as a set of scripts using XAIB tools to evaluate every compatible combination of dataset, model, explainer, and metric. The main goals of building this default workflow are reproducibility and automation in obtaining complete results. However, the default workflow is just one of the many possible ways of using the XAIB to evaluate XAI methods, and users are encouraged to create their own procedures.
Experiments are divided into two parts according to the explainer type: feature importance and example selection. For each of the groups, their own metrics were chosen for different properties.
It is important to note the role of experiments and metrics in this work. Although metric values and concrete measurement details are at the core of the evaluation, in this paper, they are intentionally omitted to maintain the focus of the work on the evaluation infrastructure itself. Experiments in this work demonstrate the capabilities of the benchmark to extend, enabling the evaluation of various types of explainers. This is the reason some details about metrics may be omitted in the following sections. However, since the role of metrics is still crucial, they hold a separate part on the benchmark’s documentation website.
In this section, the results of the experiments will be discussed. First, the setup will be briefly described, a set of metrics will be outlined, and finally, the actual metric values will be presented.

4.1. Experiment Setup

Corresponding to different types of explanation methods, two experiment runs were conducted as follows: one for feature importance and the other for example selection methods. In the following section, experiments are described, and their results are interpreted and discussed. To set up an experiment with the given set of metrics in the benchmark, one needs to provide a dataset, a trained model, and an explainer. Numerous combinations of datasets and models exist, but datasets, models, explainers, and cases are not always compatible with each other. To avoid invalid comparisons, incompatible combinations were manually excluded from the experiments using the setup functionality.

4.1.1. Datasets

All datasets that were included in the experiments are commonly known among machine learning researchers and are available using the scikit-learn library interface [23]. At this stage of benchmark development, only small datasets were included to be able to experiment with the development of the benchmark quickly, without the need for special treatment or hardware for large-scale datasets. This principle applies to models as well. Since the benchmark was built to be extensible, larger datasets and models will be integrated into it further in the development process.
On the side of the benchmark, a special sklearn wrapper was implemented to adapt the interfaces. It accepts the name of the dataset from the library and the name of the split either “train” or “test”. Then, it fetches the mentioned dataset, splits it, and returns the wrapper. The need to wrap datasets is crucial because of the differences between interfaces. In the XAIB, datasets are represented as a collection of data points that are accessible with indices. Each data point is a dictionary with keys that represent the names of the columns or channels inside a dataset. Special wrappers for the sklearn datasets transform raw numpy arrays returned from the library to the interface described before.
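As an illustration, such an adapter could be sketched as follows; the class name and constructor arguments are assumptions for the example rather than the exact wrapper implementation.
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    class SklearnToyDataset:
        # Fetches a named sklearn toy dataset, splits it, and serves
        # dictionary data points for the requested split
        def __init__(self, name, split="train", test_size=0.2):
            x, y = getattr(datasets, f"load_{name}")(return_X_y=True)
            x_tr, x_te, y_tr, y_te = train_test_split(
                x, y, test_size=test_size, random_state=0
            )
            self.x, self.y = (x_tr, y_tr) if split == "train" else (x_te, y_te)

        def __len__(self):
            return len(self.x)

        def __getitem__(self, index):
            return {"item": self.x[index], "label": self.y[index]}

    train_ds = SklearnToyDataset("iris", split="train")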
The datasets that were used in all experiment runs are the following: breast_cancer [24], digits [25], wine [26], iris [27], and two synthetic datasets named synthetic and synthetic_noisy that were generated with custom parameters.
The dataset named synthetic was generated using the standard scikit-learn method. Artificial data are crucial in these experiments to effectively debug every method, since methods and models can be put under controllable conditions. For the experiments, a dataset of 100 samples with 14 features was generated for the binary classification task. The feature values are both positive and negative, and all features were assigned to be important. No interacting or repeated features were inserted. The points form two clusters, one for each class.
The dataset named synthetic_noisy was introduced to understand how robust the methods are. It was generated in the same fashion as the regular synthetic dataset but with different settings: 100 rows with 14 features each, of which 7 are informative, 5 are redundant, and 2 are repeated. Each class has two clusters. Each dataset was split using the ratios 0.8 and 0.2 for the training and evaluation sets, respectively.
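For illustration, datasets with these characteristics could be generated with scikit-learn roughly as follows; the exact parameter values and random seed below are assumptions based on the description above, not the precise generation code.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # "synthetic": all 14 features informative, one cluster per class
    x, y = make_classification(
        n_samples=100, n_features=14, n_informative=14,
        n_redundant=0, n_repeated=0, n_classes=2,
        n_clusters_per_class=1, random_state=0,
    )

    # "synthetic_noisy": 7 informative, 5 redundant, 2 repeated features,
    # two clusters per class
    x_noisy, y_noisy = make_classification(
        n_samples=100, n_features=14, n_informative=7,
        n_redundant=5, n_repeated=2, n_classes=2,
        n_clusters_per_class=2, random_state=0,
    )

    # 0.8/0.2 split into training and evaluation sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)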

4.1.2. Models

The models that were used include an SVM-based classifier and a neural network for feature importance experiments and K-nearest neighbors for example selection ones. SVM and NN are considered black-box models, and this is why they were chosen as the first models for evaluation. Similarly, as with the datasets, to not overcomplicate initial benchmark development, large-scale models were not included, but they can be added later in the development and expansion. All models were initialized with the default library configurations.

4.1.3. Explainers

The compared methods include both baselines and real feature importance methods. The real methods are represented by shap [28] and LIME [29]. Shap uses the default method provided by the library with the default configuration. LIME was also used without any custom setup, with default values for the number of samples and distance metric.
In the case of example-based methods, KNN with different distance measures was used. KNN is an interpretable model and is used both as a model and as an explainer for itself.
It is important to note that every explanation method outputs a different distribution of values. Shap, for example, gives both positive and negative importance scores, indicating different contributions of features. To obtain comparable metric values for a correct comparison, all explanations were normalized per batch: the minimum value of the batch was subtracted from each value, and then all values were divided by the range of values. If the range is zero and all values are zero, the vector remains the same, and if all values are equal, the all-ones vector is returned.
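A minimal sketch of this per-batch normalization, assuming min-max scaling with the stated edge cases, is given below.
    import numpy as np

    def normalize_batch(batch):
        # Min-max normalization over the whole batch of explanation values
        batch = np.asarray(batch, dtype=float)
        lo, hi = batch.min(), batch.max()
        rng = hi - lo
        if rng == 0:
            # All values equal: keep an all-zero batch as-is,
            # otherwise return the all-ones vector
            return batch if hi == 0 else np.ones_like(batch)
        return (batch - lo) / rng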
Baselines are a very important part of the evaluation because they can serve as an estimate of the best, average, or worst theoretical case for a metric. The constant explainer was configured to always return dummy results in the form of vectors of all ones. Furthermore, the random baseline was set to test the normalization so as to return values from an arbitrary negative range from −25 to −5. Example selection baselines are different from feature importance ones. The constant always returns the first element of the dataset, and the random one for each call returns some random example. Baselines are important for the utility of metrics, as they can serve as sanity checks for them. A metric can be considered useful if it measures the desired property. So, when it is certainly known that the property is not satisfied, the metric should indicate this by showing the difference between the baseline dummy method and a real one.
After the preparation of datasets and training models, all metrics were computed for every combination of dataset, model, explainer, and case. For each type of explanation, separate experiment pipelines were carried out with common models and datasets. The version of the benchmark that is considered is 0.4.0, which was the latest at the moment of writing this article. All entities currently present in the XAIB are presented in Table 1.

4.2. Metrics

One of the main areas where the XAIB contributes to the evaluation of XAI is putting into practice the clear evaluation system proposed in the work of Nauta et al. [8]. The Co-12 framework is a set of desirable properties for an explainer that can be used to evaluate and compare different methods.
Since the focus of this work is on the benchmark itself, the details of the metric implementations are left outside its scope; they are nevertheless available in the documentation, and their importance is well understood.
The results of the experiments still require explanations for the meaning of the metrics that were computed, and this is what this section is devoted to.
Not every one of the 12 properties was covered for every explainer type during the implementation of metrics; some of them may be inapplicable, and some have simply not been covered yet. Below, only the properties currently addressed are listed.
Correctness, for example, was measured for both types of explainers in different ways with a metric called the model randomization check (MRC). A Correctness metric should indicate that the explanations describe the model correctly and truthfully. The explanations may not always seem reasonable to the user, but they should be true to the model to satisfy this criterion. Continuity is measured for both types with a small noise check (SNC), which measures the stability of the explanation function; continuous functions are desirable because they are considered more predictable and comprehensible. A Contrastivity measure should show how discriminative the explanation is with respect to different targets. The contrast between different concepts is very important, and the explanation method should explain instances of different classes in different ways. For this benchmark, we implemented label difference (LD) and target discriminativeness (TGD) for the feature importance and example selection types, respectively. A Coherence metric should show to what extent the explanation is consistent with relevant background knowledge, beliefs, and general consensus. Agreement with domain-specific knowledge can be measured directly, but it is difficult to define and highly task-dependent, so the measure is proxied by the agreement between different methods. It is represented by a metric called different methods agreement (DMA) for feature importance and same class check (SCC) for example selection. Compactness measures the size of the explanations: they should be sparse, short, and not redundant. The size can be measured directly in some cases; for feature importance the size is always the same, but sparsity differs, which is why Compactness was measured through sparsity (SP) for feature importance. Covariate Complexity means that the features used in the explanation should be comprehensible, and non-complex interactions between features are desired. It is measured as covariate regularity (CVR) for both types. For the precise formulations and implementations of all the metrics and up-to-date information, the reader is encouraged to visit the documentation.
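As a purely illustrative example of how such a functional metric can be structured, the sketch below follows the general idea behind a small-noise continuity check: perturb the inputs slightly and compare the explanations before and after. It is an assumption about the general scheme, not the XAIB implementation; the documentation contains the exact formulations.

```python
# Illustrative continuity-style check in the spirit of SNC (not the XAIB implementation).
import numpy as np

def small_noise_check(explain_fn, X: np.ndarray, sigma: float = 0.01, seed: int = 0) -> float:
    """Average distance between explanations of original and slightly perturbed inputs."""
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(scale=sigma, size=X.shape)
    original = np.stack([explain_fn(x) for x in X])
    perturbed = np.stack([explain_fn(x) for x in X_noisy])
    # Lower values indicate a more continuous (more stable) explanation function.
    return float(np.mean(np.linalg.norm(original - perturbed, axis=1)))
```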
Although metrics were implemented from scratch, they are mostly very similar to the existing solutions. Measures of Correctness, Continuity, Contrastivity, and Coherence are well known. Compactness and Covariate Regularity may be more novel.
In this paper, a detailed description of the metrics is intentionally not provided in order to emphasize a key idea: we believe that the benchmark is more than a set of metrics. The initial proof of concept features a set of metrics whose main purpose is to demonstrate the capabilities of the XAIB. Metrics are a central part of the benchmark, but their detailed treatment is outside the scope of this work, which focuses on how to deliver any metric to the user.
The development of new metrics is an area of research in itself and the detailed description of metrics is worth a separate paper. The flexibility of the XAIB should help facilitate further research in this area. The benchmark and its evaluation process should not be limited to the set of metrics mentioned; it is designed to be further extended to cover more properties. The set of properties already covered and the implementation details of the metrics may not be clearly justified in this paper due to scope limitations.
The properties themselves should not be considered covered when only one metric is implemented for them. The metrics are not perfect for every use case, and a more complete picture of a single property can be obtained when it is measured using several different metrics.
Considering this, the set and formulations of metrics always evolve with time, and the implementation of new metrics or metrics that are already known to the community represents the direction for the future development of the XAIB.

4.3. Experiment Results

In Table 2, the results of the tested feature importance methods are presented for each metric. Bold represents the best result among real methods; the baselines are excluded. The arrows represent the direction of a metric as follows: up is “the greater the better”, and down is “the less the better”.
In the correctness metric, no real method was better than the baseline random method, which means that both methods are not as true and sensitive to the model as they may seem. However, random explanations have a great advantage when it comes to the representation of a random model, so this is expected.
The method that is best for Continuity is the constant explainer, because its explanations never change under any noise, but shap is notably very close to that value as well. Shap is also the most contrastive method; this is likely because it reflects the influence of different features on the label with values that may be positive or negative.
LIME is the simplest method in terms of features used: it generates sparser, more understandable vectors and is also more sensitive to the model.
The coherence values of the two methods are very close. This metric, in the way it is computed, benefits from more methods to form some sort of “common sense”, so there should be more methods to gain confidence in those values.
The results of the example selection methods are presented in Table 3. KNN-based approaches show the best results on almost all metrics. The most interesting result is the one for target discriminativeness (TGD): examples selected by KNN train a better model than randomly selected ones, which may be the basis for further experiments on this matter.
The updated results, along with descriptions of the data, models, methods, and metrics are available on the documentation website. The development is active, which means that new entities will appear with each release, further broadening the XAI evaluation landscape.

4.4. Method Comparison

In terms of evaluation, one of the most important aspects that sets the XAIB apart from existing benchmarks is its ability to compare explainers in different contexts.
In most existing XAI evaluation solutions, this is not the case. Datasets and models are treated as constants in evaluation experiments and are usually embedded or hardcoded. This is acceptable for research and the theoretical study of explanation quality. However, when explainers need to be evaluated for a specific application whose models and datasets differ from those embedded in the evaluation, this scheme is not flexible enough.
In the XAIB, the nature of most metrics makes data and models pluggable into the measurement process. Since the dataset and the model are variables, this easily enables, for example, experiments where one of them is varied while the other stays constant. Other experiments, including evaluations on a specific scenario, also become available.
Using the terms established earlier, previous works on benchmarking XAI methods focused on the needs of Researchers, but not those of the Engineers. With the benchmark proposed, Engineers are now able to use their own models and data to evaluate several explainers on their own setup and figure out what works best for the application.
The following example demonstrates how different metric values can be obtained when the dataset–model pair changes while the explainers stay the same. In Figure 3, the results of shap and LIME on the SVM classifier trained on the breast_cancer dataset are shown against all metrics. Metric values were not plotted as-is; they were transformed for visualization purposes only: they were normalized to the range [0, 1], and the ones with a downward direction (the lower the better) were multiplied by −1. This transformation makes all metrics comparable on one graph and gives them an upward direction. According to these results, shap achieves better metric values than LIME on almost all metrics. In this case, if there are no preferences for some set of metrics, one could choose shap over LIME.
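A minimal sketch of this visualization transform, assuming per-metric min-max normalization across the compared methods followed by a sign flip for the "lower is better" metrics:

```python
# Sketch of the transformation used before plotting: normalize each metric column
# to [0, 1] across methods and flip the sign of "lower is better" metrics.
import numpy as np

def prepare_for_plot(values: np.ndarray, lower_is_better: np.ndarray) -> np.ndarray:
    # values: (n_methods, n_metrics); lower_is_better: boolean mask over metrics
    mins, maxs = values.min(axis=0), values.max(axis=0)
    ranges = np.where(maxs - mins == 0, 1.0, maxs - mins)
    normalized = (values - mins) / ranges
    normalized[:, lower_is_better] *= -1  # downward metrics now point "up" when smaller
    return normalized
```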
However, if the dataset–model pair changes (assuming that another task is solved), the results change, as shown in Figure 4. The methods swap places, and LIME becomes better on the majority of metrics, so if there are no preferences among metrics, one could choose LIME for this application. This example demonstrates how the XAIB can facilitate experiments for Researchers and enable setup-specific evaluation for Engineers at the same time.

5. Discussion

In this section, a detailed comparison is made and the limitations of the benchmark are discussed. The proposed solution has limitations both in terms of its evaluation metrics and its software implementation. After that, possible directions for future research are proposed.

5.1. Comparison

In order to highlight the contribution of this work, this section provides a comparative analysis of the proposed benchmark against the currently existing solutions. Flexibility, extensibility, and usability are the main focus of this analysis, although other criteria such as interpretability and property coverage were also considered. Table 4 summarizes different aspects of the existing XAI benchmarks and also includes the one proposed in this work.
The flexibility of an XAI benchmark can be achieved in many ways. Given the diversity of the ML landscape with different data types, tasks, models, and ways to explain them, a flexible benchmark should be able to cover as much as possible.
However, current solutions do not provide such flexibility in terms of tasks, models, and especially explanation types. Most of the works do not go beyond tabular data classification, and all of them focus solely on feature importance methods. The XAIB, in turn, provides an example of a flexible benchmark, offering the ability to evaluate not only feature importance explainers but also those that use examples as explanations. It can be argued that the only data type and task pair implemented in the XAIB is also classification over tabular data, but there is one crucial difference: compared to other solutions, the XAIB was built with different tasks in mind. Although tabular data classification was the first choice for obvious reasons, other data types and tasks can be added without rebuilding everything from scratch. The applicability row in Table 4 illustrates this very clearly; most of the other solutions can only use predefined sets of datasets, models, explainers, or tests.
Extensibility is closely related to the previous property. An extensible benchmark should not only allow the evaluation of different explainers (this is the bare minimum) but should also provide the means to experiment with different data, models, and metrics.
The applicability and documentation rows in Table 4 suggest that some solutions offer the ability to evaluate custom explainers; for example, OpenXAI, Compare-xAI, and M4 have this extensibility option. Other extension directions, however, are not supported. For example, users who want to evaluate an explainer on their own dataset and model cannot use OpenXAI or Compare-xAI. They can leverage M4 if their data are images or text and their model is a neural network; in other cases, those users have no options.
In this regard, at present, only the XAIB provides a full set of extension directions. It provides the user with the ability to experiment with their own implementations of datasets, models, explainers, and metrics. As long as interface compatibility is ensured, users can combine existing entities with custom ones, which is supported not only with the code but also with the detailed documentation.
Usability is an important criterion in the open-source community. A benchmark should be open, providing documentation and a results table, and it should not be difficult to set up and run experiments.
Almost every existing benchmark provides some documentation that highlights different aspects of its usage. Mostly, this covers the basics, for example, how to reproduce the results. Usually, there is no information on how to submit a new method or contribute to the development in any other way. Providing up-to-date results is also an important and seemingly underrepresented feature, and easy installation and use of the package is missing from most of the existing solutions.
The proposed benchmark provides detailed documentation, not only on reproducing the results but also on usage with custom entities and separate sections on how to contribute. Users can also install the XAIB with a single command, making it much more accessible to users from different scientific backgrounds, who may find manual installation more difficult.
Table 5 shows which of the Co-12 properties are measured by the metrics of existing benchmarks. The benchmarks feature numerous metrics; however, in terms of completeness of property coverage, they lack variety. The coverage of the Co-12 properties was analyzed by attributing each metric presented in the respective papers to the property whose definition best fits it. The existing benchmarks feature multitudes of metrics that should facilitate a comprehensive evaluation, yet the particular set of metrics chosen is rarely discussed or justified, and the completeness of the chosen metrics is usually not brought up in the papers. Analyzing the proposed metrics through the lens of one of the most complete systems of properties available makes their completeness clearer.
For example, the saliency eval benchmark provides the Human Agreement metric, which is considered to belong to the Context property; Faithfulness measures Correctness; Confidence Indication is conceptually very similar to Contrastivity; and the Rationale/Dataset Consistency metrics belong to the Consistency property.
The XAI-Bench is very narrow in terms of properties; however, it goes deep into the measures of Completeness. Faithfulness, ROAR, Monotonicity, and Infidelity are all different measures of the same property and can reflect different aspects of it. GT-Shapley is likely to measure Coherence, since this ground truth can reflect some form of “common sense”.
OpenXAI, although featuring 22 metrics, does not cover as many properties. It features GT Faithfulness, which, similarly to the previous case, is a measure of Coherence. Aside from that, its metrics are divided into two groups: Faithfulness as a measure of Correctness and Stability as a measure of Continuity.
Compare-xAI provides a multitude of tests divided into six categories: Fidelity and Fragility as measures of Correctness; Stability, which, similarly to OpenXAI, is a metric for Continuity; Simplicity, which is likely to measure Coherence; and Stress and Other, which relate to the Context property.
M4 features five metrics; however, they are all different measures of Faithfulness, which in our framework corresponds to Correctness.
Analyzing the table according to the classification made by Nauta et al., the properties categorized as Content (Correctness, Completeness, Consistency, Continuity, Contrastivity, Covariate Complexity) are the most covered. Properties in the categories Presentation (Compactness, Composition, Confidence) and User (Context, Coherence, Controllability) are the ones that are poorly covered at this moment.
Considering the comparison made, XAIB is more flexible, allowing for the deep customization of the evaluation process. It helps widen the scope of XAI evaluation by featuring more types of explanation and allowing the use of custom datasets and models in the evaluation. It opens new extensibility directions that were previously unaddressed, while making steps towards usability by introducing clean versioning, dependency management, and distribution.
In addition, it also covers most of the properties, making it the broadest existing XAI benchmark. Although the XAIB implements only six out of twelve properties across different explainer types at the initial stage, all properties can be covered in the future.

5.2. Limitations

Functional ways to assess quality seem to be a good solution. They are very cheap to compute compared to user studies and provide clear comparison criteria. As long as the metrics themselves are interpretable enough, it should be easy to make a decision by comparing numbers. Those may be the reasons why functional evaluation is gaining more popularity at the moment. There is a certain demand to bring XAI evaluation to the standard of AI evaluation. However, this does not seem possible. The main difference between AI and XAI, and the main reason for this impossibility, is the presence of a human. Since interpretability depends solely on human perception, it becomes difficult to formalize, in contrast to performance measures, which are formulated mathematically in the first place and, in essence, are a convenient way to aggregate lots of observations. In addition to that, accuracy (in a broad sense) and its properties are well defined, while interpretability is not, which is a major point of criticism for the field. Considering this, one should approach functional measures with care and perceive them only as guides and proxies of the real interpretability properties they are trying to represent.
All of the above means that when working in high-risk conditions and when building reliable machine learning systems that will have a great influence on human lives, one should not rely solely on the values of quantitatively and functionally obtained quality metrics. For a complete evaluation, human-grounded and functionally grounded experiments are required. Only by using insights from every method of quality measurement can one make an informed decision.
In the following paragraphs, the limitations of the existing implementation are considered. The XAIB was designed to be universal and easily extensible; however, those qualities come with their own trade-offs. Analyzing them, the following difficulties were identified: performance issues, combinatorial explosion, compatibility management issues, and the cost of implementation.
Performance issues arise when building systems composed of many independent components. This design impedes advanced optimizations that would be possible if the solution were monolithic and task-specific, and the abstractness and independence of the components require additional management work, which adds to the performance costs. In addition, machine learning in general tends to be demanding in terms of compute, and supporting compute- and memory-intensive solutions creates challenges that should be addressed in the future. The limited set of entities currently featured in the benchmark is an intentional choice made to simplify development by avoiding premature optimization for demanding solutions and focusing on the main benchmark principles.
Considering the number of different entities that make up an XAI benchmark, the evaluation task is highly dimensional. Adding a single new entity creates a number of new combinations with entities of other types, which could lead to an exponential increase in evaluation runs for each method. Although this issue is virtually unavoidable, its computational cost can be mitigated, and performance does not appear to be the major concern here. The management of such a system seems to be the most difficult challenge: every entity, be it a dataset, a model, or an explainer, can be incompatible with others. For example, a dataset without class labels is incompatible with a classification model, which in turn can be incompatible with some types of explainers that do not work with some metrics, and so on. Not only can this property of the evaluation task cause errors when incompatible entities are combined, but the solutions to this problem must also be chosen carefully: some of them introduce a lot of additional complexity, so a good trade-off is needed to preserve ease of use.
Aside from those difficulties, result aggregation and interpretation also become problems that should be addressed in future research. This brings us to the last limitation mentioned, the implementation cost. When the system is abstract and every entity requires a lot of metadata to work, it becomes harder to develop new solutions. Although this is not the case at the current stage, as the benchmark expands, a metadata handling and validation system will have to be built to ensure efficient compatibility management. Furthermore, it seems inevitable that more weight will be put on the shoulders of new contributors.
Each limitation can be regarded as a challenge to future research and the development of the XAIB.

5.3. Future Work

Future work on the XAIB could focus on the extension of the evaluation landscape and addressing the numerous challenges that come along with this process.
Evaluation can be extended in numerous directions, but filling the gaps in the case coverage and different explanation types should be a priority.
Performance issues when dealing with compute-intensive solutions can be addressed from the benchmark side by minimizing the number of calls to heavy algorithms. One way to implement this is through extensive caching, for example, storing and retrieving already trained models and caching predictions and explanations between different experiments.
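One possible shape of such a cache, sketched with joblib; this is a suggestion rather than an existing XAIB feature, and `slow_explain` merely stands in for any compute-heavy explainer call.

```python
# Sketch of disk-based caching of expensive explanation calls between experiments.
import numpy as np
from joblib import Memory

memory = Memory("./xaib_cache", verbose=0)  # hypothetical on-disk cache location

@memory.cache
def slow_explain(method_name: str, X: np.ndarray) -> np.ndarray:
    # Placeholder for an expensive explanation computation.
    return np.abs(X) / np.abs(X).sum(axis=1, keepdims=True)

X = np.random.default_rng(0).normal(size=(100, 14))
first = slow_explain("shap", X)   # computed and stored on disk
second = slow_explain("shap", X)  # served from the cache on repeated calls
```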
One of the most important issues—compatibility—could be addressed by implementing a data validation mechanism inside the benchmark that would not allow the use of incompatible entities and would help users test the correctness of their own implementations.
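A very rough sketch of how such metadata-based validation could look is given below; all names and fields are hypothetical and are not part of the current XAIB.

```python
# Hypothetical metadata-based compatibility check between benchmark entities.
from dataclasses import dataclass, field

@dataclass
class EntityMeta:
    name: str
    provides: set = field(default_factory=set)  # e.g. {"features", "labels"}
    requires: set = field(default_factory=set)  # e.g. {"labels"} for a classifier

def check_compatible(upstream: EntityMeta, downstream: EntityMeta) -> None:
    missing = downstream.requires - upstream.provides
    if missing:
        raise ValueError(
            f"{downstream.name} requires {sorted(missing)} "
            f"which {upstream.name} does not provide"
        )

dataset = EntityMeta("unlabeled_dataset", provides={"features"})
model = EntityMeta("classifier", requires={"features", "labels"})

try:
    check_compatible(dataset, model)
except ValueError as err:
    print(err)  # classifier requires ['labels'] which unlabeled_dataset does not provide
```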
Research can also benefit from a more detailed comparison of XAI benchmarks. Since usability is one of the main focuses of the XAIB, the usability comparison with similar solutions can be deepened.
Future development should focus on expanding the benchmark, addressing the challenges mentioned, and others that will arise as the development progresses.

6. Conclusions

Being a relatively new field, XAI seems to lack common ground on numerous questions. This disagreement ranges from different definitions of central terms to the varying language used to describe similar phenomena. However, this immaturity does not match the level of responsibility placed on the field. Solutions that emerge as attempts to create evaluation standards have a very narrow scope, either in terms of the type of explanations (almost always feature importance) or evaluation metrics.
The benchmark that was proposed in this work is an attempt to fill those gaps. It is designed to include various explanation types and metrics, feature interfaces that are built to be extensible, and documentation covering every aspect of this process. The use of the latest advancements in XAI evaluation enabled us to build the XAIB on a complete framework of interpretability properties.
Providing easier access to the evaluation of explainers, the XAIB aims to become not only a benchmark in the traditional sense but also a platform for evaluation experiments, which is the foundation for further research in XAI.

Author Contributions

Conceptualization, S.K.; methodology, K.B.; software, I.M.; validation, K.B.; formal analysis, I.M.; investigation, I.M.; resources, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, K.B.; visualization, I.M.; supervision, K.B.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the Russian Science Foundation, agreement No. 24-11-00272, https://rscf.ru/project/24-11-00272/ (accessed on 29 October 2024).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bryce Goodman, S.F. European union regulations on algorithmic decision-making and a “right to explanation”. arXiv 2016, arXiv:1606.08813. [Google Scholar]
  2. Markus, A.F.; Kors, J.A.; Rijnbeek, P.R. The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. J. Biomed. Inform. 2021, 113, 103655. [Google Scholar] [CrossRef] [PubMed]
  3. Abdullah, T.A.; Zahid, M.S.M.; Ali, W. A review of interpretable ML in healthcare: Taxonomy, applications, challenges, and future directions. Symmetry 2021, 13, 2439. [Google Scholar] [CrossRef]
  4. Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable machine learning—A brief history, state-of-the-art and challenges. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2021; pp. 417–431. [Google Scholar]
  5. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
  6. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
  7. Saeed, W.; Omlin, C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl.-Based Syst. 2023, 263, 110273. [Google Scholar] [CrossRef]
  8. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; Van Keulen, M.; Seifert, C. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Comput. Surv. 2023, 55, 1–42. [Google Scholar] [CrossRef]
  9. Bodria, F.; Giannotti, F.; Guidotti, R.; Naretto, F.; Pedreschi, D.; Rinzivillo, S. Benchmarking and survey of explanation methods for black box models. Data Min. Knowl. Discov. 2023, 37, 1719–1778. [Google Scholar] [CrossRef]
  10. Huysmans, J.; Dejaeger, K.; Mues, C.; Vanthienen, J.; Baesens, B. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decis. Support Syst. 2011, 51, 141–154. [Google Scholar] [CrossRef]
  11. Kulesza, T.; Stumpf, S.; Burnett, M.; Yang, S.; Kwan, I.; Wong, W.K. Too much, too little, or just right? Ways explanations impact end users’ mental models. In Proceedings of the 2013 IEEE Symposium on Visual Languages and Human Centric Computing, San Jose, CA, USA, 15–19 September 2013; pp. 3–10. [Google Scholar]
  12. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 2018, 31, 9525–9536. [Google Scholar]
  13. Hase, P.; Bansal, M. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? arXiv 2020, arXiv:2005.01831. [Google Scholar]
  14. Zhang, H.; Chen, J.; Xue, H.; Zhang, Q. Towards a unified evaluation of explanation methods without ground truth. arXiv 2019, arXiv:1911.09017. [Google Scholar]
  15. Bhatt, U.; Weller, A.; Moura, J.M. Evaluating and aggregating feature-based model explanations. arXiv 2020, arXiv:2005.00631. [Google Scholar]
  16. Sokol, K.; Flach, P. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 56–67. [Google Scholar]
  17. Atanasova, P.; Simonsen, J.G.; Lioma, C.; Augenstein, I. A diagnostic study of explainability techniques for text classification. arXiv 2020, arXiv:2009.13295. [Google Scholar]
  18. Liu, Y.; Khandagale, S.; White, C.; Neiswanger, W. Synthetic benchmarks for scientific research in explainable machine learning. arXiv 2021, arXiv:2106.12543. [Google Scholar]
  19. Agarwal, C.; Krishna, S.; Saxena, E.; Pawelczyk, M.; Johnson, N.; Puri, I.; Zitnik, M.; Lakkaraju, H. Openxai: Towards a transparent evaluation of model explanations. Adv. Neural Inf. Process. Syst. 2022, 35, 15784–15799. [Google Scholar]
  20. Belaid, M.K.; Hüllermeier, E.; Rabus, M.; Krestel, R. Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark. arXiv 2022, arXiv:2207.14160. [Google Scholar]
  21. Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models. In Proceedings of the NeurIPS, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  22. Naser, M. An engineer’s guide to eXplainable Artificial Intelligence and Interpretable Machine Learning: Navigating causality, forced goodness, and the false perception of inference. Autom. Constr. 2021, 129, 103821. [Google Scholar] [CrossRef]
  23. Scikit Learn. Toy Datasets. Available online: https://scikit-learn.org/stable/datasets/toy_dataset.html (accessed on 29 October 2024).
  24. Wolberg, W.; Mangasarian, O.; Street, N.; Street, W. Wisconsin Diagnostic Breast Cancer Database; UCI Machine Learning Repository. 1993. Available online: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (accessed on 28 October 2024). [CrossRef]
  25. Garris, M.D.; Blue, J.L.; Candela, G.T.; Grother, P.J.; Janet, S.; Wilson, C.L. NIST Form-Based Handprint Recognition System; National Institute of Standards and Technology: Gaithersburg, MD, USA, 1997. [Google Scholar]
  26. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, J.; Reis, J. Wine Quality. UCI Machine Learning Repository. 2009. Available online: https://archive.ics.uci.edu/dataset/186/wine+quality (accessed on 28 October 2024). [CrossRef]
  27. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  28. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  29. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
Figure 1. Three stages of the general XAIB workflow—Setup, Experiment, and Visualization. Each Setup is a unit of evaluation. It contains all the parameters and entities needed to obtain the values. The execution pipeline takes setups and executes them, writing down the values. The values can then be manually analyzed or put into the visualization stage.
Figure 2. Use case diagram with groups of users. Arrows represent interactions with different components of the XAIB. Each group has different goals; therefore, their interactions are different. Developers contribute new functionalities and entities. Researchers and Engineers interact in a similar way but have different goals. Researchers propose their own method; for them, setup is a variable. When Engineers select a method for their own task, for them, the method is a variable.
Figure 3. Results on the first setup—SVM—on the breast cancer dataset. Metric values are normalized for visualization. Each line represents a single explanation method. In this setup, shap outperforms LIME on the majority of metrics.
Figure 4. Results on the second setup —NN—on synthetic noisy dataset. Metric values are normalized for visualization. Each line represents a single explanation method. In this setup, LIME outperforms shap on the majority of metrics.
Table 1. Summary of entities present in XAIB. The set of entities is not fixed by design and was created to support proof-of-concept with the intention to expand the number of available options. For the abbreviations of metrics, see Section 4.2.
Entity Type | Names
Datasets | breast cancer, digits, wine, iris, synthetic, synthetic noisy
Models | SVC, MLPClassifier, KNeighborsClassifier
Feature importance explainers | Constant, LIME, Random, Shap
Example selection explainers | Constant, KNN, Random
Feature importance metrics | MRC, SNC, LD, DMA, SP, CVR
Example selection metrics | MRC, SNC, TGD, SCC, CVR
Cases | Correctness, Continuity, Contrastivity, Covariate Complexity, Compactness, Coherence
Table 2. Evaluation results for the feature importance methods. The score for each method is averaged across datasets and models. Arrows represent the direction of the metric as follows: the greater the better (↑) or the lower the better (↓). The best performing method is shown in bold.
Method | MRC ↑ | SNC ↓ | LD ↑ | DMA ↓ | SP ↑ | CVR ↑
Const | 0.00 | 0.00 | 0.00 | 1.80 | 0.00 | 0.00
Random | 1.78 | 1.79 | 1.77 | 1.81 | 0.31 | 56.38
shap | 0.88 | 0.23 | 1.27 | 1.16 | 0.11 | 71.06
LIME | 0.99 | 0.48 | 0.86 | 1.28 | 0.16 | 69.51
Table 3. Evaluation results for the example selection methods. The score for each method is averaged across datasets and models. Arrows represent the direction of the metric as follows: the greater the better (↑) or the lower the better (↓). The best performing method is shown in bold.
Method | MRC ↑ | SNC ↓ | TGD ↑ | SCC ↓ | CVR ↑
Const | 1.00 | 1.00 | 0.18 | 0.25 | 12.98
Random | 0.00 | 0.00 | 0.28 | 0.37 | 16.38
KNN (l2) | 0.98 | 0.62 | 0.63 | 0.65 | 16.80
KNN (cos) | 1.00 | 0.63 | 0.63 | 0.65 | 16.84
Table 4. Comparison table of existing XAI benchmarks.
Benchmark | Saliency Eval [17] | XAI-Bench [18] | OpenXAI [19] | Compare-xAI [20] | M4 [21] | XAIB (Ours)
Code publication date | Sep 2020 | Jun 2021 | Jun 2022 | Mar 2022 | Dec 2022 | Oct 2022
Use of ground truth | Human annotations | Synthetic | Logistic regression coefficients | No | No, pseudo, synthetic | No
Documentation | No | Reproduce | Use | Reproduce, add explainer, add test | Reproduce, use | Reproduce, use, add dataset, add model, add explainer, add metric
Explanation types | Feature importance | Feature importance | Feature importance | Feature importance | Feature importance | Feature importance, example-based
Data types | Text | Tabular | Tabular | Tabular | Text, image | Tabular, more will be implemented
ML tasks | Classification | Classification | Classification | Classification | Sentiment, classification | Classification, more will be implemented
Results | In paper | In paper | Online | Online | In paper | Online
Applicability | Specific datasets | Synthetic tests | Specific datasets/models | Hardcoded tests | Datasets and models compatible with PaddlePaddle | Any compatible datasets/models
Versioning | No | No | SemVer | No | SemVer (as a part of InterpretDL) | SemVer
Distribution | No | No | No | No | PyPI (as a part of InterpretDL) | PyPI
Table 5. Co-12 property coverage of existing XAI benchmarks. Properties do not correspond 1:1 to metrics. Each property can be measured in various ways. Existing XAI benchmarks can have a multitude of metrics while measuring only a limited amount of properties. Bold text marks properties that are unique contributions of the XAIB.
Benchmark | Saliency Eval [17] | XAI-Bench [18] | OpenXAI [19] | Compare-xAI [20] | M4 [21] | XAIB (Ours)
Correctness | Yes | - | Yes | Yes | Yes | Yes
Continuity | - | - | Yes | Yes | - | Yes
Coherence | - | Yes | Yes | Yes | - | Yes
Completeness | - | Yes | - | - | - | -
Covariate Complexity | - | - | - | - | - | Yes
Compactness | - | - | - | - | - | Yes
Contrastivity | Yes | - | - | - | - | Yes
Consistency | Yes | - | - | - | - | -
Context | Yes | - | - | Yes | - | -
Confidence | - | - | - | - | - | -
Composition | - | - | - | - | - | -
Controllability | - | - | - | - | - | -
