Open and Extensible Benchmark for Explainable Artificial Intelligence Methods
Figure 2. Three stages of the general XAIB workflow: Setup, Experiment, and Visualization. Each Setup is a unit of evaluation; it contains all the parameters and entities needed to obtain the values. The execution pipeline takes setups, executes them, and records the resulting values, which can then be analyzed manually or passed to the visualization stage.
Figure 3. Use case diagram with groups of users. Arrows represent interactions with different components of XAIB. Each group pursues different goals, so their interactions differ. Developers contribute new functionality and entities. Researchers and Engineers interact in a similar way but with different aims: Researchers propose their own methods, so for them the setup is the variable; Engineers select a method for their own task, so for them the method is the variable.
Figure 4. Results of the first setup (SVM on the breast cancer dataset). Metric values are normalized for visualization, and each line represents a single explanation method. In this setup, **shap outperforms LIME** on the majority of metrics.
Figure 5. Results of the second setup (NN on the synthetic noisy dataset). Metric values are normalized for visualization, and each line represents a single explanation method. In this setup, **LIME outperforms shap** on the majority of metrics.
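The workflow summarized in the Figure 2 caption can be illustrated programmatically. The sketch below shows only the idea of the Setup, Experiment, and Visualization stages; the names `Setup` and `run_experiment` are hypothetical and do not reflect the actual XAIB API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical sketch of the Setup -> Experiment -> Visualization workflow
# from Figure 2; names are illustrative, not the real XAIB interface.
@dataclass
class Setup:
    dataset: Any                  # e.g. the breast cancer dataset
    model: Any                    # e.g. a fitted SVM classifier
    explainer: Any                # e.g. a SHAP or LIME wrapper
    metrics: Dict[str, Callable]  # metric name -> callable(model, explainer, dataset)

def run_experiment(setups: List[Setup]) -> Dict[int, Dict[str, float]]:
    """Execute each setup and record its metric values (the Experiment stage)."""
    results = {}
    for i, s in enumerate(setups):
        results[i] = {name: metric(s.model, s.explainer, s.dataset)
                      for name, metric in s.metrics.items()}
    return results

# The resulting dictionary can be inspected manually or passed on to the
# visualization stage (e.g. the normalized line plots in Figures 4 and 5).
```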
Abstract
1. Introduction
2. Related Work
2.1. Explanation Types
2.2. XAI Evaluation
2.3. Benchmarks
3. Benchmark Structure
3.1. High-Level Entities
3.1.1. Data
3.1.2. Models
3.1.3. Explainers
3.1.4. Metrics
3.1.5. Cases
3.1.6. Factories
3.1.7. Experiments
3.1.8. Setups
Listing 1. Setup example.
3.2. Modules and Dependencies
3.3. Extensibility
3.4. Versioning
3.5. Users and Use Cases
3.5.1. Users
3.5.2. Use Cases
Listing 2. Existing method evaluation example.
Listing 3. Existing method evaluation example.
Listing 4. Example of adding an explainer.
4. Experimental Evaluation
4.1. Experiment Setup
4.1.1. Datasets
4.1.2. Models
4.1.3. Explainers
4.2. Metrics
4.3. Experiment Results
4.4. Method Comparison
5. Discussion
5.1. Comparison
5.2. Limitations
5.3. Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Goodman, B.; Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. arXiv 2016, arXiv:1606.08813.
- Markus, A.F.; Kors, J.A.; Rijnbeek, P.R. The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. J. Biomed. Inform. 2021, 113, 103655.
- Abdullah, T.A.; Zahid, M.S.M.; Ali, W. A review of interpretable ML in healthcare: Taxonomy, applications, challenges, and future directions. Symmetry 2021, 13, 2439.
- Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable machine learning—A brief history, state-of-the-art and challenges. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2021; pp. 417–431.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57.
- Saeed, W.; Omlin, C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl.-Based Syst. 2023, 263, 110273.
- Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; Van Keulen, M.; Seifert, C. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Comput. Surv. 2023, 55, 1–42.
- Bodria, F.; Giannotti, F.; Guidotti, R.; Naretto, F.; Pedreschi, D.; Rinzivillo, S. Benchmarking and survey of explanation methods for black box models. Data Min. Knowl. Discov. 2023, 37, 1719–1778.
- Huysmans, J.; Dejaeger, K.; Mues, C.; Vanthienen, J.; Baesens, B. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decis. Support Syst. 2011, 51, 141–154.
- Kulesza, T.; Stumpf, S.; Burnett, M.; Yang, S.; Kwan, I.; Wong, W.K. Too much, too little, or just right? Ways explanations impact end users’ mental models. In Proceedings of the 2013 IEEE Symposium on Visual Languages and Human Centric Computing, San Jose, CA, USA, 15–19 September 2013; pp. 3–10.
- Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 2018, 31, 9525–9536.
- Hase, P.; Bansal, M. Evaluating explainable AI: Which algorithmic explanations help users predict model behavior? arXiv 2020, arXiv:2005.01831.
- Zhang, H.; Chen, J.; Xue, H.; Zhang, Q. Towards a unified evaluation of explanation methods without ground truth. arXiv 2019, arXiv:1911.09017.
- Bhatt, U.; Weller, A.; Moura, J.M. Evaluating and aggregating feature-based model explanations. arXiv 2020, arXiv:2005.00631.
- Sokol, K.; Flach, P. Explainability fact sheets: A framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 56–67.
- Atanasova, P.; Simonsen, J.G.; Lioma, C.; Augenstein, I. A diagnostic study of explainability techniques for text classification. arXiv 2020, arXiv:2009.13295.
- Liu, Y.; Khandagale, S.; White, C.; Neiswanger, W. Synthetic benchmarks for scientific research in explainable machine learning. arXiv 2021, arXiv:2106.12543.
- Agarwal, C.; Krishna, S.; Saxena, E.; Pawelczyk, M.; Johnson, N.; Puri, I.; Zitnik, M.; Lakkaraju, H. OpenXAI: Towards a transparent evaluation of model explanations. Adv. Neural Inf. Process. Syst. 2022, 35, 15784–15799.
- Belaid, M.K.; Hüllermeier, E.; Rabus, M.; Krestel, R. Do we need another explainable AI method? Toward unifying post-hoc XAI evaluation methods into an interactive and multi-dimensional benchmark. arXiv 2022, arXiv:2207.14160.
- Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A unified XAI benchmark for faithfulness evaluation of feature attribution methods across metrics, modalities and models. In Proceedings of NeurIPS, New Orleans, LA, USA, 10–16 December 2023.
- Naser, M. An engineer’s guide to eXplainable Artificial Intelligence and Interpretable Machine Learning: Navigating causality, forced goodness, and the false perception of inference. Autom. Constr. 2021, 129, 103821.
- Scikit-learn. Toy Datasets. Available online: https://scikit-learn.org/stable/datasets/toy_dataset.html (accessed on 29 October 2024).
- Wolberg, W.; Mangasarian, O.; Street, N.; Street, W. Wisconsin Diagnostic Breast Cancer Database; UCI Machine Learning Repository, 1993. Available online: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (accessed on 28 October 2024).
- Garris, M.D.; Blue, J.L.; Candela, G.T.; Grother, P.J.; Janet, S.; Wilson, C.L. NIST Form-Based Handprint Recognition System; National Institute of Standards and Technology: Gaithersburg, MD, USA, 1997.
- Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, J.; Reis, J. Wine Quality; UCI Machine Learning Repository, 2009. Available online: https://archive.ics.uci.edu/dataset/186/wine+quality (accessed on 28 October 2024).
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
| Entity Type | Names |
|---|---|
| Datasets | breast cancer, digits, wine, iris, synthetic, synthetic noisy |
| Models | SVC, MLPClassifier, KNeighborsClassifier |
| Feature importance explainers | Constant, LIME, Random, Shap |
| Example selection explainers | Constant, KNN, Random |
| Feature importance metrics | MRC, SNC, LD, DMA, SP, CVR |
| Example selection metrics | MRC, SNC, TGD, SCC, CVR |
| Cases | Correctness, Continuity, Contrastivity, Covariate Complexity, Compactness, Coherence |
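These entities build on standard Python tooling: the datasets and models come from scikit-learn, and the Shap and LIME explainers wrap the corresponding packages. As a minimal sketch of the kind of configuration the benchmark evaluates, shown here with the public scikit-learn, shap, and lime APIs rather than XAIB's own wrappers, one can fit an SVC on the breast cancer dataset and obtain feature-importance explanations from both methods:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import shap
from lime.lime_tabular import LimeTabularExplainer

# Fit one of the benchmark models on one of the benchmark datasets.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)
model = SVC(probability=True).fit(X_train, y_train)

# SHAP feature importance (model-agnostic KernelExplainer with a small background set).
background = shap.sample(X_train, 50)
shap_explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = shap_explainer.shap_values(X_test[:5])

# LIME feature importance for a single instance.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=list(data.feature_names), class_names=list(data.target_names))
lime_exp = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=10)
print(lime_exp.as_list())
```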
| Method | MRC ↑ | SNC ↓ | LD ↑ | DMA ↓ | SP ↑ | CVR ↑ |
|---|---|---|---|---|---|---|
| Const | 0.00 | 0.00 | 0.00 | 1.80 | 0.00 | 0.00 |
| Random | 1.78 | 1.79 | 1.77 | 1.81 | 0.31 | 56.38 |
| shap | 0.88 | 0.23 | 1.27 | 1.16 | 0.11 | 71.06 |
| LIME | 0.99 | 0.48 | 0.86 | 1.28 | 0.16 | 69.51 |
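The metrics above live on different scales and directions (↑ means higher is better, ↓ means lower is better), while Figures 4 and 5 plot normalized values. The exact normalization is not restated here, so the snippet below is only an assumed min-max scheme (with inversion for ↓ metrics) to illustrate how such values could be brought to a common [0, 1] range for plotting:

```python
import numpy as np

def normalize_for_plot(values: np.ndarray, higher_is_better: bool) -> np.ndarray:
    """Min-max scale one metric column to [0, 1]; flip it when lower is better.

    This is only an assumed normalization for visualization; the benchmark
    may use a different scheme.
    """
    lo, hi = values.min(), values.max()
    scaled = np.zeros_like(values, dtype=float) if hi == lo else (values - lo) / (hi - lo)
    return scaled if higher_is_better else 1.0 - scaled

# Example: the SNC column (lower is better) from the table above.
snc = np.array([0.00, 1.79, 0.23, 0.48])   # Const, Random, shap, LIME
print(normalize_for_plot(snc, higher_is_better=False))
```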
| Method | MRC ↑ | SNC ↓ | TGD ↑ | SCC ↓ | CVR ↑ |
|---|---|---|---|---|---|
| Const | 1.00 | 1.00 | 0.18 | 0.25 | 12.98 |
| Random | 0.00 | 0.00 | 0.28 | 0.37 | 16.38 |
| KNN (l2) | 0.98 | 0.62 | 0.63 | 0.65 | 16.80 |
| KNN (cos) | 1.00 | 0.63 | 0.63 | 0.65 | 16.84 |
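The KNN explainers in this table perform example selection: they return the training instances closest to the explained point under either Euclidean (l2) or cosine distance. The following is a minimal sketch of that idea using scikit-learn's `NearestNeighbors`; it is not the XAIB implementation itself, and the breast cancer data are used only to make the example runnable:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestNeighbors

# Example-selection explanation: return the k training points closest to
# the instance being explained, under a chosen distance metric.
def knn_example_explanation(X_train, x, k=3, metric="euclidean"):
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    return idx[0]   # indices of the explanatory training examples

X = load_breast_cancer().data
print(knn_example_explanation(X[:-1], X[-1], metric="euclidean"))  # KNN (l2)
print(knn_example_explanation(X[:-1], X[-1], metric="cosine"))     # KNN (cos)
```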
| Benchmark | Saliency Eval [17] | XAI-Bench [18] | OpenXAI [19] | Compare-xAI [20] | [21] | XAIB (Ours) |
|---|---|---|---|---|---|---|
| Code publication date | Sep 2020 | Jun 2021 | Jun 2022 | Mar 2022 | Dec 2022 | Oct 2022 |
| Use of ground truth | Human annotations | Synthetic | Logistic regression coefficients | No | No, pseudo, synthetic | No |
| Documentation | No | Reproduce | Use | Reproduce, add explainer, add test | Reproduce, use | Reproduce, use, add dataset, add model, add explainer, add metric |
| Explanation types | Feature importance | Feature importance | Feature importance | Feature importance | Feature importance | Feature importance, example-based |
| Data types | Text | Tabular | Tabular | Tabular | Text, image | Tabular, more will be implemented |
| ML tasks | Classification | Classification | Classification | Classification | Sentiment, classification | Classification, more will be implemented |
| Results | In paper | In paper | Online | Online | In paper | Online |
| Applicability | Specific datasets | Synthetic tests | Specific datasets/models | Hardcoded tests | Datasets and models compatible with PaddlePaddle | Any compatible datasets/models |
| Versioning | No | No | SemVer | No | SemVer (as a part of InterpretDL) | SemVer |
| Distribution | No | No | No | No | PyPI (as a part of InterpretDL) | PyPI |
| Benchmark | Saliency Eval [17] | XAI-Bench [18] | OpenXAI [19] | Compare-xAI [20] | [21] | XAIB (Ours) |
|---|---|---|---|---|---|---|
| Correctness | Yes | Yes | Yes | Yes | Yes | |
| Continuity | Yes | Yes | Yes | | | |
| Coherence | Yes | Yes | Yes | Yes | | |
| Completeness | Yes | | | | | |
| Covariate Complexity | Yes | | | | | |
| Compactness | Yes | | | | | |
| Contrastivity | Yes | Yes | | | | |
| Consistency | Yes | | | | | |
| Context | Yes | Yes | | | | |
| Confidence | | | | | | |
| Composition | | | | | | |
| Controllability | | | | | | |