Abstract
Due to their vast number of trained parameters, modern machine learning models are largely considered black boxes. Explanation methods aim to shed light on the inner workings of such models and can thus serve as debugging tools. However, recent research has demonstrated that carefully crafted manipulations of the input or the model can fool both the model and the explanation method. In this work, we briefly present our systematization of such explanation-aware attacks. We categorize them along three distinct attack types, three scopes, and three adversary capabilities. In our full paper [12], we further present a hierarchy of robustness notions and various defensive techniques tailored to explanation-aware attacks.
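To make the attack surface concrete, the following minimal sketch (in PyTorch) illustrates an input-level manipulation in the spirit of Dombrowski et al. (NeurIPS 2019): a perturbation is optimized so that a gradient-based saliency explanation is pushed toward an adversary-chosen target map while the model's prediction stays approximately unchanged. It is an illustrative sketch, not the method from the full paper; the classifier model, input x, and target map target_expl are assumed placeholders, and the loss weights and step counts are arbitrary. In practice, second-order gradients through ReLU networks are often smoothed, e.g., by replacing ReLU with softplus.

import torch
import torch.nn.functional as F

def saliency(model, x):
    # Gradient-based saliency: |d(max logit)/d(input)|, kept differentiable
    # (create_graph=True) so the attack below can optimize through it.
    if not x.requires_grad:
        x = x.clone().requires_grad_(True)
    top_logit = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(top_logit, x, create_graph=True)
    return grad.abs()

def manipulate_input(model, x, target_expl, steps=200, lr=1e-2, beta=10.0):
    # Input-level explanation-aware attack in a nutshell: find a small
    # perturbation delta that pulls the saliency map toward target_expl
    # while keeping the model's outputs (and hence its prediction) close
    # to the original ones.
    with torch.no_grad():
        orig_out = model(x)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        expl_loss = F.mse_loss(saliency(model, x_adv), target_expl)
        pred_loss = F.mse_loss(model(x_adv), orig_out)
        loss = expl_loss + beta * pred_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()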
References
[1] Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., Tapp, A.: Fairwashing: the risk of rationalization. In: Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 97 (2019)
[2] Aïvodji, U., Arai, H., Gambs, S., Hara, S.: Characterizing the risk of fairwashing. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
[3] Anders, C.J., Pasliev, P., Dombrowski, A.K., Müller, K.R., Kessel, P.: Fairwashing explanations with off-manifold detergent. In: Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 119 (2020)
[4] Baniecki, H., Biecek, P.: Adversarial attacks and defenses in explainable artificial intelligence: a survey. In: Proceedings of the IJCAI Workshop on Explainable AI (XAI) (2023)
[5] Dombrowski, A.K., Alber, M., Anders, C., Ackermann, M., Müller, K.R., Kessel, P.: Explanations can be manipulated and geometry is to blame. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
[6] Fang, S., Choromanska, A.: Backdoor attacks on the DNN interpretation system. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2022)
[7] Heo, J., Joo, S., Moon, T.: Fooling neural network interpretations via adversarial model manipulation. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
[8] Ivankay, A., Girardi, I., Frossard, P., Marchiori, C.: Fooling explanations in text classifiers. In: Proceedings of the International Conference on Learning Representations (ICLR) (2022)
[9] Lakkaraju, H., Bastani, O.: “How do I fool you?”: manipulating user trust via misleading black box explanations. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) (2020)
[10] Noppel, M., Peter, L., Wressnegger, C.: Disguising attacks with explanation-aware backdoors. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2023)
[11] Noppel, M., Wressnegger, C.: Explanation-aware backdoors in a nutshell. In: Proceedings of the German Conference on Artificial Intelligence (KI) (2023)
[12] Noppel, M., Wressnegger, C.: SoK: explainable machine learning in adversarial environments. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2024)
[13] Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) (2020)
[14] Zhang, X., Wang, N., Shen, H., Ji, S., Luo, X., Wang, T.: Interpretable deep learning under fire. In: Proceedings of the USENIX Security Symposium (2020)
Acknowledgement
The authors gratefully acknowledge funding from the German Federal Ministry of Education and Research (BMBF) under the project DataChainSec (FKZ 16KIS1700) and from the Helmholtz Association (HGF) within the topic “46.23 Engineering Secure Systems.”