Abstract
Due to their vast number of trained parameters, modern machine learning models are largely considered black boxes. Explanation methods aim to shed light on the inner workings of such models and can thus serve as debugging tools. However, recent research has demonstrated that carefully crafted manipulations of the input or the model can fool both the model and the explanation method. In this work, we briefly present our systematization of such explanation-aware attacks. We categorize them along three distinct attack types, three scopes, and three adversary capabilities. In our full paper [12], we further present a hierarchy of robustness notions and various defensive techniques tailored to explanation-aware attacks.
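To make the attack surface concrete, the following minimal sketch (in PyTorch) illustrates an input-level manipulation in the spirit of Dombrowski et al. (NeurIPS 2019): a perturbation is optimized so that a gradient-based saliency explanation is pushed toward an adversary-chosen target map while the model's prediction stays approximately unchanged. It is an illustrative sketch, not the method from the full paper; the classifier model, input x, and target map target_expl are assumed placeholders, and the loss weights and step counts are arbitrary. In practice, second-order gradients through ReLU networks are often smoothed, e.g., by replacing ReLU with softplus.

import torch
import torch.nn.functional as F

def saliency(model, x):
    # Gradient-based saliency: |d(max logit)/d(input)|, kept differentiable
    # (create_graph=True) so the attack below can optimize through it.
    if not x.requires_grad:
        x = x.clone().requires_grad_(True)
    top_logit = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(top_logit, x, create_graph=True)
    return grad.abs()

def manipulate_input(model, x, target_expl, steps=200, lr=1e-2, beta=10.0):
    # Input-level explanation-aware attack in a nutshell: find a small
    # perturbation delta that pulls the saliency map toward target_expl
    # while keeping the model's outputs (and hence its prediction) close
    # to the original ones.
    with torch.no_grad():
        orig_out = model(x)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        expl_loss = F.mse_loss(saliency(model, x_adv), target_expl)
        pred_loss = F.mse_loss(model(x_adv), orig_out)
        loss = expl_loss + beta * pred_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()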
References
[1] Aïvodji, U., Arai, H., Fortineau, O., Gambs, S., Hara, S., Tapp, A.: Fairwashing: the risk of rationalization. In: Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 97 (2019)
[2] Aïvodji, U., Arai, H., Gambs, S., Hara, S.: Characterizing the risk of fairwashing. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
[3] Anders, C.J., Pasliev, P., Dombrowski, A.K., Müller, K.R., Kessel, P.: Fairwashing explanations with off-manifold detergent. In: Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 119 (2020)
[4] Baniecki, H., Biecek, P.: Adversarial attacks and defenses in explainable artificial intelligence: a survey. In: Proceedings of the IJCAI Workshop on Explainable AI (XAI) (2023)
[5] Dombrowski, A.K., Alber, M., Anders, C., Ackermann, M., Müller, K.R., Kessel, P.: Explanations can be manipulated and geometry is to blame. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
[6] Fang, S., Choromanska, A.: Backdoor attacks on the DNN interpretation system. In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2022)
[7] Heo, J., Joo, S., Moon, T.: Fooling neural network interpretations via adversarial model manipulation. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
[8] Ivankay, A., Girardi, I., Frossard, P., Marchiori, C.: Fooling explanations in text classifiers. In: Proceedings of the International Conference on Learning Representations (ICLR) (2022)
[9] Lakkaraju, H., Bastani, O.: “How do I fool you?”: manipulating user trust via misleading black box explanations. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) (2020)
[10] Noppel, M., Peter, L., Wressnegger, C.: Disguising attacks with explanation-aware backdoors. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2023)
[11] Noppel, M., Wressnegger, C.: Explanation-aware backdoors in a nutshell. In: Proceedings of the German Conference on Artificial Intelligence (KI) (2023)
[12] Noppel, M., Wressnegger, C.: SoK: explainable machine learning in adversarial environments. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2024)
[13] Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H.: Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) (2020)
[14] Zhang, X., Wang, N., Shen, H., Ji, S., Luo, X., Wang, T.: Interpretable deep learning under fire. In: Proceedings of the USENIX Security Symposium (2020)
Acknowledgement
The authors gratefully acknowledge funding from the German Federal Ministry of Education and Research (BMBF) under the project DataChainSec (FKZ 16KIS1700) and from the Helmholtz Association (HGF) within the topic “46.23 Engineering Secure Systems.”