Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models

Sandipan Sikdar, Parantapa Bhattacharya, Kieran Heese

Abstract

We introduce Integrated Directional Gradients (IDG), a method for attributing importance scores to groups of features, indicating their relevance to the output of a neural network model for a given input. The success of deep neural networks has been attributed to their ability to capture higher-level feature interactions. Hence, in the last few years, capturing the importance of these feature interactions has received increased prominence in the ML interpretability literature. We formally define the feature group attribution problem and outline a set of axioms that any intuitive feature group attribution method should satisfy. Earlier axiomatic methods inspired by cooperative game theory borrowed axioms only from solution concepts (such as the Shapley value) for individual feature attributions and introduced their own extensions to model interactions. In contrast, our formulation is inspired by axioms satisfied by characteristic functions as well as by solution concepts in the cooperative game theory literature. We believe that characteristic functions are much better suited to modeling the importance of groups than solution concepts alone. We demonstrate that our proposed method, IDG, satisfies all the axioms. Using IDG, we analyze two state-of-the-art text classifiers on three benchmark datasets for sentiment analysis. Our experiments show that IDG is able to effectively capture semantic interactions in linguistic models via negations and conjunctions.
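
The distinction the abstract draws between characteristic functions and solution concepts can be illustrated with a small sketch. This is not the IDG method and is not taken from the paper's code; it is a toy cooperative game, with made-up coalition values for the words "not" and "good", that computes the standard Shapley value. The names `toy_values`, `v`, and `shapley_values` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Standard Shapley value of each player under characteristic function v."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        # Average p's marginal contribution over all coalitions of the others.
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (v(s | {p}) - v(s))
        phi[p] = total
    return phi


# Hypothetical toy game (values invented for illustration, not model outputs):
# "good" alone is positive, "not" alone is neutral, "not good" is negative.
toy_values = {
    frozenset(): 0.0,
    frozenset({"not"}): 0.0,
    frozenset({"good"}): 1.0,
    frozenset({"not", "good"}): -1.0,
}


def v(coalition):
    """Characteristic function: assigns a value directly to a group of players."""
    return toy_values[frozenset(coalition)]


phi = shapley_values(["not", "good"], v)
group_value = v({"not", "good"})
print(phi, group_value)
```

In this toy game the characteristic function scores the group {"not", "good"} directly, whereas the Shapley value only distributes that total across the individual words, so any score for the group has to be defined by an additional extension.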

Anthology ID: 2021.acl-long.71
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 865–878
URL: https://aclanthology.org/2021.acl-long.71
DOI: 10.18653/v1/2021.acl-long.71
Cite (ACL): Sandipan Sikdar, Parantapa Bhattacharya, and Kieran Heese. 2021. Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 865–878, Online. Association for Computational Linguistics.
Cite (Informal): Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models (Sikdar et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-long.71.pdf
Video: https://aclanthology.org/2021.acl-long.71.mp4
Code: parantapa/integrated-directional-gradients
Data: IMDb Movie Reviews, SST
