Abstract
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their “black-box” nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of “Food” and investigate how it influences the prediction of a model’s sentiment towards a restaurant review. We introduce C\(^3\)M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we show that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.
References
Abraham, E.D., et al.: Cebab: estimating the causal effects of real-world concepts on NLP model behavior. In: Advances in Neural Information Processing Systems, vol. 35, pp. 17582–17596 (2022)
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: MixMatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Bills, S., et al.: Language models can explain neurons in language models (2023). https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Cai, H., Xia, R., Yu, J.: Aspect-category-opinion-sentiment quadruple extraction with implicit aspects and opinions. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Diao, S., et al.: Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531 (2022)
Englesson, E., Azizpour, H.: Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297 (2021)
Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32(10), 4291–4308 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Kim, B., et al.: Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In: International Conference on Machine Learning, pp. 2668–2677. PMLR (2018)
Kim, E., Klinger, R.: Who feels what and why? Annotation of a literature corpus with semantic roles of emotions. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1345–1359 (2018)
Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338–5348. PMLR (2020)
Liu, Y., Cheng, H., Zhang, K.: Identifiability of label noise transition matrix. arXiv preprint arXiv:2202.02016 (2022)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Losch, M., Fritz, M., Schiele, B.: Interpretability beyond classification output: semantic bottleneck networks. arXiv preprint arXiv:1907.10882 (2019)
Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)
Madsen, A., Reddy, S., Chandar, S.: Post-hoc interpretability for neural NLP: a survey. ACM Comput. Surv. 55(8), 1–42 (2022)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Németh, R., Sik, D., Máté, F.: Machine learning of concepts hard even for humans: the case of online depression forums. Int. J. Qual. Methods 19, 1609406920949338 (2020)
Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.-W.: Label-free concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2023)
OpenAI: GPT-4 technical report (2023)
Paszke, A., et al.: Automatic differentiation in PyTorch. In: NeurIPS (2017)
Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Ross, A., Marasović, A., Peters, M.E.: Explaining NLP models via minimal contrastive editing (MiCE). In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3840–3852 (2021)
Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems, vol. 33, pp. 596–608 (2020)
Vig, J., et al.: Investigating gender bias in language models using causal mediation analysis. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12388–12401 (2020)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
Yang, J., Zhang, Y., Li, L., Li, X.: YEDDA: a lightweight collaborative text span annotation tool. arXiv preprint arXiv:1711.03759 (2017)
Yin, K., Neubig, G.: Interpreting language models with contrastive explanations. arXiv preprint arXiv:2202.10419 (2022)
Zarlenga, M.E., et al.: Concept embedding models. In: Advances in Neural Information Processing Systems, vol. 35 (2022)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zhang, W., Li, X., Deng, Y., Bing, L., Lam, W.: A survey on aspect-based sentiment analysis: tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. (2022)
Zhu, J., et al.: Incorporating BERT into neural machine translation. In: International Conference on Learning Representations (2020)
Acknowledgements
This work is supported by the National Science Foundation (NSF) under grant IIS-2229461.
Appendices
A Definitions of Training Strategies
Given a text input \(x \in \mathbb {R}^d\), concepts \(c\in \mathbb {R}^k\) and its label y, the strategies for fine-tuning the text encoder \(f_\theta \), the projector \(p_\psi \) and the label predictor \(g_\phi \) are defined as follows:
i) Vanilla fine-tuning a PLM: The concept labels are ignored, and the text encoder \(f_\theta \) and the label predictor \(g_\phi \) are fine-tuned either jointly:
$$\min_{\theta, \phi}\; L_{CE}\big(g_\phi(f_\theta(x)),\, y\big),$$
or with the text encoder \(f_\theta \) frozen:
$$\min_{\phi}\; L_{CE}\big(g_\phi(f_\theta(x)),\, y\big),$$
where \(L_{CE}\) denotes the cross-entropy loss. In this work we only consider the former option, due to its significantly better performance.
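As a minimal sketch (with toy linear modules standing in for the PLM encoder and label predictor, and illustrative dimensions), the two vanilla variants differ only in which parameters are optimized:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the text encoder f_theta and label predictor g_phi
# (hypothetical dimensions; in the paper these are a PLM and a task head).
f_theta = nn.Linear(768, 256)   # stands in for a PLM encoder
g_phi = nn.Linear(256, 5)       # 5-way sentiment label predictor

def vanilla_params(freeze_encoder: bool):
    """Return the parameters optimized under each vanilla variant."""
    if freeze_encoder:
        for p in f_theta.parameters():
            p.requires_grad = False
        return list(g_phi.parameters())   # optimize phi only
    return list(f_theta.parameters()) + list(g_phi.parameters())  # theta and phi

loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 768)                   # a batch of pooled text embeddings
y = torch.randint(0, 5, (8,))
loss = loss_fn(g_phi(f_theta(x)), y)      # L_CE(g_phi(f_theta(x)), y)
```

Either parameter set would then be passed to an optimizer such as Adam; the loss itself is identical in both variants.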
ii) Independently training the PLM with concept and task labels: The text encoder \(f_\theta \), the projector \(p_\psi \), and the label predictor \(g_\phi \) are trained separately with ground-truth concept labels and task labels:
$$\min_{\theta, \psi}\; L_{CE}\big(p_\psi(f_\theta(x)),\, c\big), \qquad \min_{\phi}\; L_{CE}\big(g_\phi(c),\, y\big).$$
During inference, the label predictor uses the output of the projector rather than the ground-truth concepts.
iii) Sequentially training the PLM with concept and task labels: We first learn the concept encoder as in the independent strategy above, and then use its output to train the label predictor:
$$\min_{\phi}\; L_{CE}\big(g_\phi(p_{\hat{\psi}}(f_{\hat{\theta}}(x))),\, y\big),$$
where \(\hat{\theta}\) and \(\hat{\psi}\) are fixed after the first stage.
iv) Jointly training the PLM with concept and task labels: The concept encoder and label predictor are learned via a weighted sum \(L_{joint}\) of the two objectives described above:
$$L_{joint} = L_{CE}\big(g_\phi(p_\psi(f_\theta(x))),\, y\big) + \gamma\, L_{CE}\big(p_\psi(f_\theta(x)),\, c\big).$$
Note that jointly trained CBE-PLMs are sensitive to the loss weight \(\gamma \). We report the most effective results here; the tested values for \(\gamma \) are given in Table 2 in Appendix D.
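The joint strategy can be sketched as follows. This is an illustrative toy setup, not the paper's implementation: the modules, dimensions, and \(\gamma\) are placeholders, and per-concept binary cross-entropy stands in for the concept loss.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the text encoder f_theta, projector p_psi,
# and label predictor g_phi; all shapes are illustrative.
f_theta = nn.Linear(768, 256)        # text encoder
p_psi = nn.Linear(256, 4)            # projector: hidden state -> k=4 concept logits
g_phi = nn.Linear(4, 5)              # label predictor on top of the concepts

ce = nn.CrossEntropyLoss()
gamma = 0.5                          # loss weight; jointly trained CBE-PLMs are sensitive to it

x = torch.randn(8, 768)              # batch of pooled text embeddings
c = torch.randint(0, 2, (8, 4)).float()  # (binary) concept labels
y = torch.randint(0, 5, (8,))        # task labels

concept_logits = p_psi(f_theta(x))
# L_joint = task cross-entropy + gamma * concept loss
loss_joint = ce(g_phi(torch.sigmoid(concept_logits)), y) \
    + gamma * nn.functional.binary_cross_entropy_with_logits(concept_logits, c)
loss_joint.backward()                # gradients flow through theta, psi, and phi
```

A single backward pass updates all three components, which is what distinguishes joint training from the independent and sequential strategies above.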
B Details of the Manual Concept Annotation for the IMDB Dataset
Our annotation policy follows a previous work [5] on annotating NLP datasets. For the IMDB-C dataset, we manually annotate the four concepts (Acting, Storyline, Emotional Arousal, Cinematography). Although the concepts are naturally understandable to humans, two Master's students familiar with sentiment analysis were selected as annotators and labeled the data independently with the annotation tool introduced by [30]. The strict quadruple matching F1 score between the two annotators is \(85.74\%\), indicating consistent agreement [13]. In cases of disagreement, a third expert makes the final decision.
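Under strict matching, an annotation counts as agreeing only if every element of the tuple matches exactly. A minimal sketch of the agreement score (with hypothetical `(doc_id, concept, label)` tuples; the actual annotations are quadruples):

```python
def strict_match_f1(annos_a, annos_b):
    """Strict matching F1 between two annotators' label sets:
    an item counts only if the whole tuple matches exactly."""
    a, b = set(annos_a), set(annos_b)
    if not a or not b:
        return 0.0
    tp = len(a & b)                          # exact-match true positives
    precision, recall = tp / len(a), tp / len(b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (doc_id, concept, label)
ann1 = [(1, "Acting", "positive"), (1, "Storyline", "negative"),
        (2, "Cinematography", "positive")]
ann2 = [(1, "Acting", "positive"), (1, "Storyline", "unknown"),
        (2, "Cinematography", "positive")]
print(strict_match_f1(ann1, ann2))  # 2 of 3 items match exactly -> F1 = 2/3
```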
C Implementation Detail
In this section, we provide more details on the implementation settings of our experiments. We implement our framework with PyTorch [24] and HuggingFace Transformers [29], and train it on a single Nvidia A100 GPU with 80 GB of memory. We follow a prior work [1] for the backbone implementation. All backbone models use a maximum token length of 512 and a batch size of 8. We use the Adam optimizer to update the backbone, projector, and label predictor as described in Sect. 3.1. The values of the remaining hyperparameters (Table 2 in Appendix D) for each PLM type are determined through grid search.
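The grid search amounts to exhaustively evaluating every combination of candidate hyperparameter values and keeping the best one. A minimal sketch (the search space and the toy evaluator below are illustrative; the actual grids appear in Table 2):

```python
from itertools import product

# Hypothetical search space; the paper's actual grids are in Table 2.
grid = {"lr": [1e-5, 2e-5, 5e-5], "gamma": [0.1, 0.5, 1.0]}

def grid_search(evaluate, grid):
    """Return the config with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate(cfg)                # e.g. validation accuracy after fine-tuning
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy evaluator standing in for "fine-tune, then measure validation accuracy";
# it peaks at lr=2e-5, gamma=0.5 by construction.
best, score = grid_search(
    lambda cfg: -abs(cfg["lr"] - 2e-5) - abs(cfg["gamma"] - 0.5), grid)
```

In practice `evaluate` would fine-tune the CBE-PLM with the given configuration and report a held-out metric.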
D Parameters and Notations
In this section, we list the notations used in this paper along with their descriptions for comprehensive understanding. We also list their experimented values and the optimal ones, as shown in Table 2.
E Statistics of Data Splits
We report the statistics and split policies of the experimented datasets, including the source concept dataset \(\mathcal {D}_s\), the unlabeled concept dataset \(\mathcal {D}_u\), and their augmented versions. The details are presented in Table 3.
F Statistics of Concepts in Transformed Datasets
The statistics and split policies of the transformed versions of the experimented datasets are presented in Table 4.
G More Results on Explainable Predictions
Case studies on explainable predictions for the CEBaB and IMDB-C datasets are given in Fig. 5 and Fig. 6, respectively.
H A Case Study on Test-Time Intervention
We present a case study of test-time intervention using an example from the transformed unlabeled concept data \(\tilde{\mathcal {D}}_u\) of the CEBaB dataset, as shown in Fig. 7. The first row displays the target concept labels generated by ChatGPT. The second row shows the predictions of the trained CBE-PLM, which mispredicts two concepts ("Waiting time" and "Waiting area"). The third row demonstrates test-time intervention with ChatGPT as the oracle, which corrects the predicted task label. Finally, the fourth row applies test-time intervention with a human oracle, rectifying the concept that ChatGPT originally mislabeled.
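Mechanically, test-time intervention amounts to overwriting selected entries of the predicted concept activations with oracle-provided values before the label predictor runs. A minimal sketch (module shapes, activation values, and the intervened indices are all illustrative):

```python
import torch
import torch.nn as nn

# Predicted concept activations from the projector (one example, 4 concepts);
# values and the 4->5 label predictor below are toy placeholders.
p_out = torch.tensor([[0.9, 0.1, 0.8, 0.2]])
g_phi = nn.Linear(4, 5)          # label predictor on top of the concepts

# The oracle (human or ChatGPT) corrects concepts 1 and 3.
oracle = {1: 1.0, 3: 1.0}

def intervene(concepts, corrections):
    """Overwrite selected concept activations with oracle values."""
    fixed = concepts.clone()     # leave the original predictions untouched
    for idx, value in corrections.items():
        fixed[:, idx] = value
    return fixed

y_before = g_phi(p_out).argmax(-1)
y_after = g_phi(intervene(p_out, oracle)).argmax(-1)  # task prediction may change
```

Because the label predictor consumes only the concept layer, correcting a concept immediately propagates to the task prediction, with no retraining.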
I Examples of Querying ChatGPT
In this paper, we query ChatGPT for 1) augmenting the concept set and 2) annotating missing concept labels. Note that in practice, we query ChatGPT (GPT-4) via the OpenAI API; here we demonstrate examples from the ChatGPT (GPT-4) GUI for better illustration. The illustrations are given in Fig. 8 and Fig. 9.
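A minimal sketch of the prompt-and-parse flow for concept annotation is below. The concept list, prompt wording, and JSON answer format are illustrative, not the paper's exact template; the API call itself is only indicated in a comment.

```python
import json

CONCEPTS = ["Food", "Service", "Waiting time", "Waiting area"]  # CEBaB-style concepts

def build_prompt(review: str) -> str:
    """Ask the model to label each concept; wording is illustrative."""
    return (
        "For the restaurant review below, label each concept as "
        "positive, negative, or unknown. Answer in JSON.\n"
        f"Concepts: {', '.join(CONCEPTS)}\nReview: {review}"
    )

def parse_labels(response_text: str) -> dict:
    """Parse the model's JSON answer, defaulting missing concepts to 'unknown'."""
    labels = json.loads(response_text)
    return {c: labels.get(c, "unknown") for c in CONCEPTS}

# In practice, build_prompt(review) would be sent as a chat message via the
# OpenAI client (e.g. client.chat.completions.create(model="gpt-4", ...)),
# and the reply text fed to parse_labels.
reply = '{"Food": "positive", "Service": "negative"}'
labels = parse_labels(reply)
```

Defaulting unanswered concepts to "unknown" keeps the label vector well-formed even when the model's reply is incomplete.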
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tan, Z., Cheng, L., Wang, S., Yuan, B., Li, J., Liu, H. (2024). Interpreting Pretrained Language Models via Concept Bottlenecks. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science(), vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_5
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4