Abstract
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their “black-box” nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of “Food” and investigate how it influences the prediction of a model’s sentiment towards a restaurant review. We introduce C\(^3\)M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we show that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.
References
Abraham, E.D., et al.: Cebab: estimating the causal effects of real-world concepts on NLP model behavior. In: Advances in Neural Information Processing Systems, vol. 35, pp. 17582–17596 (2022)
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: MixMatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Bills, S., et al.: Language models can explain neurons in language models (2023). https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
Cai, H., Xia, R., Yu, J.: Aspect-category-opinion-sentiment quadruple extraction with implicit aspects and opinions. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Diao, S., et al.: Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531 (2022)
Englesson, E., Azizpour, H.: Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297 (2021)
Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32(10), 4291–4308 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Kim, B., et al.: Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In: International Conference on Machine Learning, pp. 2668–2677. PMLR (2018)
Kim, E., Klinger, R.: Who feels what and why? Annotation of a literature corpus with semantic roles of emotions. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1345–1359 (2018)
Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338–5348. PMLR (2020)
Liu, Y., Cheng, H., Zhang, K.: Identifiability of label noise transition matrix. arXiv preprint arXiv:2202.02016 (2022)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Losch, M., Fritz, M., Schiele, B.: Interpretability beyond classification output: semantic bottleneck networks. arXiv preprint arXiv:1907.10882 (2019)
Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)
Madsen, A., Reddy, S., Chandar, S.: Post-hoc interpretability for neural NLP: a survey. ACM Comput. Surv. 55(8), 1–42 (2022)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Németh, R., Sik, D., Máté, F.: Machine learning of concepts hard even for humans: the case of online depression forums. Int. J. Qual. Methods 19, 1609406920949338 (2020)
Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.-W.: Label-free concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2023)
OpenAI: GPT-4 technical report (2023)
Paszke, A., et al.: Automatic differentiation in PyTorch. In: NeurIPS (2017)
Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Ross, A., Marasović, A., Peters, M.E.: Explaining NLP models via minimal contrastive editing (MiCE). In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3840–3852 (2021)
Sohn, K., et al.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems, vol. 33, pp. 596–608 (2020)
Vig, J., et al.: Investigating gender bias in language models using causal mediation analysis. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12388–12401 (2020)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
Yang, J., Zhang, Y., Li, L., Li, X.: YEDDA: a lightweight collaborative text span annotation tool. arXiv preprint arXiv:1711.03759 (2017)
Yin, K., Neubig, G.: Interpreting language models with contrastive explanations. arXiv preprint arXiv:2202.10419 (2022)
Zarlenga, M.E., et al.: Concept embedding models. In: Advances in Neural Information Processing Systems, vol. 35 (2022)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zhang, W., Li, X., Deng, Y., Bing, L., Lam, W.: A survey on aspect-based sentiment analysis: tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. (2022)
Zhu, J., et al.: Incorporating BERT into neural machine translation. In: International Conference on Learning Representations (2020)
Acknowledgements
This work is supported by the National Science Foundation (NSF) under grant IIS-2229461.
Appendices
A Definitions of Training Strategies
Given a text input \(x \in \mathbb {R}^d\), concepts \(c\in \mathbb {R}^k\) and its label y, the strategies for fine-tuning the text encoder \(f_\theta \), the projector \(p_\psi \) and the label predictor \(g_\phi \) are defined as follows:
i) Vanilla fine-tuning a PLM: The concept labels are ignored, and the text encoder \(f_\theta \) and the label predictor \(g_\phi \) are fine-tuned either jointly:
$$\min_{\theta, \phi}\; L_{CE}\big(g_\phi(f_\theta(x)),\, y\big),$$
or with the text encoder \(f_\theta \) frozen:
$$\min_{\phi}\; L_{CE}\big(g_\phi(f_\theta(x)),\, y\big),$$
where \(L_{CE}\) denotes the cross-entropy loss. In this work we only consider the former option, due to its significantly better performance.
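As a minimal sketch (with toy linear modules standing in for the PLM encoder and label predictor, and illustrative dimensions), the two vanilla variants differ only in which parameters are optimized:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the text encoder f_theta and label predictor g_phi
# (hypothetical dimensions; in the paper these are a PLM and a task head).
f_theta = nn.Linear(768, 256)   # stands in for a PLM encoder
g_phi = nn.Linear(256, 5)       # 5-way sentiment label predictor

def vanilla_params(freeze_encoder: bool):
    """Return the parameters optimized under each vanilla variant."""
    if freeze_encoder:
        for p in f_theta.parameters():
            p.requires_grad = False
        return list(g_phi.parameters())   # optimize phi only
    return list(f_theta.parameters()) + list(g_phi.parameters())  # theta and phi

loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 768)                   # a batch of pooled text embeddings
y = torch.randint(0, 5, (8,))
loss = loss_fn(g_phi(f_theta(x)), y)      # L_CE(g_phi(f_theta(x)), y)
```

Either parameter set would then be passed to an optimizer such as Adam; the loss itself is identical in both variants.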
ii) Independently training the PLM with concept and task labels: The text encoder \(f_\theta \), the projector \(p_\psi \), and the label predictor \(g_\phi \) are trained separately with ground-truth concept labels and task labels:
$$\min_{\theta, \psi}\; L_{CE}\big(p_\psi(f_\theta(x)),\, c\big), \qquad \min_{\phi}\; L_{CE}\big(g_\phi(c),\, y\big).$$
During inference, the label predictor uses the output of the projector rather than the ground-truth concepts.
iii) Sequentially training the PLM with concept and task labels: We first learn the concept encoder as in the independent strategy above, and then use its output to train the label predictor:
$$\min_{\phi}\; L_{CE}\big(g_\phi(p_{\hat{\psi}}(f_{\hat{\theta}}(x))),\, y\big),$$
where \(\hat{\theta}\) and \(\hat{\psi}\) are fixed after the first stage.
iv) Jointly training the PLM with concept and task labels: The concept encoder and label predictor are learned via a weighted sum \(L_{joint}\) of the two objectives described above:
$$L_{joint} = L_{CE}\big(g_\phi(p_\psi(f_\theta(x))),\, y\big) + \gamma\, L_{CE}\big(p_\psi(f_\theta(x)),\, c\big).$$
Note that jointly trained CBE-PLMs are sensitive to the loss weight \(\gamma \). We report the most effective results here; the tested values for \(\gamma \) are given in Table 2 in Appendix D.
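The joint strategy can be sketched as follows. This is an illustrative toy setup, not the paper's implementation: the modules, dimensions, and \(\gamma\) are placeholders, and per-concept binary cross-entropy stands in for the concept loss.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the text encoder f_theta, projector p_psi,
# and label predictor g_phi; all shapes are illustrative.
f_theta = nn.Linear(768, 256)        # text encoder
p_psi = nn.Linear(256, 4)            # projector: hidden state -> k=4 concept logits
g_phi = nn.Linear(4, 5)              # label predictor on top of the concepts

ce = nn.CrossEntropyLoss()
gamma = 0.5                          # loss weight; jointly trained CBE-PLMs are sensitive to it

x = torch.randn(8, 768)              # batch of pooled text embeddings
c = torch.randint(0, 2, (8, 4)).float()  # (binary) concept labels
y = torch.randint(0, 5, (8,))        # task labels

concept_logits = p_psi(f_theta(x))
# L_joint = task cross-entropy + gamma * concept loss
loss_joint = ce(g_phi(torch.sigmoid(concept_logits)), y) \
    + gamma * nn.functional.binary_cross_entropy_with_logits(concept_logits, c)
loss_joint.backward()                # gradients flow through theta, psi, and phi
```

A single backward pass updates all three components, which is what distinguishes joint training from the independent and sequential strategies above.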
B Details of the Manual Concept Annotation for the IMDB Dataset
Our annotation policy follows a previous work [5] on annotating NLP datasets. For the IMDB-C dataset, we manually annotate the four concepts (Acting, Storyline, Emotional Arousal, Cinematography). Although the concepts are naturally understandable to humans, two Master's students familiar with sentiment analysis were selected as annotators and labeled the data independently with the annotation tool introduced by [30]. The strict quadruple matching F1 score between the two annotators is \(85.74\%\), indicating consistent agreement [13]. In cases of disagreement, a third expert makes the final decision.
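Under strict matching, an annotation counts as agreeing only if every element of the tuple matches exactly. A minimal sketch of the agreement score (with hypothetical `(doc_id, concept, label)` tuples; the actual annotations are quadruples):

```python
def strict_match_f1(annos_a, annos_b):
    """Strict matching F1 between two annotators' label sets:
    an item counts only if the whole tuple matches exactly."""
    a, b = set(annos_a), set(annos_b)
    if not a or not b:
        return 0.0
    tp = len(a & b)                          # exact-match true positives
    precision, recall = tp / len(a), tp / len(b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (doc_id, concept, label)
ann1 = [(1, "Acting", "positive"), (1, "Storyline", "negative"),
        (2, "Cinematography", "positive")]
ann2 = [(1, "Acting", "positive"), (1, "Storyline", "unknown"),
        (2, "Cinematography", "positive")]
print(strict_match_f1(ann1, ann2))  # 2 of 3 items match exactly -> F1 = 2/3
```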
C Implementation Detail
In this section, we provide more details on the implementation settings of our experiments. We implement our framework with PyTorch [24] and HuggingFace Transformers [29], and train it on a single Nvidia A100 GPU with 80 GB of memory. We follow a prior work [1] for the backbone implementation. All backbone models use a maximum token length of 512 and a batch size of 8. We use the Adam optimizer to update the backbone, projector, and label predictor as described in Sect. 3.1. The values of the remaining hyperparameters (Table 2 in Appendix D) for each PLM type are determined through grid search.
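The grid search amounts to exhaustively evaluating every combination of candidate hyperparameter values and keeping the best one. A minimal sketch (the search space and the toy evaluator below are illustrative; the actual grids appear in Table 2):

```python
from itertools import product

# Hypothetical search space; the paper's actual grids are in Table 2.
grid = {"lr": [1e-5, 2e-5, 5e-5], "gamma": [0.1, 0.5, 1.0]}

def grid_search(evaluate, grid):
    """Return the config with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate(cfg)                # e.g. validation accuracy after fine-tuning
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy evaluator standing in for "fine-tune, then measure validation accuracy";
# it peaks at lr=2e-5, gamma=0.5 by construction.
best, score = grid_search(
    lambda cfg: -abs(cfg["lr"] - 2e-5) - abs(cfg["gamma"] - 0.5), grid)
```

In practice `evaluate` would fine-tune the CBE-PLM with the given configuration and report a held-out metric.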
D Parameters and Notations
In this section, we list the notations used in this paper along with their descriptions for comprehensive understanding. We also list their experimented values and the optimal ones, as shown in Table 2.
E Statistics of Data Splits
We report the statistics and split policies of the experimented datasets, including the source concept dataset \(\mathcal {D}_s\), the unlabeled concept dataset \(\mathcal {D}_u\), and their augmented versions. The details are presented in Table 3.
F Statistics of Concepts in Transformed Datasets
The statistics and split policies of the transformed versions of the experimented datasets are presented in Table 4.
G More Results on Explainable Predictions
Case studies on explainable predictions for the CEBaB and IMDB-C datasets are given in Fig. 5 and Fig. 6, respectively.
H A Case Study on Test-Time Intervention
We present a case study of test-time intervention using an example from the transformed unlabeled concept data \(\tilde{\mathcal {D}}_u\) of the CEBaB dataset, as shown in Fig. 7. The first row displays the target concept labels generated by ChatGPT. The second row shows the predictions of the trained CBE-PLM, which mispredicts two concepts ("Waiting time" and "Waiting area"). The third row demonstrates test-time intervention with ChatGPT as the oracle, which corrects the predicted task label. Finally, the fourth row applies test-time intervention with a human oracle, rectifying the concept that ChatGPT originally mislabeled.
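Mechanically, test-time intervention amounts to overwriting selected entries of the predicted concept activations with oracle-provided values before the label predictor runs. A minimal sketch (module shapes, activation values, and the intervened indices are all illustrative):

```python
import torch
import torch.nn as nn

# Predicted concept activations from the projector (one example, 4 concepts);
# values and the 4->5 label predictor below are toy placeholders.
p_out = torch.tensor([[0.9, 0.1, 0.8, 0.2]])
g_phi = nn.Linear(4, 5)          # label predictor on top of the concepts

# The oracle (human or ChatGPT) corrects concepts 1 and 3.
oracle = {1: 1.0, 3: 1.0}

def intervene(concepts, corrections):
    """Overwrite selected concept activations with oracle values."""
    fixed = concepts.clone()     # leave the original predictions untouched
    for idx, value in corrections.items():
        fixed[:, idx] = value
    return fixed

y_before = g_phi(p_out).argmax(-1)
y_after = g_phi(intervene(p_out, oracle)).argmax(-1)  # task prediction may change
```

Because the label predictor consumes only the concept layer, correcting a concept immediately propagates to the task prediction, with no retraining.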
I Examples of Querying ChatGPT
In this paper, we query ChatGPT for 1) augmenting the concept set and 2) annotating missing concept labels. Note that in practice, we query ChatGPT (GPT-4) via the OpenAI API; here we demonstrate examples from the ChatGPT (GPT-4) GUI for better illustration. The illustrations are given in Fig. 8 and Fig. 9.
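A minimal sketch of the prompt-and-parse flow for concept annotation is below. The concept list, prompt wording, and JSON answer format are illustrative, not the paper's exact template; the API call itself is only indicated in a comment.

```python
import json

CONCEPTS = ["Food", "Service", "Waiting time", "Waiting area"]  # CEBaB-style concepts

def build_prompt(review: str) -> str:
    """Ask the model to label each concept; wording is illustrative."""
    return (
        "For the restaurant review below, label each concept as "
        "positive, negative, or unknown. Answer in JSON.\n"
        f"Concepts: {', '.join(CONCEPTS)}\nReview: {review}"
    )

def parse_labels(response_text: str) -> dict:
    """Parse the model's JSON answer, defaulting missing concepts to 'unknown'."""
    labels = json.loads(response_text)
    return {c: labels.get(c, "unknown") for c in CONCEPTS}

# In practice, build_prompt(review) would be sent as a chat message via the
# OpenAI client (e.g. client.chat.completions.create(model="gpt-4", ...)),
# and the reply text fed to parse_labels.
reply = '{"Food": "positive", "Service": "negative"}'
labels = parse_labels(reply)
```

Defaulting unanswered concepts to "unknown" keeps the label vector well-formed even when the model's reply is incomplete.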
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tan, Z., Cheng, L., Wang, S., Yuan, B., Li, J., Liu, H. (2024). Interpreting Pretrained Language Models via Concept Bottlenecks. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science(), vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_5
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4