
DOI: 10.1145/3442381.3449988

Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks

Published: 03 June 2021

Abstract

We study the problem of incorporating prior knowledge into a deep Transformer-based model, i.e., Bidirectional Encoder Representations from Transformers (BERT), to enhance its performance on semantic textual matching tasks. By probing and analyzing what BERT already knows when solving this task, we gain a better understanding of what task-specific knowledge BERT needs most and where it is needed most. This analysis further motivates us to take a different approach from most existing work. Instead of using prior knowledge to create a new training task for fine-tuning BERT, we directly inject knowledge into BERT's multi-head attention mechanism. The result is a simple yet effective approach with a fast training stage, as it spares the model from training on additional data or tasks beyond the main task. Extensive experiments demonstrate that the proposed knowledge-enhanced BERT consistently improves semantic textual matching performance over the original BERT model, and the benefit is most salient when training data is scarce.
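The abstract describes injecting prior knowledge directly into BERT's multi-head attention rather than formulating an auxiliary fine-tuning task. As a rough illustration of that general idea (a minimal sketch, not the authors' exact formulation), the snippet below adds a token-pair prior, such as a word-similarity matrix derived from a lexical resource, to the attention logits before the softmax. The function name, the `prior` input, and the weighting factor `alpha` are assumptions introduced for illustration only.

```python
import torch
import torch.nn.functional as F

def attention_with_prior(Q, K, V, prior, alpha=1.0):
    """Scaled dot-product attention with an additive prior on the logits.

    Q, K, V : (batch, heads, seq_len, head_dim) tensors, as in a standard
              Transformer/BERT self-attention layer.
    prior   : (batch, seq_len, seq_len) token-pair scores, e.g. a word
              similarity matrix (hypothetical preprocessing, not shown here).
    alpha   : scalar controlling how strongly the prior biases attention.
    """
    d_k = Q.size(-1)
    # Standard scaled dot-product attention logits.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5   # (B, H, L, L)
    # Inject the prior as an additive bias, shared across all heads.
    scores = scores + alpha * prior.unsqueeze(1)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
```

Adding the prior to the logits (rather than to the softmax output) keeps the attention weights a valid probability distribution, and setting alpha to zero recovers ordinary BERT attention, which makes this kind of injection straightforward to ablate.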



    Published In

    WWW '21: Proceedings of the Web Conference 2021
    April 2021
    4054 pages
    ISBN: 9781450383127
    DOI: 10.1145/3442381


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. BERT
    2. Deep Neural Networks
    3. Prior Knowledge
    4. Semantic Textual Similarity

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '21: The Web Conference 2021
    April 19 - 23, 2021
    Ljubljana, Slovenia

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Downloads (last 12 months): 65
    • Downloads (last 6 weeks): 8

    Reflects downloads up to 03 Oct 2024

    Cited By

    • (2024) Second-Order Text Matching Algorithm for Agricultural Text. Applied Sciences 14(16), 7012. DOI: 10.3390/app14167012. Online: 9 Aug 2024
    • (2024) Local and Global: Text Matching Via Syntax Graph Calibration. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11571-11575. DOI: 10.1109/ICASSP48485.2024.10446461. Online: 14 Apr 2024
    • (2024) Enhancing inter-sentence attention for Semantic Textual Similarity. Information Processing and Management 61(1). DOI: 10.1016/j.ipm.2023.103535. Online: 1 Feb 2024
    • (2024) Text semantic matching algorithm based on the introduction of external knowledge under contrastive learning. International Journal of Machine Learning and Cybernetics. DOI: 10.1007/s13042-024-02285-2. Online: 24 Jul 2024
    • (2023) Improved Text Matching Model Based on BERT. Frontiers in Computing and Intelligent Systems 2(3), 40-43. DOI: 10.54097/fcis.v2i3.5209. Online: 13 Feb 2023
    • (2023) Software Subclassification Based on BERTopic-BERT-BiLSTM Model. Electronics 12(18), 3798. DOI: 10.3390/electronics12183798. Online: 8 Sep 2023
    • (2023) The Analysis of Synonymy and Antonymy in Discourse Relations: An Interpretable Modeling Approach. Computational Linguistics 49(2), 429-464. DOI: 10.1162/coli_a_00477. Online: 1 Jun 2023
    • (2023) NIR-Prompt: A Multi-task Generalized Neural Information Retrieval Training Framework. ACM Transactions on Information Systems 42(2), 1-32. DOI: 10.1145/3626092. Online: 8 Nov 2023
    • (2023) Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise Attention and Gaussian Mixture Model. 2023 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN54540.2023.10191265. Online: 18 Jun 2023
    • (2023) Dual Path Modeling for Semantic Matching by Perceiving Subtle Conflicts. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10096590. Online: 4 Jun 2023
