Abstract
Transformers have been widely applied to natural language processing (NLP) tasks: thanks to multi-head attention and the position-wise feed-forward network, they capture dependencies across an entire sentence with high parallelizability. However, both of these components are position-independent, which leaves Transformers weak at modeling sentence structure. Existing studies commonly use positional encodings or mask strategies to capture the structural information of sentences. In this paper, we strengthen the ability of Transformers to model the linear structure of sentences from three aspects: the absolute position of tokens, the relative distance between tokens, and the direction between tokens. We propose a bidirectional Transformer with absolute-position aware relative position encoding (BiAR-Transformer) that combines positional encoding with a mask strategy. The relative distance between tokens is modeled jointly with their absolute positions by a novel absolute-position aware relative position encoding, while a bidirectional mask strategy models the direction between tokens. Experimental results on natural language inference, paraphrase identification, sentiment classification, and machine translation tasks show that BiAR-Transformer outperforms strong baselines.
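To make the two mechanisms in the abstract concrete, the following minimal NumPy sketch illustrates one plausible reading of them: a relative-position term added to the attention logits, queries and keys conditioned on absolute position embeddings, and a pair of forward/backward masks whose outputs are concatenated. The function name `directional_attention`, the additive `x + abs_pos` formulation, and the distance-clipping scheme are illustrative assumptions for this sketch, not the paper's exact equations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directional_attention(x, w_q, w_k, w_v, abs_pos, rel_emb, max_dist, forward=True):
    """Single-head self-attention with a (hypothetical) absolute-position aware
    relative position term and a one-directional mask.

    x:       (n, d) token representations
    abs_pos: (n, d) absolute position embeddings (added to queries/keys so the
             relative term is conditioned on absolute position -- an assumption)
    rel_emb: (2*max_dist+1, d) embeddings of clipped relative distances
    """
    n, d = x.shape
    q = (x + abs_pos) @ w_q          # absolute-position aware queries
    k = (x + abs_pos) @ w_k
    v = x @ w_v

    # relative distance matrix, clipped to [-max_dist, max_dist]
    idx = np.arange(n)
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    r = rel_emb[rel]                 # (n, n, d) relative position embeddings

    # content-content term plus content-relative-position term
    logits = (q @ k.T + np.einsum("id,ijd->ij", q, r)) / np.sqrt(d)

    # directional mask: the forward head attends to current and preceding
    # tokens, the backward head to current and following tokens
    mask = np.tril(np.ones((n, n))) if forward else np.triu(np.ones((n, n)))
    logits = np.where(mask > 0, logits, -1e9)

    return softmax(logits) @ v

# Toy usage: combine a forward and a backward head (bidirectional masking).
rng = np.random.default_rng(0)
n, d, max_dist = 5, 8, 4
x = rng.normal(size=(n, d))
abs_pos = rng.normal(size=(n, d))
rel_emb = rng.normal(size=(2 * max_dist + 1, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fwd = directional_attention(x, w_q, w_k, w_v, abs_pos, rel_emb, max_dist, forward=True)
bwd = directional_attention(x, w_q, w_k, w_v, abs_pos, rel_emb, max_dist, forward=False)
out = np.concatenate([fwd, bwd], axis=-1)   # (n, 2*d) bidirectional representation
```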
Acknowledgements
This work was supported by the Key Development Program of the Ministry of Science and Technology (2019YFF0303003), the National Natural Science Foundation of China (Grant No. 61976068) and the "Hundreds, Millions" Engineering Science and Technology Major Special Project of Heilongjiang Province (2020ZX14A02).
Author information
Le Qi is a PhD student in the Research Center for Social Computing and Information Retrieval, School of Computer Science and Technology, Harbin Institute of Technology, China. His research interests lie in semantic text matching and question answering systems.
Yu Zhang is a professor in the Research Center for Social Computing and Information Retrieval, School of Computer Science and Technology, Harbin Institute of Technology, China. His primary research interests are question answering and personalized information retrieval.
Ting Liu is a professor in the Research Center for Social Computing and Information Retrieval, School of Computer Science and Technology, Harbin Institute of Technology, China. His primary research interests are natural language processing, information retrieval, and social computing.
Cite this article
Qi, L., Zhang, Y. & Liu, T. Bidirectional Transformer with absolute-position aware relative position encoding for encoding sentences. Front. Comput. Sci. 17, 171301 (2023). https://doi.org/10.1007/s11704-022-0610-2