CWPC_BiAtt: Character–Word–Position Combined BiLSTM-Attention for Chinese Named Entity Recognition
Figure 1. The overall structure of Character–Word–Position Combined Bidirectional Long Short-Term Memory-Attention (CWPC_BiAtt) is divided into four layers: (1) comprehensive-embedding, composed of character-embedding, word-embedding, and position-embedding; (2) a Bidirectional Long Short-Term Memory (BiLSTM) layer, which captures past and future information in the sentence; (3) a self-attention layer, which captures connections between arbitrary positions in the sentence and normalizes the attention output with softmax; and (4) a conditional random field (CRF) layer, which first applies the tanh function to quickly adjust the gaps between features and then performs the tagging. In tags such as B-LOC and E-LOC, B denotes Begin, E denotes End, and LOC denotes Location.
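Since the caption only names the four layers, the following is a minimal PyTorch sketch of that pipeline. The layer sizes, the use of `nn.MultiheadAttention` in place of the paper's specific self-attention variants, and the stand-in tanh + linear emission head are simplifying assumptions, not the authors' exact implementation; a CRF would be applied on top of the returned emission scores.

```python
import torch
import torch.nn as nn

class CWPCBiAttSketch(nn.Module):
    """Illustrative four-layer pipeline; all sizes are assumptions."""

    def __init__(self, emb_dim=100, hidden=128, num_tags=7, heads=4):
        super().__init__()
        # (1) comprehensive-embedding: char + word + position, concatenated
        input_dim = 3 * emb_dim
        # (2) BiLSTM captures past and future context
        self.bilstm = nn.LSTM(input_dim, hidden, batch_first=True,
                              bidirectional=True)
        # (3) self-attention connects arbitrary positions (softmax inside)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # (4) tanh then a linear layer yields per-tag emission scores
        self.proj = nn.Linear(2 * hidden, num_tags)

    def forward(self, comp_emb):           # comp_emb: (batch, seq, 3*emb_dim)
        h, _ = self.bilstm(comp_emb)
        a, _ = self.attn(h, h, h)          # attend across all positions
        return self.proj(torch.tanh(a))    # emissions for CRF tagging

emissions = CWPCBiAttSketch()(torch.randn(2, 5, 300))
print(emissions.shape)                     # torch.Size([2, 5, 7])
```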
Figure 2. To illustrate comprehensive-embedding, assume "ABCDE" is a Chinese sentence that becomes "AB/CD/E" after word segmentation, where AB, CD, and E are the segmented words. The sentence then consists of five comprehensive-embeddings: (1) character (A) + word (AB) + pos (first position); (2) character (B) + word (AB) + pos (second position); (3) character (C) + word (CD) + pos (third position); (4) character (D) + word (CD) + pos (fourth position); (5) character (E) + word (E) + pos (fifth position).
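A small self-contained sketch of how these five comprehensive-embeddings could be assembled for "AB/CD/E"; the vocabularies, embedding size, and lookup tables below are toy assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy vocabularies for the "AB/CD/E" example.
chars, words = ["A", "B", "C", "D", "E"], ["AB", "CD", "E"]
char_id = {c: i for i, c in enumerate(chars)}
word_id = {w: i for i, w in enumerate(words)}

dim = 4  # tiny embedding size for the demo
char_emb = nn.Embedding(len(chars), dim)
word_emb = nn.Embedding(len(words), dim)
pos_emb = nn.Embedding(len(chars), dim)  # one index per character position

# Each character is paired with the segmented word that contains it.
pairs = [("A", "AB"), ("B", "AB"), ("C", "CD"), ("D", "CD"), ("E", "E")]
comprehensive = torch.stack([
    torch.cat([char_emb(torch.tensor(char_id[c])),   # character-embedding
               word_emb(torch.tensor(word_id[w])),   # word-embedding
               pos_emb(torch.tensor(i))])            # position-embedding
    for i, (c, w) in enumerate(pairs)
])
print(comprehensive.shape)  # torch.Size([5, 12]): five characters, 3*dim each
```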
Figure 3. The structure of Long Short-Term Memory (LSTM).
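For reference, these are the standard LSTM gate equations that the figure depicts (following Hochreiter and Schmidhuber's formulation, cited in the reference list; $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```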
Figure 4. Experiments were performed on sentences of different lengths. As sentence length increased, the results of the bidirectional gated recurrent unit (BiGRU) worsened, although its F-score remained above 0.9090. Comprehensive-embedding carries enough information to compensate for BiGRU's weakness on long sentences. Our CWPC_BiAtt model is very stable, with an F-score holding at 0.9298–0.9303.
Figure 5. With the local self-attention window set to 2w + 1, the maximum value is obtained at w = 3. When w > 3, more information is captured, but over-fitting gradually appears and both precision and recall begin to decrease.
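One way to realize such a 2w + 1 local window is a banded mask over the attention scores, as in the sketch below; the function name and the random scores are illustrative assumptions, not the paper's code.

```python
import torch

def local_attention_mask(seq_len: int, w: int) -> torch.Tensor:
    """True where position i may attend to position j, i.e., |i - j| <= w,
    giving each position a window of 2w + 1 neighbours (w = 3 was best)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= w

scores = torch.randn(10, 10)  # raw attention scores for a 10-token sentence
banded = scores.masked_fill(~local_attention_mask(10, 3), float("-inf"))
weights = torch.softmax(banded, dim=-1)  # each row sums to 1 inside its band
```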
Figure 6. As shown by CWPC_BiAtt and the other experiments, the results obtained with comprehensive-embedding are better than those obtained by selectively using char-embedding, word-embedding, and pos-embedding. The F1-score of CWPC_BiAtt is 1.2% higher than that of (char + word), because comprehensive-embedding provides an index for the sentence that records the positional relations among the embeddings.
Abstract
1. Introduction
2. Related Work
2.1. NER
2.2. Attention Mechanism
3. Model
3.1. Comprehensive-Embedding
3.2. BiLSTM Layer
3.3. Attention Layer
3.3.1. Global Self-Attention
3.3.2. Local Self-Attention
3.3.3. Multi-Head Self-Attention
3.4. CRF Layer
4. Experiments and Analysis
4.1. Evaluation Metrics
4.2. Datasets
4.2.1. MSRA Dataset
4.2.2. Weibo NER Corpus
4.3. Settings
4.4. Experiments and Analysis on the MSRA Dataset
4.5. Experiments and Analysis on the Weibo NER Corpus
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Peng, N.; Dredze, M. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 548–554.
- Chen, Y.; Xu, L.; Liu, K.; Zeng, D.; Zhao, J. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 167–176.
- Bunescu, R.C.; Mooney, R.J. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 724–731.
- Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1105–1116.
- Ratinov, L.; Roth, D.; Downey, D.; Anderson, M. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; Volume 1, pp. 1375–1384.
- Gupta, N.; Singh, S.; Roth, D. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 2681–2690.
- Yao, X.; Van Durme, B. Information extraction over structured data: Question answering with freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 956–966.
- Isozaki, H.; Kazawa, H. Efficient support vector classifiers for named entity recognition. In Proceedings of the 19th International Conference on Computational Linguistics—Volume 1; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–7.
- Kazama, J.; Makino, T.; Ohta, Y.; Tsujii, J. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain—Volume 3; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–8.
- Ekbal, A.; Bandyopadhyay, S. Named entity recognition using support vector machine: A language independent approach. Int. J. Electr. Comput. Eng. 2010, 4, 589–604.
- Zhou, G.; Su, J. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 473–480.
- Florian, R.; Ittycheriah, A.; Jing, H.; Zhang, T. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003—Volume 4; Association for Computational Linguistics: Stroudsburg, PA, USA, 2003; pp. 168–171.
- Lafferty, J.D.; McCallum, A.; Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2001; pp. 282–289.
- Sobhana, N.; Mitra, P.; Ghosh, S.K. Conditional random field based named entity recognition in geological text. Int. J. Comput. Appl. 2010, 1, 119–125.
- Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558.
- Jordan, M.I. Serial order: A parallel distributed processing approach. In Neural-Network Models of Cognition; Advances in Psychology Series Volume 121; Donahoe, J.W., Dorsel, V.P., Eds.; North-Holland Publishing: Amsterdam, The Netherlands, 1997; pp. 471–495.
- Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Chiu, J.P.C.; Nichols, E. Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Cao, P.; Chen, Y.; Liu, K.; Zhao, J.; Liu, S. Adversarial transfer learning for Chinese named entity recognition with self-attention mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 182–192.
- Chen, A.; Peng, F.; Shan, R.; Sun, G. Chinese named entity recognition with conditional probabilistic models. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2006; pp. 173–176.
- Zhou, J.; He, L.; Dai, X.; Chen, J. Chinese named entity recognition with a multi-phase model. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2006; pp. 213–216.
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1724–1734.
- He, H.; Sun, X. F-score driven max margin neural network for named entity recognition in Chinese social media. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 713–718.
- Zhu, Y.; Wang, G.; Karlsson, B.F. CAN-NER: Convolutional attention network for Chinese named entity recognition. arXiv 2019, arXiv:1904.02141.
- Peng, N.; Dredze, M. Improving named entity recognition for Chinese social media with word segmentation representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 149–155.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
- Pennington, J.; Socher, R.; Manning, C. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2204–2212.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 207–212.
- McCallum, A.; Freitag, D.; Pereira, F.C.N. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2000; pp. 591–598.
- Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
- Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 260–270.
- He, H.; Sun, X. A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3216–3222.
- Li, S.; Zhao, Z.; Hu, R.; Li, W.; Liu, T.; Du, X. Analogical reasoning on Chinese morphological and semantic relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 138–143.
- Han, A.L.-F.; Zeng, X.; Wong, D.F.; Chao, L.S. Chinese named entity recognition with graph-based semi-supervised learning model. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 15–20.
- Zhang, Y.; Yang, J. Chinese NER using lattice LSTM. arXiv 2018, arXiv:1805.02023.
| Dataset | Type | Train | Dev | Test |
|---|---|---|---|---|
| MSRA dataset | Sentences | 46.4 k | - | 4.4 k |
| | Characters | 2169.9 k | - | 172.6 k |
| | Entities | 74.8 k | - | 6.2 k |
| Weibo NER corpus | Sentences | 1.4 k | 0.27 k | 0.27 k |
| | Characters | 73.8 k | 14.8 k | 14.5 k |
| | Entities | 1.89 k | 0.42 k | 0.39 k |
| Corpus / Split | Name | Nominal | Total |
|---|---|---|---|
| Weibo NER corpus [1] | 1276 | 705 | 1981 |
| Weibo NER corpus as revised by [38] | 1321 | 1322 | 2643 |
| Weibo NER corpus as revised by [25] | 1319 | 1320 | 2639 |
| WeiboNER_2nd_conll | | | |
| Train set | 1018 | 859 | 1877 |
| Dev set | 167 | 219 | 386 |
| Test set | 216 | 196 | 412 |
| Total | 1401 | 1274 | 2675 |
| Function | Node Name | Model | CPU |
|---|---|---|---|
| Login node | 1–2 | I620-G30 | two Intel Xeon Gold 5118 CPUs @ 2.30 GHz |
| Compute node | 1–110 | CX50-G30 | two Intel Xeon Gold 6132 CPUs @ 2.60 GHz |
| Models | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Conditional Probabilistic Models [22] | 91.22 | 81.71 | 86.20 |
| Multi-Phase Model [23] | 88.94 | 84.20 | 86.51 |
| Graph-based Semi-supervised [40] | 90.62 | 77.84 | 83.74 |
| Adversarial Transfer Learning [21] | 91.73 | 89.58 | 90.64 |
| CAN-NER [26] | 93.53 | 92.42 | 92.97 |
| CWPC_BiAtt (ours) | 93.71 | 92.29 | 92.99 |
| Models | Name P (%) | Name R (%) | Name F1 (%) | Nominal P (%) | Nominal R (%) | Nominal F1 (%) | Overall F1 (%) |
|---|---|---|---|---|---|---|---|
| Joint (cp) without fine-tuning [1] | 61.9 | 33.5 | 43.5 | 65.9 | 28.1 | 39.4 | - |
| Joint (cp) with fine-tuning [1] | 58.0 | 35.6 | 44.1 | 63.8 | 29.5 | 40.4 | - |
| Jointly Train LSTM + Emb [27] | 63.3 | 39.2 | 48.4 | 58.6 | 37.4 | 45.7 | - |
| F-Score Driven (proposal) [25] | 66.9 | 40.7 | 50.6 | 66.5 | 53.6 | 59.3 | 54.8 |
| Unified Model (proposal) [38] | 61.7 | 48.8 | 54.5 | 74.1 | 53.5 | 62.2 | 58.2 |
| Adversarial Transfer Learning [21] | 59.5 | 50.0 | 54.3 | 71.4 | 47.9 | 57.4 | 58.7 |
| Lattice LSTM [41] | - | - | 53.0 | - | - | 62.3 | 58.8 |
| CAN-NER [26] | - | - | 55.4 | - | - | 63.0 | 59.3 |
| CWPC_BiAtt (Ours) | 63.4 | 49.7 | 55.7 | 74.1 | 54.5 | 62.8 | 59.5 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).