Generating Chinese named entity data from parallel corpora

Ruiji Fu¹, Bing Qin¹ & Ting Liu¹

Abstract

Annotating training corpora for named entity recognition (NER) is a costly but necessary step for supervised NER approaches. This paper presents a general framework for generating large-scale NER training data from parallel corpora. In our method, we first apply a high-performance NER system to one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies for selecting high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
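The projection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `project_entities`, the span and alignment representations, the BIO output scheme, and the contiguity filter are all assumptions made for illustration. The abstract does not spell out the selection strategies, so the filter below is only one plausible heuristic for discarding unreliable projections.

```python
from typing import List, Set, Tuple

# Hypothetical representations (assumptions, not from the paper):
#   entity span = (start, end_exclusive, label) over source tokens
#   alignment   = (source_index, target_index) word-level link
Span = Tuple[int, int, str]
Link = Tuple[int, int]


def project_entities(src_entities: List[Span],
                     alignments: List[Link],
                     tgt_len: int) -> List[str]:
    """Project source-side entity spans onto target tokens as BIO tags."""
    tgt_labels = ["O"] * tgt_len
    for start, end, label in src_entities:
        # Collect every target token linked to a token inside the span.
        linked: Set[int] = {t for s, t in alignments if start <= s < end}
        if not linked:
            continue  # entity has no aligned target tokens
        lo, hi = min(linked), max(linked)
        # Assumed filter: keep only projections that form one contiguous block.
        if hi - lo + 1 != len(linked):
            continue
        # Skip projections that would overwrite an earlier entity.
        if any(tgt_labels[i] != "O" for i in range(lo, hi + 1)):
            continue
        tgt_labels[lo] = "B-" + label
        for i in range(lo + 1, hi + 1):
            tgt_labels[i] = "I-" + label
    return tgt_labels


# Toy example: "Ting Liu works in Harbin" aligned to "刘挺 在 哈尔滨 工作".
english_entities = [(0, 2, "PER"), (4, 5, "LOC")]
word_links = [(0, 0), (1, 0), (2, 3), (3, 1), (4, 2)]
print(project_entities(english_entities, word_links, tgt_len=4))
# prints ['B-PER', 'O', 'B-LOC', 'O']
```

In a real pipeline the entity spans would come from an English NER system and the links from a word aligner. Sentences whose projections are dropped by such filters could simply be excluded from the generated training data.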

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Ruiji Fu, Bing Qin & Ting Liu

Corresponding author

Correspondence to Ting Liu.

Additional information

Ruiji Fu is a PhD candidate at Harbin Institute of Technology (HIT), China. He received his MS and BS both in Computer Science from HIT, in 2009 and 2007 respectively. His research interests include natural language processing, text mining, and open information extraction.

Bing Qin received her PhD in computer science from Harbin Institute of Technology (HIT), China in 2005. She is a full professor in the School of Computer Science and Technology, HIT. Her research interests include natural language processing, text mining, and opinion mining.

Ting Liu received his PhD in computer science from Harbin Institute of Technology (HIT), China in 1998. He is a full professor in the School of Computer Science and Technology, HIT. His current research interests include natural language processing, information retrieval, and social computing.

About this article

Cite this article

Fu, R., Qin, B. & Liu, T. Generating Chinese named entity data from parallel corpora. Front. Comput. Sci. 8, 629–641 (2014). https://doi.org/10.1007/s11704-014-3127-5

Received: 12 April 2013
Accepted: 13 December 2013
Published: 25 April 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s11704-014-3127-5