Abstract
Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from parallel corpora. In our method, we first employ a high performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
Similar content being viewed by others
References
Zhou G, Su J. Named entity recognition using an hmm-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002, 473–480
Chieu H L, Ng H T. Named entity recognition: a maximum entropy approach using global information. In: Proceedings of the 19th International Conference on Computational Linguistics. 2002, 1: 1–7
Takeuchi K, Collier N. Use of support vector machines in extended named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning. 2002, 20: 1–7
Settles B. Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004, 104–107
Florian R, Ittycheriah A, Jing H, Zhang T. Named entity recognition through classifier combination. In: Proceedings of the 7th Conference on Natural Language Learning. 2003, 4: 168–171
Klein D, Smarr J, Nguyen H, Manning C D. Named entity recognition with character-level models. In: Proceedings of the 7th Conference on Natural Language Learning. 2003, 4: 180–183
Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C. Exploring the boundaries: gene and protein identification in biomedical text. BMC Bioinformatics, 2005, 6(Suppl 1): S5
Ciaramita M, Altun Y. Named-entity recognition in novel domains with external lexical knowledge. In: Proceedings of Human Language Technologies: the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. 2005, 209–212
Resnik P, Smith N A. The web as a parallel corpus. Computational Linguistics, 2003, 29(3): 349–380
Zhang Y, Wu K, Gao J, Vines P. Automatic acquisition of chinese-english parallel corpus from the web. In: Proceedings of the 28th European Conference on Advances in Information Retrieval. 2006, 420-431
Fu R, Qin B, Liu T. Generating Chinese named entity data from a parallel corpus. In: Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011, 264–272
Yarowsky D, Ngai G, Wicentowski R. Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the 1st International Conference on Human Language Technology Research. 2001, 1–8
Huang F, Vogel S. Improved named entity translation and bilingual named entity extraction. In: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces. 2002, 253–258
Burkett D, Petrov S, Blitzer J, Klein D. Learning better monolingual models with unannotated bilingual text. In: Proceedings of the 14th Conference on Computational Natural Language Learning. 2010, 46–54
Hassan A, Fahmy H, Hassan H. Improving named entity translation by exploiting comparable and parallel corpora. In: Proceedings of the 2007 Workshop on Acquisition and Management of Multilingual Lexicons. 2007, 1–6
An J, Lee S, Lee G G. Automatic acquisition of named entity tagged corpus from world wide web. In: Proceedings of the 41st AnnualMeeting on Association for Computational Linguistics. 2003, 2: 165–168
Whitelaw C, Kehlenbeck A, Petrovic N, Ungar L. Web-scale named entity recognition. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. 2008, 123–132
Richman A E, Schone P. MiningWiki resources for multilingual named entity recognition. In: Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics. 2008, 1–9
Nothman J, Curran J R, Murphy T. Transforming wikipedia into named entity training data. In: Proceedings of the 2008 Australian Language Technology Workshop. 2008, 124–132
Vlachos A, Gasperin C. Bootstrapping and evaluating named entity recognition in the biomedical domain. In: Proceedings of the 2006 HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. 2006, 138–145
Ma X. Champollion: A robust parallel text sentence aligner. In: Proceedings of the 5th International Conference on Language Resources and Evaluation. 2006, 489–492
Och F J, Ney H. Improved statistical alignment models. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. 2000, 440–447
Koehn P, Och F J, Marcu D. Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 2003, 1: 48–54
Lafferty J. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. 2001, 282–289
Grishman R, Sundheim B. Message understanding conference-6: a brief history. In: Proceedings of the 1996 International Conference on Computational Linguistics. 1996, 466–471
Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investigationes, 2007, 30(1): 3–26
De Sitter A, Calders T, Daelemans W. A formal framework for evaluation of information extraction. Technical Report TR 2004-0, Department of Mathematics and Computer Science, University of Antwerp, 2004
Che W, Li Z, Liu T. LTP: a Chinese language technology platform. In: Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. 2010, 13–16
Wu Y, Zhao J, Xu B, Yu H. Chinese named entity recognition based on multiple features. In: Proceedings of the 2005 Conference on Human Language Technology and Empirical Methods in Natural Language Processing. 2005, 427–434
Zhang Y, Vogel S, Waibel A. Interpreting bleu/nist scores: how much improvement do we need to have a better system. In: Proceedings of the 2004 International Conference on Language Resources and Evaluation. 2004, 2051–2054
Author information
Authors and Affiliations
Corresponding author
Additional information
Ruiji Fu is a PhD candidate at Harbin Institute of Technology (HIT), China. He received his MS and BS both in Computer Science from HIT, in 2009 and 2007 respectively. His research interests include natural language processing, text mining, and open information extraction.
Bing Qin received her PhD in computer science from Harbin Institute of Technology (HIT), China in 2005. She is a full professor in the School of Computer Science and Technology, HIT. Her research interests include natural language processing, text mining, and opinion mining.
Ting Liu received his PhD in computer science from Harbin Institute of Technology (HIT), China in 1998. He is a full professor in the School of Computer Science and Technology, HIT. His current research interests include natural language processing, information retrieval, and social computing.
Rights and permissions
About this article
Cite this article
Fu, R., Qin, B. & Liu, T. Generating Chinese named entity data from parallel corpora. Front. Comput. Sci. 8, 629–641 (2014). https://doi.org/10.1007/s11704-014-3127-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11704-014-3127-5