Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili, and Magahi

Published: 26 May 2023

Abstract

In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary tasks: it marks proper nouns and other named entities such as Location, Person, Organization, Disease, and so on. Without an NER module, such entities adversely affect the performance of a machine translation system. NER helps overcome this problem by recognizing such entities so that they can be handled separately, and it is also useful in Information Extraction systems. Bhojpuri, Maithili, and Magahi are low resource languages, usually known as Purvanchal languages. This article focuses on the development of an NER benchmark dataset for Machine Translation systems that translate from these languages to Hindi, built by annotating parts of the available corpora with named entities. Bhojpuri, Maithili, and Magahi corpora of 228,373, 157,468, and 56,190 tokens, respectively, were annotated using 22 entity labels. The annotation uses coarse-grained labels, following the tagset used in one of the Hindi NER datasets. We also report a Deep Learning baseline that uses an LSTM-CNNs-CRF model. The lower baseline F1-scores, obtained from an NER tool based on Conditional Random Fields (CRF) models, are 70.56% for Bhojpuri, 73.19% for Maithili, and 84.18% for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieved 61.41% for Bhojpuri, 71.38% for Maithili, and 86.39% for Magahi. As the results show, LSTM-CNNs-CRF fails to outperform the lower baseline for Bhojpuri and Maithili, which have more data in terms of the number of tokens but not in terms of the number of named entities. However, cross-lingual training of the LSTM-CNNs-CRF model for Bhojpuri and Maithili performed better than the CRF.
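
The comparison reported in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the F1-scores and token counts stated above; the names `REPORTED_F1`, `TOKENS`, and `f1_gap` are illustrative and do not come from the article.

```python
# F1-scores (%) exactly as reported in the abstract.
REPORTED_F1 = {
    "Bhojpuri": {"CRF": 70.56, "LSTM-CNNs-CRF": 61.41},
    "Maithili": {"CRF": 73.19, "LSTM-CNNs-CRF": 71.38},
    "Magahi": {"CRF": 84.18, "LSTM-CNNs-CRF": 86.39},
}

# Annotated corpus sizes in tokens, as reported in the abstract.
TOKENS = {"Bhojpuri": 228_373, "Maithili": 157_468, "Magahi": 56_190}


def f1_gap(language: str) -> float:
    """Return LSTM-CNNs-CRF F1 minus CRF F1, in percentage points."""
    scores = REPORTED_F1[language]
    return round(scores["LSTM-CNNs-CRF"] - scores["CRF"], 2)


def languages_where_neural_wins() -> list[str]:
    """List the languages where the neural model beats the CRF baseline."""
    return [lang for lang in REPORTED_F1 if f1_gap(lang) > 0]


if __name__ == "__main__":
    for lang in REPORTED_F1:
        print(f"{lang}: {TOKENS[lang]} tokens, gap = {f1_gap(lang):+.2f} points")
    print("Neural model ahead for:", languages_where_neural_wins())
```

A negative gap means the CRF baseline scored higher. On the reported numbers, the monolingual LSTM-CNNs-CRF model is ahead only for Magahi, the smallest corpus by token count.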

Cited By

  • (2024) A hybrid approach for Bengali sentence validation. Artificial Intelligence Review 57:11. DOI: 10.1007/s10462-024-10795-2. Online publication date: 7-Oct-2024
  • (2023) Design of Machine Automatic Translation System Based on Artificial Intelligence. 2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE), 1-5. DOI: 10.1109/AIKIIE60097.2023.10390173. Online publication date: 2-Nov-2023

          Published In

          ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1
          January 2023
          340 pages
          ISSN:2375-4699
          EISSN:2375-4702
          DOI:10.1145/3572718

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 26 May 2023
          Online AM: 18 July 2022
          Accepted: 24 April 2022
          Revised: 11 March 2022
          Received: 04 October 2020
          Published in TALLIP Volume 22, Issue 1

          Author Tags

          1. Indo-Aryan languages
          2. low resource languages
          3. Purvanchal languages
          4. Bhojpuri
          5. Maithili
          6. Magahi
          7. named entity recognition
          8. Conditional Random Fields
          9. Deep Learning

          Qualifiers

          • Research-article

          Article Metrics

          • Downloads (Last 12 months)175
          • Downloads (Last 6 weeks)18
          Reflects downloads up to 16 Feb 2025
