Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili, and Magahi

Authors:

Anil Kumar Singh

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1

Article No.: 29, Pages 1 - 20

https://doi.org/10.1145/3533428

Published: 26 May 2023

Abstract

In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary tasks: it marks proper nouns and other named entities such as Location, Person, Organization, and Disease. Without an NER module, such entities adversely affect the performance of a machine translation system. NER helps overcome this problem by recognizing such entities so that they can be handled separately; it is also useful in Information Extraction systems. Bhojpuri, Maithili, and Magahi are low resource languages, usually known as Purvanchal languages. This article focuses on developing an NER benchmark dataset, intended for Machine Translation systems that translate from these languages to Hindi, by annotating parts of the available corpora with named entities. Bhojpuri, Maithili, and Magahi corpora of sizes 228,373, 157,468, and 56,190 tokens, respectively, were annotated using 22 entity labels. The annotation uses coarse-grained labels followed by the tagset used in one of the Hindi NER datasets. We also report a Deep Learning baseline that uses an LSTM-CNNs-CRF model. The lower-baseline F₁-scores, obtained from an NER tool that uses Conditional Random Fields (CRF) models, are 70.56% for Bhojpuri, 73.19% for Maithili, and 84.18% for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieved 61.41% for Bhojpuri, 71.38% for Maithili, and 86.39% for Magahi. As the results show, LSTM-CNNs-CRF fails to outperform the lower baseline for Bhojpuri and Maithili, which have more data in terms of the number of tokens but not in terms of the number of named entities. However, cross-lingual training of LSTM-CNNs-CRF for Bhojpuri and Maithili performed better than the CRF.
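
The comparison reported in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the F₁-scores are copied from the abstract, while the names `CRF_F1`, `LSTM_CNN_CRF_F1`, and `f1_gaps` are hypothetical and do not come from the article. The cross-lingual results are not included because the abstract gives no figures for them.

```python
# F1-scores (%) exactly as reported in the abstract.
# Dictionary and function names are illustrative assumptions, not from the article.
CRF_F1 = {"Bhojpuri": 70.56, "Maithili": 73.19, "Magahi": 84.18}
LSTM_CNN_CRF_F1 = {"Bhojpuri": 61.41, "Maithili": 71.38, "Magahi": 86.39}


def f1_gaps(neural, baseline):
    """Return neural minus baseline F1, in percentage points, per language."""
    return {lang: round(neural[lang] - baseline[lang], 2) for lang in baseline}


gaps = f1_gaps(LSTM_CNN_CRF_F1, CRF_F1)
for lang, gap in gaps.items():
    # A positive gap means the LSTM-CNNs-CRF score is the higher one.
    winner = "LSTM-CNNs-CRF" if gap > 0 else "CRF"
    print(f"{lang}: {gap:+.2f} points ({winner} higher)")
```

The gap is negative for Bhojpuri (about 9.15 points) and Maithili (about 1.81 points) and positive for Magahi (about 2.21 points), which matches the abstract's statement that the deep learning model trails the CRF baseline on the two larger corpora.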
