short-paper

Urdu Named Entity Recognition and Classification System Using Artificial Neural Network

Author:

Muhammad Kamran Malik

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 17, Issue 1

Article No.: 2, Pages 1 - 13

https://doi.org/10.1145/3129290

Published: 15 September 2017

Abstract

Named Entity Recognition and Classification (NERC) is the process of identifying words in text and classifying them into categories such as person names, location names, and organization names. In this article, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, annotated for three entity types (Person, Organization, and Location), with all remaining tokens marked as Others (O). We use two supervised learning algorithms, Hidden Markov Model (HMM) and Artificial Neural Network (ANN), to develop the Urdu NERC system. We annotate a 652,852-token corpus drawn from 15 different genres, containing a total of 44,480 NEs. The inter-annotator agreement between the two annotators, measured by the kappa (κ) statistic, is 73.41%. With HMM, the highest recorded precision, recall, and F-measure values are 55.98%, 83.11%, and 66.90%, respectively; with ANN, they are 81.05%, 87.54%, and 84.17%, respectively.
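
The reported F-measure values can be checked against the reported precision and recall. The sketch below assumes that "f-measure" means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, so this is an assumption. The function name `f_measure` and the dictionary `reported` are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall.

    Inputs and output share the same unit (here, percent).
    Assumption: the paper's F-measure is the balanced (beta = 1) variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Highest precision and recall (in percent) as reported in the abstract.
reported = {
    "HMM": (55.98, 83.11),
    "ANN": (81.05, 87.54),
}

for model, (p, r) in reported.items():
    print(f"{model}: F = {f_measure(p, r):.2f}%")
```

Under this assumption, the computed values agree to two decimal places with the figures in the abstract (66.90% for HMM and 84.17% for ANN), which suggests the reported precision, recall, and F-measure for each model come from the same run.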

Index Terms

Urdu Named Entity Recognition and Classification System Using Artificial Neural Network
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 17, Issue 1

March 2018

152 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3141228

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 September 2017

Accepted: 01 July 2017

Revised: 01 June 2017

Received: 01 February 2016

Published in TALLIP Volume 17, Issue 1

Qualifiers

Short-paper
Research
Refereed

Citations

Cited By

Ahmed A, Huang D, Arafat S, Hameed I. (2024). Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture. ACM Transactions on Asian and Low-Resource Language Information Processing 23:4 (1-38). DOI: 10.1145/3648362. Online publication date: 15-Feb-2024.
https://dl.acm.org/doi/10.1145/3648362
Kanwal S, Malik M, Nawaz Z, Mehmood K. (2024). SEEUNRS: Semantically Enriched Entity-Based Urdu News Recommendation System. ACM Transactions on Asian and Low-Resource Language Information Processing 23:3 (1-13). DOI: 10.1145/3639049. Online publication date: 9-Mar-2024.
https://dl.acm.org/doi/10.1145/3639049
Singh G, Bhandari R, Singh P. (2024). Advancing NLP for Punjabi Language: A Comprehensive Review of Language Processing Challenges and Opportunities. 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT) (1250-1257). DOI: 10.1109/IDCIoT59759.2024.10467354. Online publication date: 4-Jan-2024.
https://doi.org/10.1109/IDCIoT59759.2024.10467354
Azhar N, Latif S, Arshad S. (2024). Fine-tuning Urdu NER Models Using Context-Aware Embeddings. 2024 14th International Conference on Software Technology and Engineering (ICSTE) (133-137). DOI: 10.1109/ICSTE63875.2024.00030. Online publication date: 16-Aug-2024.
https://doi.org/10.1109/ICSTE63875.2024.00030
Purnomo W.P. Y, Kumar Y, Zulkarnain N, Raza B. (2024). Extraction and attribution of public figures statements for journalism in Indonesia using deep learning. Knowledge-Based Systems 289:C. DOI: 10.1016/j.knosys.2024.111558. Online publication date: 8-Apr-2024.
https://dl.acm.org/doi/10.1016/j.knosys.2024.111558
Khalid H, Murtaza G, Abbas Q. (2023). Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition. ACM Transactions on Asian and Low-Resource Language Information Processing 22:6 (1-13). DOI: 10.1145/3595861. Online publication date: 16-Jun-2023.
https://dl.acm.org/doi/10.1145/3595861
Khan S, Nazir S, Khan H. (2023). Analysis of Cursive Text Recognition Systems: A Systematic Literature Review. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7 (1-30). DOI: 10.1145/3592600. Online publication date: 13-Apr-2023.
https://dl.acm.org/doi/10.1145/3592600
Sarwar R. (2022). Author Gender Identification for Urdu Articles. Computational and Corpus-Based Phraseology (221-235). DOI: 10.1007/978-3-031-15925-1_16. Online publication date: 21-Sep-2022.
https://doi.org/10.1007/978-3-031-15925-1_16
Awan M, Kajla N, Firdous A, Husnain M, Missen M. (2021). Event classification from the Urdu language text on social media. PeerJ Computer Science 7 (e775). DOI: 10.7717/peerj-cs.775. Online publication date: 18-Nov-2021.
https://doi.org/10.7717/peerj-cs.775
Nazir Z, Shahzad K, Malik M, Anwar W, Bajwa I, Mehmood K. (2021). Authorship Attribution for a Resource Poor Language—Urdu. ACM Transactions on Asian and Low-Resource Language Information Processing 21:3 (1-23). DOI: 10.1145/3487061. Online publication date: 13-Dec-2021.
https://dl.acm.org/doi/10.1145/3487061
