short-paper

Urdu Named Entity Recognition and Classification System Using Artificial Neural Network

Published: 15 September 2017

Abstract

Named Entity Recognition and Classification (NERC) is the process of identifying named entities in text and classifying them into categories such as person names, location names, and organization names. In this article, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, annotated for three entity types (Person, Organization, and Location), with all remaining tokens marked as Others (O). We use two supervised learning algorithms, Hidden Markov Model (HMM) and Artificial Neural Network (ANN), to develop the Urdu NERC system. We annotate a 652,852-token corpus drawn from 15 different genres, containing a total of 44,480 NEs. The inter-annotator agreement between the two annotators, measured by the Kappa (κ) statistic, is 73.41%. With HMM, the highest recorded precision, recall, and f-measure values are 55.98%, 83.11%, and 66.90%, respectively; with ANN, they are 81.05%, 87.54%, and 84.17%, respectively.
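The two evaluation measures named in the abstract can be illustrated with a short sketch. It assumes that the f-measure is the balanced harmonic mean of precision and recall (F1) and that the Kappa statistic is the two-annotator Cohen's kappa; the abstract does not spell out either formula. The function names and the toy label sequences are hypothetical and are not taken from the KPU-NE corpus.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's tag proportions, summed over tags.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Precision and recall figures reported in the abstract (percentages).
print(round(f_measure(55.98, 83.11), 2))  # HMM
print(round(f_measure(81.05, 87.54), 2))  # ANN

# Hypothetical toy annotations, for illustration only.
annotator_1 = ["PER", "O", "LOC", "O", "ORG", "O"]
annotator_2 = ["PER", "O", "O", "O", "ORG", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))
```

Under the F1 assumption, the reported precision and recall pairs reproduce the reported f-measure values to two decimal places, which suggests the balanced form was used.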

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 1
March 2018
152 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3141228
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 September 2017
Accepted: 01 July 2017
Revised: 01 June 2017
Received: 01 February 2016
Published in TALLIP Volume 17, Issue 1


Author Tags

  1. Deep Learning
  2. NER Data
  3. NER using Deep Learning
  4. Resource Poor Languages
  5. Urdu POS tagged Data
  6. Urdu word2vec

Qualifiers

  • Short-paper
  • Research
  • Refereed


Cited By

  • (2024) Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture. ACM Transactions on Asian and Low-Resource Language Information Processing 23:4 (1-38). DOI: 10.1145/3648362. Online publication date: 15-Feb-2024
  • (2024) SEEUNRS: Semantically Enriched Entity-Based Urdu News Recommendation System. ACM Transactions on Asian and Low-Resource Language Information Processing 23:3 (1-13). DOI: 10.1145/3639049. Online publication date: 9-Mar-2024
  • (2024) Advancing NLP for Punjabi Language: A Comprehensive Review of Language Processing Challenges and Opportunities. 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT) (1250-1257). DOI: 10.1109/IDCIoT59759.2024.10467354. Online publication date: 4-Jan-2024
  • (2024) Fine-tuning Urdu NER Models Using Context-Aware Embeddings. 2024 14th International Conference on Software Technology and Engineering (ICSTE) (133-137). DOI: 10.1109/ICSTE63875.2024.00030. Online publication date: 16-Aug-2024
  • (2024) Extraction and attribution of public figures statements for journalism in Indonesia using deep learning. Knowledge-Based Systems 289:C. DOI: 10.1016/j.knosys.2024.111558. Online publication date: 8-Apr-2024
  • (2023) Using Data Augmentation and Bidirectional Encoder Representations from Transformers for Improving Punjabi Named Entity Recognition. ACM Transactions on Asian and Low-Resource Language Information Processing 22:6 (1-13). DOI: 10.1145/3595861. Online publication date: 16-Jun-2023
  • (2023) Analysis of Cursive Text Recognition Systems: A Systematic Literature Review. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7 (1-30). DOI: 10.1145/3592600. Online publication date: 13-Apr-2023
  • (2022) Author Gender Identification for Urdu Articles. Computational and Corpus-Based Phraseology (221-235). DOI: 10.1007/978-3-031-15925-1_16. Online publication date: 21-Sep-2022
  • (2021) Event classification from the Urdu language text on social media. PeerJ Computer Science 7 (e775). DOI: 10.7717/peerj-cs.775. Online publication date: 18-Nov-2021
  • (2021) Authorship Attribution for a Resource Poor Language—Urdu. ACM Transactions on Asian and Low-Resource Language Information Processing 21:3 (1-23). DOI: 10.1145/3487061. Online publication date: 13-Dec-2021
