research-article
Open access

A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation

Published: 07 May 2019

Abstract

Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word sense ambiguity, and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare, and evaluate WSD techniques. These are available for many languages, but not for Urdu, despite Urdu being a language with more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains 5,042 words of Urdu running text in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams is applied to the corpus, and the best performance (accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.
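
The abstract does not spell out how the n-gram baselines work, so the following is only a minimal sketch of one plausible word n-gram baseline, not the authors' method. It assumes that each ambiguous word is tagged with the sense most often seen in training for the word n-gram ending at that word, backing off to shorter n-grams and finally to the word's most frequent sense. The class name `NgramSenseTagger`, the helper names, the sense labels, and the English toy sentences are all hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def context_ngram(tokens, i, n):
    """Return the word n-gram ending at position i, left-padded with "<s>"."""
    start = i - n + 1
    pad = ["<s>"] * max(0, -start)
    return tuple(pad + tokens[max(0, start): i + 1])


class NgramSenseTagger:
    """Hypothetical baseline: most frequent sense per n-gram, with back-off."""

    def __init__(self, n=4):
        self.n = n
        self.by_ngram = defaultdict(Counter)  # n-gram tuple -> sense counts
        self.by_word = defaultdict(Counter)   # target word  -> sense counts

    def train(self, sentences):
        # Each sentence is a list of (token, sense) pairs; sense is None
        # for words that are not ambiguous and therefore carry no tag.
        for sent in sentences:
            tokens = [tok for tok, _ in sent]
            for i, (tok, sense) in enumerate(sent):
                if sense is None:
                    continue
                self.by_word[tok][sense] += 1
                for k in range(self.n, 1, -1):
                    self.by_ngram[context_ngram(tokens, i, k)][sense] += 1

    def predict(self, tokens, i):
        # Try the longest n-gram first, then back off to shorter ones.
        for k in range(self.n, 1, -1):
            counts = self.by_ngram.get(context_ngram(tokens, i, k))
            if counts:
                return counts.most_common(1)[0][0]
        # Final back-off: most frequent sense of the word itself.
        counts = self.by_word.get(tokens[i])
        return counts.most_common(1)[0][0] if counts else None


def accuracy(tagger, sentences):
    """Fraction of sense-tagged tokens whose predicted sense matches the gold sense."""
    correct = total = 0
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        for i, (_, gold) in enumerate(sent):
            if gold is None:
                continue
            total += 1
            correct += tagger.predict(tokens, i) == gold
    return correct / total if total else 0.0


# Toy English data (hypothetical; stands in for sense-tagged Urdu text).
train_data = [
    [("she", None), ("sat", None), ("by", None), ("the", None),
     ("river", None), ("bank", "river_edge")],
    [("he", None), ("opened", None), ("a", None),
     ("savings", None), ("bank", "financial")],
    [("the", None), ("savings", None), ("bank", "financial"), ("closed", None)],
]
held_out = [
    [("the", None), ("river", None), ("bank", "river_edge")],
    [("my", None), ("bank", "river_edge")],
]

tagger = NgramSenseTagger(n=4)
tagger.train(train_data)
print(tagger.predict(["a", "savings", "bank"], 2))
print(accuracy(tagger, held_out))
```

Accuracy here is simply the share of ambiguous instances that receive the gold sense, which is the kind of figure the abstract reports. The back-off order (4-gram, 3-gram, 2-gram, then the word alone) is a design choice of this sketch; the paper may combine or compare n-gram sizes differently.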


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 18, Issue 4
    December 2019
    305 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3327969
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 May 2019
    Accepted: 01 February 2019
    Revised: 01 November 2018
    Received: 01 August 2018
    Published in TALLIP Volume 18, Issue 4

    Author Tags

    1. Word sense disambiguation
    2. all-words task
    3. sense tagged Urdu corpus

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • AHRC and ESRC

    Cited By

    • (2023) Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal 12:1 (e31084). DOI: 10.14201/adcaij.31084. Online publication date: 1-Nov-2023
    • (2023) Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. ACM Transactions on Asian and Low-Resource Language Information Processing 22:6 (1-32). DOI: 10.1145/3582496. Online publication date: 16-Feb-2023
    • (2023) Identifying Landscape Relevant Natural Language using Actively Crowdsourced Landscape Descriptions and Sentence-Transformers. KI - Künstliche Intelligenz 37:1 (55-67). DOI: 10.1007/s13218-022-00793-3. Online publication date: 20-Jan-2023
    • (2023) Advances Toward Word-Sense Disambiguation. International Conference on Innovative Computing and Communications (177-189). DOI: 10.1007/978-981-99-4071-4_15. Online publication date: 26-Oct-2023
    • (2022) A Semantic Similarity-Based Identification Method for Implicit Citation Functions and Sentiments Information. Information 13:11 (546). DOI: 10.3390/info13110546. Online publication date: 17-Nov-2022
    • (2022) CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning. Intelligent Systems with Applications 16 (200129). DOI: 10.1016/j.iswa.2022.200129. Online publication date: Nov-2022
    • (2021) Extraction of Opinion Target Using Syntactic Rules in Urdu Text. Intelligent Automation & Soft Computing 29:3 (839-853). DOI: 10.32604/iasc.2021.018572. Online publication date: 2021
    • (2021) Investigating the Feasibility of Deep Learning Methods for Urdu Word Sense Disambiguation. ACM Transactions on Asian and Low-Resource Language Information Processing 21:2 (1-16). DOI: 10.1145/3477578. Online publication date: 31-Oct-2021
    • (2021) UrduAI: Writeprints for Urdu Authorship Identification. ACM Transactions on Asian and Low-Resource Language Information Processing 21:2 (1-18). DOI: 10.1145/3476467. Online publication date: 31-Oct-2021
