Abstract
Part-of-speech (POS) tagging, the assignment of syntactic categories to words in running text, is an important preliminary task in natural language processing applications such as speech processing and information extraction. Urdu language processing is challenging because various Urdu POS tags behave differently in different contexts (morphosyntactic ambiguity). This paper addresses this challenge by developing a novel tagging approach using linear-chain conditional random fields (CRF). Our work is the first instance of a CRF approach for Urdu POS tagging. The proposed model employs a strong, stable and balanced feature set that combines language-independent and language-dependent features. The language-dependent features are the part-of-speech tag of the previous word and the suffix of the current word, while the language-independent feature is the 'context words window'. Our approach was evaluated on two benchmark datasets against support vector machine techniques for Urdu POS tagging, which are considered the state of the art. The results show that our CRF approach improves upon the F-measure of prior attempts by 8.3–8.5%.
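To make the feature set named in the abstract concrete, the following is a minimal sketch of how the three feature types (previous-word tag, current-word suffix, and a context words window) might be assembled for one token. It is not the authors' implementation: the function name `token_features`, the window size, the suffix length, the padding and boundary symbols, the romanized example tokens, and the tag label `PRON` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def token_features(
    words: List[str],
    i: int,
    prev_tag: Optional[str],
    window: int = 2,       # assumed window size, not taken from the paper
    suffix_len: int = 2,   # assumed suffix length, not taken from the paper
) -> Dict[str, str]:
    """Build a feature dictionary for the word at position i."""
    word = words[i]
    feats = {
        "word": word,
        # Language-dependent: suffix of the current word.
        "suffix": word[-suffix_len:],
        # Language-dependent: POS tag of the previous word
        # ("<BOS>" is a hypothetical sentence-start marker).
        "prev_tag": prev_tag if prev_tag is not None else "<BOS>",
    }
    # Language-independent: context words window around position i.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        in_range = 0 <= j < len(words)
        feats[f"word[{offset:+d}]"] = words[j] if in_range else "<PAD>"
    return feats


# Illustrative romanized tokens; the tag "PRON" is a placeholder label.
sentence = ["yeh", "kitab", "achi", "hai"]
features = token_features(sentence, 1, prev_tag="PRON")
```

Dictionaries of this shape, one per token, are the kind of input that common CRF toolkits accept. In a linear-chain CRF the dependency on the previous tag is usually also captured by the model's transition features, so the explicit `prev_tag` entry here only illustrates the feature described in the abstract.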
Cite this article
Khan, W., Daud, A., Nasir, J.A. et al. Urdu part of speech tagging using conditional random fields. Lang Resources & Evaluation 53, 331–362 (2019). https://doi.org/10.1007/s10579-018-9439-6
DOI: https://doi.org/10.1007/s10579-018-9439-6