Abstract
The objective of this work is to develop a part-of-speech (POS) tagger for the Arabic language. The tagger uses a very rich tag set that gives syntactic information about the proclitics attached to words. It combines a probabilistic model with a morphological analyzer to identify the correct tag in context. Most published research on probabilistic tagging relies only on a training corpus to find the probable tags for each word, which can limit performance. In this paper, we propose a method that also takes into account tags that are not included in the training data. These tags are proposed by the Alkhalil_Morpho_Sys analyzer (Bebah et al. 2011). We show that this significantly increases the accuracy of the morphosyntactic analysis. In addition, the adopted tag set contains compound tags that make it possible to analyze the proclitics attached to words.
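The idea of widening each word's candidate tags beyond those seen in training can be illustrated with a small sketch. It is not the authors' implementation. It assumes a bigram hidden Markov model with additive smoothing, decoded with the Viterbi algorithm. The function `toy_analyzer` is a hypothetical stand-in for a morphological analyzer and does not reflect the Alkhalil_Morpho_Sys interface. The tag labels (`VERB`, `NOUN`, `PREP+NOUN`) and the transliterated words are illustrative only.

```python
import math
from collections import defaultdict

def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counts, context, event, n_events, alpha=0.1):
    """Additively smoothed log P(event | context); assumed smoothing scheme."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * n_events))

def candidate_tags(word, emit, analyzer):
    """Union of tags seen with the word in training and tags the analyzer proposes."""
    seen = {tag for tag, words in emit.items() if word in words}
    return seen | set(analyzer(word))

def viterbi(words, trans, emit, analyzer, tagset, vocab_size):
    """Return the most probable tag sequence over the widened candidate sets."""
    if not words:
        return []
    n_tags = len(tagset)
    best = []  # best[i][tag] = (log score, previous tag)
    for i, word in enumerate(words):
        # Fall back to the full tag set if neither source proposes a tag.
        cands = candidate_tags(word, emit, analyzer) or set(tagset)
        column = {}
        for tag in cands:
            e = log_prob(emit, tag, word, vocab_size)
            if i == 0:
                column[tag] = (log_prob(trans, "<s>", tag, n_tags) + e, None)
            else:
                score, prev = max(
                    (best[i - 1][p][0] + log_prob(trans, p, tag, n_tags), p)
                    for p in best[i - 1]
                )
                column[tag] = (score + e, prev)
        best.append(column)
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy data: the compound tag PREP+NOUN never occurs in the training corpus.
corpus = [
    [("kataba", "VERB"), ("alwalad", "NOUN")],
    [("qaraa", "VERB"), ("kitab", "NOUN")],
]
toy_lexicon = {"bilqalam": ["PREP+NOUN"]}  # hypothetical analyzer output

def toy_analyzer(word):
    return toy_lexicon.get(word, [])

trans, emit = train(corpus)
tagset = {"VERB", "NOUN", "PREP+NOUN"}
vocab_size = len({w for s in corpus for w, _ in s}) + 1  # +1 for unseen words
result = viterbi(["kataba", "bilqalam"], trans, emit, toy_analyzer, tagset, vocab_size)
print(result)
```

In this toy run the word `bilqalam` is absent from the training corpus, so its only candidate tag comes from the analyzer. A tagger restricted to corpus-observed tags could not have proposed it.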
References
Al Shamsi, F., & Guessoum, A. (2006). A hidden Markov model-based POS tagger for Arabic. In Proceedings of the 8th International Conference on the Statistical. Besançon, France.
Al-Taani, A. T., & Al-Rub, S. A. (2009). A rule-based approach for tagging non-vocalized Arabic words. International Arab Journal of Information Technology, 6(3), 320–328.
Altabba, M., Al-Zaraee, A., & Shukairy, M. A. (2010). An Arabic morphological analyzer and part-of-speech tagger. Thesis, Faculty of Informatics Engineering, Arab International University, Damascus.
Antony, P. J., & Soman, K. P. (2011). Parts of speech tagging for Indian languages: A literature survey. International Journal of Computer Applications (0975-8887), 34(8), 22–29.
Atiyya, M., Choukri, K., & Yaseen, M. (2005, September 29). NEMLAR Arabic written corpus. Retrieved June 11, 2015, from http://www.rdi-eg.com/Downloads/Lang%20Tech/Nemlar-specifications-resources-WC-V3.0_Final.doc.
Attia, M., Yaseen, M., & Choukri, K. (2005). Specifications of the Arabic Written Corpus produced within the NEMLAR project. http://www.medar.info/The_Nemlar_Project/Publications/WC_design_final.pdf.
Bebah, M. O. A. O., Meziane, A., Mazroui, A., & Lakhouaja, A. (2011). Alkhalil morpho sys. In 7th International computing conference in Arabic.
Boudchiche, M., Mazroui, M., ould Abdallahi Ould Bebah, M., & Lakhouaja, A. (2014). L’analyseur Morphosyntaxique Alkhali Morpho Sys 2. In 1st National Doctoral Day of Engineering Arabic Language.
Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the workshop on speech and natural language (pp. 112–116). Association for Computational Linguistics.
Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 2.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No. LDC2002L49. ISBN 1-58563-324-0.
Chalabi, A. (2004). Sakhr Arabic lexicon. In NEMLAR international conference on Arabic language resources and tools (pp. 21–24).
Darwish, K., Abdelali, A., & Mubarak, H. (2014). Using stem-templates to improve Arabic POS and gender/number tagging. In International conference on language resources and evaluation (LREC-2014).
Diab, M. (2009). Second generation AMIRA tools for Arabic processing: Fast and robust tokenization, POS tagging, and base phrase chunking. In 2nd International conference on Arabic language resources and tools. Cairo, Egypt.
Diab, M., Hacioglu, K., & Jurafsky, D. (2004). Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Association for Computational Linguistics.
El Jihad, A., & Yousfi, A. (2005). Etiquetage morpho-syntaxique des textes arabes par modèle de Markov caché. In Proceedings of Rencontre Des Etudiants Chercheurs En Informatique Pour Le Traitement Automatique Des Langues (pp. 649–654). Dourdan, France.
El-Jihad, A., Yousfi, A., & Si-Lhoussain, A. (2011). Morpho-syntactic tagging system based on the patterns words for Arabic texts. International Arab Journal of Information Technology, 8(4), 350–354.
Ghoul, D. (2011). Outils génériques pour l’étiquetage morphosyntaxique de la langue arabe: segmentation et corpus d’entraînement.
Huang, L., Peng, Y., Wang, H., & Wu, Z. (2002). Statistical part-of-speech tagging for classical Chinese. In Text, speech and dialogue (pp. 115–122). Brno.
Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the student workshop at NAACL (pp. 20–25).
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Nakagawa, T., & Uchimoto, K. (2007). A hybrid approach to word segmentation and POS tagging. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 217–220). Association for Computational Linguistics.
Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition (Corresp.). IEEE Transactions on Information Theory, 21(2), 222–226.
Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8(1), 1–38.
Pasha, A., Al-Badrashiny, M., Diab, M., El Kholy, A., Eskander, R., Habash, N., et al. (2014). A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. Reykjavik: LREC.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (Vol. 12, pp. 44–49). Manchester.
Thibeault, M. (2004). La catégorisation grammaticale automatique: adaptation du catégoriseur de Brill au français et modification de l’approche. Université Laval.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology (Vol. 1, pp. 173–180). Association for Computational Linguistics.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269.
Cite this article
Ababou, N., Mazroui, A. A hybrid Arabic POS tagging for simple and compound morphosyntactic tags. Int J Speech Technol 19, 289–302 (2016). https://doi.org/10.1007/s10772-015-9302-8
DOI: https://doi.org/10.1007/s10772-015-9302-8