Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging

Published: 01 March 2011

Abstract

Research on morphological disambiguation of Arabic has noted that techniques developed for lexical disambiguation in English do not transfer easily, since affixation in Arabic yields a tag set very different from that of English, one that encodes both inflectional morphology and more complex tokenization sequences. This work takes a new approach to the problem, based on a distinction between open-class and closed-class categories of tokens, which differ both in their frequencies and in their possible morphological affixations. This separation simplifies the morphological analysis problem considerably, making it possible to use a Conditional Random Field model for joint tokenization and “core” part-of-speech tagging of the open-class items, while the closed-class items are handled by regular expressions. The work is therefore situated between data-driven approaches and those that use a morphological analyzer. For the tasks of tokenization and core part-of-speech tagging, the resulting system outperforms, on the given test set, a system that incorporates a morphological analyzer. We also evaluate how these differences affect parser performance when the tagger output is used as parser input.
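
The open-class/closed-class split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-preposition inventory, the optional conjunction proclitic, the tag names, and the function names are all assumptions chosen for illustration, with words written in Buckwalter transliteration. A placeholder function stands in for the Conditional Random Field model that would handle open-class words.

```python
import re

# Illustrative closed-class pattern (an assumption, not the paper's actual list):
# an optional conjunction proclitic (w "and", f "so") followed by one of a few
# prepositions (fy "in", mn "from", ElY "on"), in Buckwalter transliteration.
CLOSED_CLASS_RE = re.compile(r"(?P<conj>[wf])?(?P<prep>fy|mn|ElY)")


def analyze_closed_class(word):
    """Return a list of (token, tag) pairs if the whole word matches the
    closed-class pattern, otherwise None."""
    match = CLOSED_CLASS_RE.fullmatch(word)
    if match is None:
        return None
    tokens = []
    if match.group("conj"):
        tokens.append((match.group("conj"), "CONJ"))
    tokens.append((match.group("prep"), "PREP"))
    return tokens


def placeholder_open_class_tagger(word):
    """Hypothetical stand-in for a trained CRF: leaves the word unsegmented
    and marks it with a dummy OPEN tag."""
    return [(word, "OPEN")]


def tag_words(words, open_class_tagger=placeholder_open_class_tagger):
    """Route each word to the regex analysis when it matches a closed-class
    pattern, and to the open-class tagger otherwise."""
    results = []
    for word in words:
        analysis = analyze_closed_class(word)
        if analysis is None:
            analysis = open_class_tagger(word)
        results.append(analysis)
    return results


if __name__ == "__main__":
    for word, analysis in zip(["wfy", "fy", "ktAb"], tag_words(["wfy", "fy", "ktAb"])):
        print(word, analysis)
```

In this sketch the regular expression both recognizes a closed-class word and fixes its tokenization, so only the remaining words would need a learned model. A real system would need a far larger closed-class inventory and a trained sequence model in place of the placeholder.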

      Published In

      ACM Transactions on Asian Language Information Processing  Volume 10, Issue 1
      March 2011
      88 pages
      ISSN:1530-0226
      EISSN:1558-3430
      DOI:10.1145/1929908

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 March 2011
      Accepted: 01 November 2010
      Revised: 01 August 2010
      Received: 01 June 2010
      Published in TALIP Volume 10, Issue 1

      Author Tags

      1. Arabic
      2. morphological analysis

      Qualifiers

      • Research-article
      • Research
      • Refereed
