Abstract
Computational Morphology is an urgent problem for Arabic Natural Language Processing, because Arabic is a highly inflected language. We have found, however, that a full solution to this problem is not required for effective information retrieval. Light stemming allows remarkably good information retrieval without providing correct morphological analyses. We developed several light stemmers for Arabic, and assessed their effectiveness for information retrieval using standard TREC data. We have also compared light stemming with several stemmers based on morphological analysis. The light stemmer, light10, outperformed the other approaches. It has been included in the Lemur toolkit, and is becoming widely used for Arabic information retrieval.
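To make the idea of light stemming concrete, the following is a minimal sketch of affix stripping without any root or pattern analysis. It is not the light10 definition from the chapter: the affix lists, the minimum stem length, and the function name `light_stem` are all assumptions chosen for illustration.

```python
# Illustrative light stemmer: strips a few frequent Arabic affixes and makes
# no attempt at a correct morphological analysis.
# ASSUMPTION: the affix lists and MIN_STEM below are hypothetical choices for
# this sketch, not the published light10 rules.

PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")  # definite-article variants
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM = 3  # never strip an affix if fewer than this many letters would remain


def light_stem(word: str) -> str:
    """Return the word with at most one prefix and any trailing suffixes removed."""
    # Strip at most one prefix, trying longer prefixes first.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break

    # Strip suffixes repeatedly until none applies, longer suffixes first.
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


if __name__ == "__main__":
    # Two inflected forms of the same noun, one with the definite article.
    for form in ("المعلمون", "معلمين"):
        print(form, light_stem(form))
```

In this sketch the two forms in the usage example reduce to the same index term, which is the conflation effect that matters for retrieval, even though the output is not guaranteed to be a linguistically valid stem or root.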
Copyright information
© 2007 Springer
About this chapter
Cite this chapter
Larkey, L.S., Ballesteros, L., Connell, M.E. (2007). Light Stemming for Arabic Information Retrieval. In: Soudi, A., Bosch, A.v., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_12
DOI: https://doi.org/10.1007/978-1-4020-6046-5_12
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-6045-8
Online ISBN: 978-1-4020-6046-5
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)