DOI: 10.1145/1390749.1390767
A novel Arabic lemmatization algorithm

Published: 24 July 2008

Abstract

Tokenization is a fundamental step in processing textual data that precedes information retrieval, text mining, and natural language processing tasks. It is a language-dependent process that includes normalization, stop-word removal, lemmatization, and stemming.
Stemming and lemmatization share the goal of reducing a word to its base form. However, lemmatization is more robust than stemming because it often draws on a vocabulary and morphological analysis rather than simply removing the word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
The new lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text. We investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison with the leading Arabic stemmers. We conclude that lemmatization is a better word normalization method than stemming for Arabic text.
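
The contrast the abstract draws between affix stripping and vocabulary-based lemmatization can be illustrated with a toy sketch. This is not the authors' algorithm: the stop-word set, the affix lists, the one-entry lexicon, and the function names (`normalize`, `light_stem`, `lemmatize`, `tokenize`) are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, tiny resources for illustration only. The system described in
# the abstract uses a stop-word list of more than 2200 words; its lemmatizer
# is not reproduced here.
STOP_WORDS = {"في", "من", "على"}
LEMMA_LEXICON = {"كتب": "كتاب"}  # broken plural "books" -> singular "book"

PREFIXES = ("ال",)              # definite article
SUFFIXES = ("ون", "ين", "ات")   # common sound-plural endings

# Short vowel marks (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with hamza above, hamza below, or madda.
ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")


def normalize(token):
    """Strip diacritics and tatweel, and unify alef variants to bare alef."""
    token = DIACRITICS.sub("", token)
    return ALEF_VARIANTS.sub("\u0627", token)


def light_stem(token):
    """Remove at most one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def lemmatize(token):
    """Strip affixes, then consult the lexicon for a dictionary form."""
    stem = light_stem(token)
    return LEMMA_LEXICON.get(stem, stem)


def tokenize(text, reducer):
    """Split on whitespace, normalize, drop stop words, reduce each token."""
    tokens = [normalize(t) for t in text.split()]
    return [reducer(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    sample = "المعلمون في الكتب"
    print(tokenize(sample, light_stem))
    print(tokenize(sample, lemmatize))
```

In this sketch both reducers handle the regular plural in the sample phrase by stripping its affixes, but only the lexicon lookup maps the broken plural to its singular form, which affix stripping alone cannot recover.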

Published In

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
July 2008
130 pages
ISBN: 9781605581965
DOI: 10.1145/1390749
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Arabic
  2. lemmatization
  3. stemming
  4. text mining
  5. tokenization

Qualifiers

  • Research-article

Conference

AND '08

Acceptance Rates

Overall Acceptance Rate 15 of 22 submissions, 68%

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 3

Reflects downloads up to 09 Feb 2025

Cited By

  • (2023) Modeling Topics in DFA-Based Lemmatized Gujarati Text. Sensors 23:5 (2708). DOI: 10.3390/s23052708. Online publication date: 1-Mar-2023
  • (2023) Analysis of Cursive Text Recognition Systems: A Systematic Literature Review. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7 (1-30). DOI: 10.1145/3592600. Online publication date: 20-Jul-2023
  • (2022) Systematic Literature Review of Stemming and Lemmatization Performance for Sentence Similarity. 2022 IEEE 7th International Conference on Information Technology and Digital Applications (ICITDA) (1-6). DOI: 10.1109/ICITDA55840.2022.9971451. Online publication date: 4-Nov-2022
  • (2022) ALP: An Arabic Linguistic Pipeline. Analysis and Application of Natural Language and Speech Processing (67-99). DOI: 10.1007/978-3-031-11035-1_4. Online publication date: 4-Aug-2022
  • (2021) An Explainable Artificial Intelligence Model for Detecting Xenophobic Tweets. Applied Sciences 11:22 (10801). DOI: 10.3390/app112210801. Online publication date: 16-Nov-2021
  • (2020) Lemmatization Algorithm Development for Bangla Natural Language Processing. 2020 Joint 9th International Conference on Informatics, Electronics & Vision (ICIEV) and 2020 4th International Conference on Imaging, Vision & Pattern Recognition (icIVPR) (1-8). DOI: 10.1109/ICIEVicIVPR48672.2020.9306652. Online publication date: 26-Aug-2020
  • (2019) Comparing the Effectiveness of the Improved ARLSTem Algorithm with Existing Arabic Light Stemmers. 2019 International Conference on Theoretical and Applicative Aspects of Computer Science (ICTAACS) (1-8). DOI: 10.1109/ICTAACS48474.2019.8988118. Online publication date: Dec-2019
  • (2019) A hybrid approach for Arabic lemmatization. International Journal of Speech Technology 22:3 (563-573). DOI: 10.1007/s10772-018-9528-3. Online publication date: 1-Sep-2019
  • (2019) Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique. Recent Advances in NLP: The Case of Arabic Language (81-100). DOI: 10.1007/978-3-030-34614-0_5. Online publication date: 30-Nov-2019
  • (2018) Term Extraction for a Single & Multi-word Based on Islamic Corpus English. 2018 1st Annual International Conference on Information and Sciences (AiCIS) (107-111). DOI: 10.1109/AiCIS.2018.00031. Online publication date: Nov-2018
