research-article

Addressing morphological variation in alphabetic languages

Authors:

Charles Nicholas,

James Mayfield

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 75 - 82

https://doi.org/10.1145/1571941.1571957

Published: 19 July 2009

Abstract

The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Stemming algorithms are often used to normalize surface forms, and thereby address the problem of not finding documents that contain words related to query terms through inflectional or derivational morphology. However, rule-based stemmers are not available for every language, and it is unclear which methods for coping with morphology are most effective. In this paper we investigate an assortment of techniques for representing text and compare these approaches using data sets in eighteen languages and five different writing systems.

We find character n-gram tokenization to be highly effective. In half of the languages examined, n-grams outperform unnormalized words by more than 25%; in highly inflective languages, relative improvements over 50% are obtained. In languages with less morphological richness the choice of tokenization is not as critical, and rule-based stemming can be an attractive option, if available. We also conducted an experiment to uncover the source of n-gram power, and a causal relationship between the morphological complexity of a language and n-gram effectiveness was demonstrated.
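
As a rough illustration of character n-gram tokenization, the sketch below splits text into overlapping character sequences so that morphologically related word forms share indexing terms. The function names, the default length n=4, the lowercasing, the space padding, and the choice to let n-grams span word boundaries are all assumptions made for this example; the abstract does not specify the configuration used in the paper.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of the whitespace-normalized text.

    Assumptions for illustration only: text is lowercased, runs of
    whitespace collapse to one space, and a single space pads each end so
    that word-initial and word-final n-grams are distinguishable.
    """
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    padded = f" {normalized} "
    if len(padded) <= n:
        # Text shorter than n: index the padded text as a single term.
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_ngrams(a, b, n=4):
    """Return the set of n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


if __name__ == "__main__":
    print(char_ngrams("cats"))
    print(sorted(shared_ngrams("juggling", "juggler")))
```

Under these assumptions, "juggling" and "juggler" do not match as unnormalized words, yet they share the 4-grams " jug", "jugg", and "uggl", which gives a query containing one form a chance to match documents containing the other without any language-specific stemmer.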

Index Terms

Addressing morphological variation in alphabetic languages
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
    2. Search engine architectures and scalability
      1. Search engine indexing

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN:9781605584836

DOI:10.1145/1571941

General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)

Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '09

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
