DOI: 10.1145/1571941.1571957
research-article

Addressing morphological variation in alphabetic languages

Published: 19 July 2009 Publication History

Abstract

The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Stemming algorithms are often used to normalize surface forms, and thereby address the problem of not finding documents that contain words related to query terms through inflectional or derivational morphology. However, rule-based stemmers are not available for every language, and it is unclear which methods for coping with morphology are most effective. In this paper we investigate an assortment of techniques for representing text and compare these approaches using data sets in eighteen languages and five different writing systems.
We find character n-gram tokenization to be highly effective. In half of the languages examined, n-grams outperform unnormalized words by more than 25%; in highly inflective languages, relative improvements of over 50% are obtained. In languages with less morphological richness, the choice of tokenization is not as critical, and rule-based stemming can be an attractive option, if available. We also conducted an experiment to uncover the source of n-gram power, which demonstrated a causal relationship between the morphological complexity of a language and n-gram effectiveness.
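
To make the idea of character n-gram tokenization concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the function names `char_ngrams` and `shared_terms`, the choice of n = 4, the lowercasing, and the decision to keep n-grams inside word boundaries are all assumptions made for this example. It shows how two morphological variants that do not match as unnormalized words can still share indexing terms.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams for each whitespace-separated word.

    Assumption for this sketch: n-grams do not span word boundaries, and
    words no longer than n characters are kept whole as a single term.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            # Slide a window of width n across the word, one character at a time.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def shared_terms(a, b, n=4):
    """Return the set of n-gram indexing terms that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


if __name__ == "__main__":
    # As unnormalized words, "juggling" and "juggler" are different terms,
    # but their 4-gram representations overlap.
    print(char_ngrams("juggling"))
    print(char_ngrams("juggler"))
    print(shared_terms("juggling", "juggler"))
```

In this sketch the overlap comes from the shared stem, so no language-specific stemming rules are needed; a production system would also have to settle on n, punctuation handling, and whether n-grams may cross word boundaries.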

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CLIR
  2. character n-grams
  3. morphology
  4. stemming
  5. tokenization

Qualifiers

  • Research-article

Conference

SIGIR '09
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 33
  • Downloads (last 6 weeks): 19
Reflects downloads up to 14 Nov 2024

Cited By

  • (2023) An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems. IEEE Access 11 (133681-133702). DOI: 10.1109/ACCESS.2023.3332710. Online publication date: 2023
  • (2021) Morphological variations of languages in selected towns of the Fifth District of Leyte. International Journal of Research Studies in Education 10:13. DOI: 10.5861/ijrse.2021.a058. Online publication date: 23-Aug-2021
  • (2019) The Challenges of Language Variation in Information Access. Information Retrieval Evaluation in a Changing World (201-216). DOI: 10.1007/978-3-030-22948-1_8. Online publication date: 14-Aug-2019
  • (2018) Deep Learning of Inflection and the Cell-Filling Problem. Italian Journal of Computational Linguistics 4:1 (57-75). DOI: 10.4000/ijcol.540. Online publication date: 1-Jun-2018
  • (2018) Addressing The Privacy Paradox through Personalized Privacy Notifications. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2:2 (1-25). DOI: 10.1145/3214271. Online publication date: 5-Jul-2018
  • (2017) An Efficient Corpus-Based Stemmer. Cognitive Computation 9:5 (671-688). DOI: 10.1007/s12559-017-9479-z. Online publication date: 7-Jun-2017
  • (2016) Information retrieval from historical newspaper collections in highly inflectional languages. Journal of the Association for Information Science and Technology 67:12 (2928-2946). DOI: 10.1002/asi.23379. Online publication date: 1-Dec-2016
  • (2015) Data Fusion for Japanese Term and Character N-gram Search. Proceedings of the 20th Australasian Document Computing Symposium (1-4). DOI: 10.1145/2838931.2838939. Online publication date: 8-Dec-2015
  • (2014) Can Type-Token Ratio be Used to Show Morphological Complexity of Languages? Journal of Quantitative Linguistics 21:3 (223-245). DOI: 10.1080/09296174.2014.911506. Online publication date: 17-Jun-2014
  • (2013) Bengali (Bangla) Information Retrieval. Technical Challenges and Design Issues in Bangla Language Processing (273-301). DOI: 10.4018/978-1-4666-3970-6.ch012. Online publication date: 2013
