Abstract
In this paper we show that keyword variation in a morphologically complex language, Finnish, can be handled effectively for IR purposes by generating only the textually most frequent forms of each keyword. In theory a Finnish noun has about 2,000 different forms, but most of these forms occur rarely. Corpus statistics show that about 84–88 per cent of the occurrences of inflected noun forms belong to only six of the 14 possible cases. With singular and plural forms of these six cases, a keyword has at most 2 × 6 = 12 variant forms, few enough to try them all in a search. We tested retrieval with three to twelve variant forms per keyword in two test collections, TUTK and the Finnish material of CLEF 2003. The results show that generating the frequent keyword forms competes well with the gold standard, lemmatization, when nine or twelve variant forms are used.
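The idea of expanding each keyword into a small, fixed set of frequent forms can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the six cases, so the ones used here (nominative, genitive, partitive, inessive, elative, illative) are an assumption. The forms of the example noun "talo" are supplied by hand rather than generated, and the names `FREQUENT_FORMS`, `variant_forms` and `build_query`, as well as the `#syn(...)` grouping syntax, are hypothetical.

```python
from typing import Dict, List

# Hand-supplied forms of one example noun, "talo" (house).
# ASSUMPTION: the six cases below are chosen for illustration only;
# the abstract does not say which six cases were used.
# Order: nominative, genitive, partitive, inessive, elative, illative.
FREQUENT_FORMS: Dict[str, List[str]] = {
    "talo": [
        "talo", "talon", "taloa", "talossa", "talosta", "taloon",          # singular
        "talot", "talojen", "taloja", "taloissa", "taloista", "taloihin",  # plural
    ],
}


def variant_forms(keyword: str,
                  table: Dict[str, List[str]] = FREQUENT_FORMS,
                  max_forms: int = 12) -> List[str]:
    """Return up to max_forms distinct variant forms of keyword.

    A keyword missing from the table falls back to itself only.
    """
    forms = table.get(keyword, [keyword])
    unique: List[str] = []
    for form in forms:
        if form not in unique:  # drop duplicates, keep the original order
            unique.append(form)
    return unique[:max_forms]


def build_query(keywords: List[str],
                table: Dict[str, List[str]] = FREQUENT_FORMS,
                max_forms: int = 12) -> str:
    """Group the variants of each keyword into one synonym set.

    The "#syn(...)" wrapper is a hypothetical query syntax.
    """
    groups = []
    for keyword in keywords:
        forms = variant_forms(keyword, table, max_forms)
        groups.append("#syn(" + " ".join(forms) + ")")
    return " ".join(groups)


if __name__ == "__main__":
    print(build_query(["talo"], max_forms=3))
    print(build_query(["talo"]))
```

Varying `max_forms` between 3 and 12 mirrors the range of variant-form counts mentioned in the abstract. A real system would need a morphological generator in place of the hand-written table.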
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kettunen, K., Airio, E. (2006). Is a Morphologically Complex Language Really that Complex in Full-Text Retrieval?. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_42
DOI: https://doi.org/10.1007/11816508_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science (R0)