Abstract
Word co-occurrence observations and similarity computations are often used as a straightforward way to represent the global contexts of words and to simulate semantic word similarity for applications such as word or document clustering and collocation extraction. Despite the simplicity of the underlying model, a suitable significance measure, similarity measure and similarity computation algorithm must be selected. However, it is often unclear how the measures are related, and dimensionality reduction is often applied to make the computation of word similarity efficient. This work presents an approximate algorithm with linear time complexity for computing word similarity without any dimensionality reduction. It then introduces a large-scale evaluation based on two languages and two knowledge sources and discusses the underlying reasons for the relative performance of each measure.
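To make the pipeline named in the abstract concrete, the following is a minimal sketch of its three ingredients: sentence-level co-occurrence counting, a significance measure, and a similarity measure over the resulting context vectors. It is an illustration under stated assumptions, not the paper's method: pointwise mutual information and cosine similarity are chosen here only as familiar examples, the function names are hypothetical, and the exhaustive pairwise comparison shown is not the linear-time approximate algorithm the paper presents.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, per word and per unordered word pair, the sentences they occur in."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        types = sorted(set(sentence))  # each word counted once per sentence
        word_freq.update(types)
        pair_freq.update(combinations(types, 2))
    return word_freq, pair_freq


def pmi_vectors(sentences):
    """Build one sparse context vector per word, weighted by positive PMI.

    PMI is an assumption made for this sketch; any other significance
    measure could be substituted at this point.
    """
    word_freq, pair_freq = cooccurrence_counts(sentences)
    n = len(sentences)
    vectors = {word: {} for word in word_freq}
    for (a, b), n_ab in pair_freq.items():
        pmi = math.log2(n_ab * n / (word_freq[a] * word_freq[b]))
        if pmi > 0:  # keep only pairs that co-occur more often than chance
            vectors[a][b] = pmi
            vectors[b][a] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    if not u or not v:
        return 0.0
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v)
```

Calling `pmi_vectors` on a list of tokenised sentences yields a dictionary of context vectors, and `cosine` can then be applied to any two of them. Words that never co-occur directly can still receive a non-zero similarity when they share significant co-occurring words, which is the sense in which co-occurrence vectors stand in for global context.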
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bordag, S. (2008). A Comparison of Co-occurrence and Similarity Measures as Simulations of Context. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_5
DOI: https://doi.org/10.1007/978-3-540-78135-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78134-9
Online ISBN: 978-3-540-78135-6
eBook Packages: Computer Science (R0)