Similarity of Objects and the Meaning of Words

Rudi Cilibrasi¹⁹ &
Paul Vitanyi¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3959))

Included in the following conference series:

International Conference on Theory and Applications of Models of Computation

1161 Accesses
16 Citations

Abstract

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodyments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.

This work supported in part by the EU sixth framework project RESQ, IST–1999–11234, the NoE QUIPROCONE IST–1999–29064, the ESF QiT Programmme, and the EU NoE PASCAL, and by the Netherlands Organization for Scientific Research (NWO) under Grant 612.052.004.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Dimensions of Semantic Similarity

Web Similarity in Sets of Search Terms Using Database Queries

Article 12 May 2020

Distance-Based Index Structures for Fast Similarity Search

Article 27 July 2017

References

Bagrow, J.P., ben-Avraham, D.: On the Google-fame of scientists and other populations. In: AIP Conference Proceedings, vol. 779(1), pp. 81–89 (2005)
Google Scholar
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Review Lett. 88(4), 48702 (2002)
Article Google Scholar
Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.: Information Distance. IEEE Trans. Information Theory 44(4), 1407–1423 (1998); Conference version: Thermodynamics of Computation and Information Distance, In: Proc. 25th ACM Symp. Theory of Comput. pp. 21–30 (1993)
Article MATH Google Scholar
Bennett, C.H., Li, M., Ma, B.: Chain letters and evolutionary histories. Scientific American, 76–81 (June 2003)
Google Scholar
Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
Article Google Scholar
Chen, X., Francia, B., Li, M., McKinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Trans. Inform. Th. 50(7), 1545–1551 (2004)
Article MathSciNet Google Scholar
Cilibrasi, R.: The CompLearn Toolkit, CWI, (2003), http://www.complearn.org/
Cimiano, P., Staab, S.: Learning by Googling. SIGKDD Explorations 6(2), 24–33 (2004)
Article Google Scholar
Chai, W., Vercoe, B.: Folk music classification using hidden Markov models. In: Proc. of International Conference on Artificial Intelligence (2001)
Google Scholar
Cilibrasi, R., Vitanyi, P.: Automatic Meaning Discovery Using Google: 100 Experiments in LearningWordNet Categories (2004), http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf
Cilibrasi, R., de Wolf, R., Vitanyi, P.: Algorithmic clustering of music based on string compression. Computer Music J. 28(4), 49–67 (2004), Web version http://xxx.lanl.gov/abs/cs.SD/0303025
Google Scholar
Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Information Theory 51(4), 1523–1545 (2005), Web version http://xxx.lanl.gov/abs/cs.CV/0312044
Google Scholar
Cilibrasi, R., Vitanyi, P.: Automatic meaning discovery using Google, Manuscript, CWI (2004), http://arxiv.org/abs/cs.CL/0412098
Cilibrasi, R., Vitanyi, P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering. In: EUPASCAL Statistics and Optimization of Clustering Workshop, London, UK, July 5-6(2005), http://homepages.cwi.nl/paulv/papers/quartet.pdf
Dannenberg, R., Thom, B., Watson, D.: A machine learning approach to musical style recognition. In: Proc. International Computer Music Conference, pp. 344–347 (1997)
Google Scholar
Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley and Sons, Chichester (2001)
MATH Google Scholar
The basics of Google search, http://www.google.com/help/basics.html
Grimaldi, M., Kokaram, A., Cunningham, P.: Classifying music by genre using the wavelet packet transform and a round-robin ensemble. Technical report TCD-CS-2002-64, Trinity College Dublin (2002), http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCDCS-2002-64.pdf
Keogh, E., Lonardi, S., Rtanamahatana, C.A.: Toward parameter-free data mining. In: Proc. 10th ACM SIGKDD Intn’l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, pp. 206–215 (2004)
Google Scholar
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems Inform. Transmission 1(1), 1–7 (1965)
MathSciNet Google Scholar
Kolmogorov, A.N.: Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys 38(4), 29–40 (1983)
Article MATH Google Scholar
Landauer, T., Dumais, S.: A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychol. Rev. 104, 211–240 (1997)
Article Google Scholar
Lenat, D.B.: Cyc: A large-scale investment in knowledge infrastructure. Comm. ACM 38(11), 33–38 (1995)
Article Google Scholar
Lesk, M.E.: Word-word associations in document retrieval systems. American Documentation 20(1), 27–38 (1969)
Article Google Scholar
Li, M., Vitányi, P.M.B.: Theory of thermodynamics of computation. In: Proc. IEEE Physics of Computation Workshop, Dallas (Texas), October 4-6, pp. 42–46 (1992), A full version (basically the here relevant part of [26]) appeared in the Preliminary Proceedings handed out at the Workshop
Google Scholar
Li, M., Vitányi, P.M.B.: Reversibility and adiabatic computation: trading time and space for energy. Proc. Royal Society of London, Series A 452, 769–789 (1996)
Article MATH Google Scholar
Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and its Applications, 2nd edn. Springer, New York (1997)
MATH Google Scholar
Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences based on approximate matching. In: Proc. 10th Workshop on Genome Informatics (GIW), Tokyo, December 14-15. Genome Informatics Series, vol. 10 (1999); Also in Proc. 4th ACM RECOMB, p. 107 (2000)
Google Scholar
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)
Article Google Scholar
Li, M., Vitányi, P.M.B.: Algorithmic Complexity. In: Smelser, N.J., Baltes, P.B. (eds.) International Encyclopedia of the Social & Behavioral Sciences, pp. 376–382. Pergamon, Oxford (2001/2002)
Google Scholar
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Information Theory 50(12), 3250–3264 (2004); Conference version in: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, Baltimore, USA, pp 863–872 (2003) Web version: http://xxx.lanl.gov/abs/cs.CC/0111054
Article MathSciNet Google Scholar
Li, M., Vitanyi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)
MATH Google Scholar
Reed, S.L., Lenat, D.B.: Mapping ontologies into cyc. In: Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web, Edmonton, Canada, http://citeseer.nj.nec.com/509238.html
Scott, P.: Music classification using neural networks (2001), http://www.stanford.edu/class/ee373a/musicclassification.pdf
Strimmer, K., von Haeseler, A.: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13, 964–969 (1996)
Google Scholar
Miller, G.A., et al.: WordNet, A Lexical Database for the English Language, Cognitive Science Lab. Princeton University, http://www.cogsci.princeton.edu/~wn
Terra, E., Clarke, C.L.A.: Frequency Estimates for Statistical Word Similarity Measures. In: HLT/NAACL 2003, Edmonton, Alberta, 37/162 (May 2003)
Google Scholar
Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for associating patterns. In: Proc. ACM-SIGKDD Conf. Knowledge Discovery and Data Mining, pp. 491–502 (2002)
Google Scholar
Tzanetakis, G., Cook, P.: Music genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293–302 (2002)
Article Google Scholar
Wehner, S.: Analyzing network traffic and worms using compression, http://arxiv.org/abs/cs.CR/0504045
Corpus collosal: How well does the world wide web represent human language? The Economist, January 20 (2005), http://www.economist.com/science/displayStory.cfm?story-id=3576374

Download references

Author information

Authors and Affiliations

CWI, Kruislaan 413, 1098 SJ, Amsterdam, The Netherlands
Rudi Cilibrasi & Paul Vitanyi

Authors

Rudi Cilibrasi
View author publications
You can also search for this author in PubMed Google Scholar
Paul Vitanyi
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Computer Sciences Department, University of Wisconsin, 53706, Madison, WI, USA
Jin-Yi Cai
School of Mathematics, University of Leeds, LS2 9JT, Leeds, U.K.
S. Barry Cooper
State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences,
Angsheng Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Cilibrasi, R., Vitanyi, P. (2006). Similarity of Objects and the Meaning of Words. In: Cai, JY., Cooper, S.B., Li, A. (eds) Theory and Applications of Models of Computation. TAMC 2006. Lecture Notes in Computer Science, vol 3959. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11750321_2

Download citation

DOI: https://doi.org/10.1007/11750321_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34021-8
Online ISBN: 978-3-540-34022-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Similarity of Objects and the Meaning of Words

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Dimensions of Semantic Similarity

Web Similarity in Sets of Search Terms Using Database Queries

Distance-Based Index Structures for Fast Similarity Search

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Similarity of Objects and the Meaning of Words

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Dimensions of Semantic Similarity

Web Similarity in Sets of Search Terms Using Database Queries

Distance-Based Index Structures for Fast Similarity Search

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation