Abstract
We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodyments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
This work supported in part by the EU sixth framework project RESQ, IST–1999–11234, the NoE QUIPROCONE IST–1999–29064, the ESF QiT Programmme, and the EU NoE PASCAL, and by the Netherlands Organization for Scientific Research (NWO) under Grant 612.052.004.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Bagrow, J.P., ben-Avraham, D.: On the Google-fame of scientists and other populations. In: AIP Conference Proceedings, vol. 779(1), pp. 81–89 (2005)
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Review Lett. 88(4), 48702 (2002)
Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.: Information Distance. IEEE Trans. Information Theory 44(4), 1407–1423 (1998); Conference version: Thermodynamics of Computation and Information Distance, In: Proc. 25th ACM Symp. Theory of Comput. pp. 21–30 (1993)
Bennett, C.H., Li, M., Ma, B.: Chain letters and evolutionary histories. Scientific American, 76–81 (June 2003)
Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
Chen, X., Francia, B., Li, M., McKinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Trans. Inform. Th. 50(7), 1545–1551 (2004)
Cilibrasi, R.: The CompLearn Toolkit, CWI, (2003), http://www.complearn.org/
Cimiano, P., Staab, S.: Learning by Googling. SIGKDD Explorations 6(2), 24–33 (2004)
Chai, W., Vercoe, B.: Folk music classification using hidden Markov models. In: Proc. of International Conference on Artificial Intelligence (2001)
Cilibrasi, R., Vitanyi, P.: Automatic Meaning Discovery Using Google: 100 Experiments in LearningWordNet Categories (2004), http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf
Cilibrasi, R., de Wolf, R., Vitanyi, P.: Algorithmic clustering of music based on string compression. Computer Music J. 28(4), 49–67 (2004), Web version http://xxx.lanl.gov/abs/cs.SD/0303025
Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Information Theory 51(4), 1523–1545 (2005), Web version http://xxx.lanl.gov/abs/cs.CV/0312044
Cilibrasi, R., Vitanyi, P.: Automatic meaning discovery using Google, Manuscript, CWI (2004), http://arxiv.org/abs/cs.CL/0412098
Cilibrasi, R., Vitanyi, P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering. In: EUPASCAL Statistics and Optimization of Clustering Workshop, London, UK, July 5-6(2005), http://homepages.cwi.nl/paulv/papers/quartet.pdf
Dannenberg, R., Thom, B., Watson, D.: A machine learning approach to musical style recognition. In: Proc. International Computer Music Conference, pp. 344–347 (1997)
Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley and Sons, Chichester (2001)
The basics of Google search, http://www.google.com/help/basics.html
Grimaldi, M., Kokaram, A., Cunningham, P.: Classifying music by genre using the wavelet packet transform and a round-robin ensemble. Technical report TCD-CS-2002-64, Trinity College Dublin (2002), http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCDCS-2002-64.pdf
Keogh, E., Lonardi, S., Rtanamahatana, C.A.: Toward parameter-free data mining. In: Proc. 10th ACM SIGKDD Intn’l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, pp. 206–215 (2004)
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems Inform. Transmission 1(1), 1–7 (1965)
Kolmogorov, A.N.: Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys 38(4), 29–40 (1983)
Landauer, T., Dumais, S.: A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychol. Rev. 104, 211–240 (1997)
Lenat, D.B.: Cyc: A large-scale investment in knowledge infrastructure. Comm. ACM 38(11), 33–38 (1995)
Lesk, M.E.: Word-word associations in document retrieval systems. American Documentation 20(1), 27–38 (1969)
Li, M., Vitányi, P.M.B.: Theory of thermodynamics of computation. In: Proc. IEEE Physics of Computation Workshop, Dallas (Texas), October 4-6, pp. 42–46 (1992), A full version (basically the here relevant part of [26]) appeared in the Preliminary Proceedings handed out at the Workshop
Li, M., Vitányi, P.M.B.: Reversibility and adiabatic computation: trading time and space for energy. Proc. Royal Society of London, Series A 452, 769–789 (1996)
Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and its Applications, 2nd edn. Springer, New York (1997)
Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences based on approximate matching. In: Proc. 10th Workshop on Genome Informatics (GIW), Tokyo, December 14-15. Genome Informatics Series, vol. 10 (1999); Also in Proc. 4th ACM RECOMB, p. 107 (2000)
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)
Li, M., Vitányi, P.M.B.: Algorithmic Complexity. In: Smelser, N.J., Baltes, P.B. (eds.) International Encyclopedia of the Social & Behavioral Sciences, pp. 376–382. Pergamon, Oxford (2001/2002)
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Information Theory 50(12), 3250–3264 (2004); Conference version in: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, Baltimore, USA, pp 863–872 (2003) Web version: http://xxx.lanl.gov/abs/cs.CC/0111054
Li, M., Vitanyi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)
Reed, S.L., Lenat, D.B.: Mapping ontologies into cyc. In: Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web, Edmonton, Canada, http://citeseer.nj.nec.com/509238.html
Scott, P.: Music classification using neural networks (2001), http://www.stanford.edu/class/ee373a/musicclassification.pdf
Strimmer, K., von Haeseler, A.: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13, 964–969 (1996)
Miller, G.A., et al.: WordNet, A Lexical Database for the English Language, Cognitive Science Lab. Princeton University, http://www.cogsci.princeton.edu/~wn
Terra, E., Clarke, C.L.A.: Frequency Estimates for Statistical Word Similarity Measures. In: HLT/NAACL 2003, Edmonton, Alberta, 37/162 (May 2003)
Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for associating patterns. In: Proc. ACM-SIGKDD Conf. Knowledge Discovery and Data Mining, pp. 491–502 (2002)
Tzanetakis, G., Cook, P.: Music genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293–302 (2002)
Wehner, S.: Analyzing network traffic and worms using compression, http://arxiv.org/abs/cs.CR/0504045
Corpus collosal: How well does the world wide web represent human language? The Economist, January 20 (2005), http://www.economist.com/science/displayStory.cfm?story-id=3576374
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cilibrasi, R., Vitanyi, P. (2006). Similarity of Objects and the Meaning of Words. In: Cai, JY., Cooper, S.B., Li, A. (eds) Theory and Applications of Models of Computation. TAMC 2006. Lecture Notes in Computer Science, vol 3959. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11750321_2
Download citation
DOI: https://doi.org/10.1007/11750321_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34021-8
Online ISBN: 978-3-540-34022-5
eBook Packages: Computer ScienceComputer Science (R0)