Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

The Google Similarity Distance

Published: 01 March 2007 Publication History

Abstract

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society” is "database,” and the equivalent of "use” is "a way to search the database.” We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in an a mean agreement of 87 percent with the expert crafted WordNet categories.

References

[1]
J.P. Bagrow and D. ben-Avraham, “On the Google-Fame of Scientists and Other Populations,” Proc. Am. Inst. of Physics Conf., vol. 779, no. 1, pp. 81-89, 2005.
[2]
C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W. Zurek, “Information Distance,” IEEE Trans. Information Theory, vol. 44, no. 4, pp. 1407-1423, 1998.
[3]
C.H. Bennett, M. Li, and B. Ma, “Chain Letters and Evolutionary Histories,” Scientific Am., pp. 76-81, June 2003.
[4]
C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[5]
Automatic Meaning Discovery Using Google: 100 Experiments in Learning WordNet Categories, 2004, http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf.
[6]
R. Cilibrasi, Complearn Home, http://www.complearn.org/, 2005.
[7]
R. Cilibrasi, R. de Wolf, and P. Vitanyi, “Algorithmic Clustering of Music Based on String Compression,” Computer Music J., vol. 28, no. 4, pp. 49-67, 2004.
[8]
R. Cilibrasi and P. Vitanyi, “Clustering by Compression,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1523-1545, 2005.
[9]
R. Cilibrasi and P. Vitanyi, “Automatic Meaning Discovery Using Google,” http://xxx.lanl.gov/abs/cs.CL/0412098, 2004.
[10]
R. Cilibrasi and P. Vitanyi, “A New Quartet Tree Heuristic for Hierarchical Clustering,” http://arxiv.org/abs/cs.DS/0606048, 2006.
[11]
P. Cimiano and S. Staab, “Learning by Googling,” SIGKDD Explorations, vol. 6, no. 2, pp. 24-33, 2004.
[12]
T.M. Cover and J.A. Thomas, Elements of Information Theory. Wiley, 1991.
[13]
J.-P. Delahaye, “Classer Musiques, Langues, Images, Textes et Genomes,” Pour La Science, vol. 317, pp. 98-103, Mar. 2004.
[14]
“The Basics of Google Search,” http://www.google.com/help/basics.html, 2005.
[15]
L.G. Kraft, “A Device for Quantizing, Grouping and Coding Amplitude Modulated Pulses,” master's thesis, Dept. of Electrical Eng., Massachussetts Inst. of Technology, Cambridge, Mass., 1949.
[16]
D. Graham-Rowe, “A Search for Meaning,” New Scientist, p. 21, 29Jan. 2005.
[17]
Slashdot, 29 Jan. 2005, http://science.slashdot.org/article.pl? sid=05/01/29/1815242&tid=217&tid=14.
[18]
C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2004.
[19]
P. Cimiano and S. Staab, “Learning by Googling,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 2, pp. 24-33, Dec. 2004.
[20]
H. Muir, “Software to Unzip Identity of Unknown Composers,” New Scientist, Apr. 2003.
[21]
K. Patch, “Software Sorts Tunes,” Technology Research News, 23/30Apr. 2003.
[22]
D.B. Lenat, “CYC: A Large-Scale Investment in Knowledge Infrastructure,” Comm. ACM, vol. 38, no. 11, pp. 33-38, 1995.
[23]
F. Keller and M. Lapata, “Using the Web to Obtain Frequencies for Unseen Bigrams,” Computational Linguistics, vol. 29, no. 3, pp. 459-484, 2003.
[24]
A.N. Kolmogorov, “Three Approaches to the Quantitative Definition of Information,” Problems Inform. Transmission, vol. 1, no. 1, pp. 1-7, 1965.
[25]
M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, “An Information-Based Sequence Distance and Its Application to Whole Mitochondrial Genome Phylogeny,” Bioinformatics, vol. 17, no. 2, pp. 149-154, 2001.
[26]
M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, “The Similarity Metric,” IEEE Trans. Information Theory, vol. 50, no. 12, pp. 3250-3264, 2004.
[27]
M. Li and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, second ed. Springer-Verlag, 1997.
[28]
M. Li and P.M.B. Vitányi, “Algorithmic Complexity,” Int'l Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, eds., pp. 376-382, Oxford, 2001/2002.
[29]
M. Li and P.M.B. Vitányi, “Reversibility and Adiabatic Computation: Trading Time and Space for Energy,” Proc. Royal Soc. of London, Series A, vol. 452, pp. 769-789, 1996.
[30]
S.L. Reed and D.B. Lenat, “Mapping Ontologies into CYC,” Proc. 18th Nat'l Conf. Artificial Intelligence Workshop Ontologies for the Semantic Web (AAAI '02), http://citeseer.nj.nec.com/509238.html, 2002.
[31]
D.H. Rumsfeld, The Digital Revolution, 9 June 2001, available in H. Seely, The Poetry of D.H. Rumsfeld, 2003, http://slate.msn.com/id/2081042/.
[32]
C.E. Shannon, “A Mathematical Theory of Communication,” Bell Systems Technical J., vol. 27, pp. 379-423, 623-656, 1948.
[33]
G.A. Miller et al., “WordNet, A Lexical Database for the English Language,” Cognitive Science Lab, Princeton Univ., http://www.cogsci.princeton.edu/~wn, 2006.
[34]
E. Terra and C.L.A. Clarke, “Frequency Estimates for Statistical Word Similarity Measures,” Proc. Human Language Technology Conf. (HLT/NAACL '03), pp. 165-172, May 2003.
[35]
M.E. Lesk, “Word-Word Associations in Document Retrieval Systems,” Am. Documentation, vol. 20, no. 1, pp. 27-38, 1969.
[36]
P.-N. Tan, V. Kumar, and J. Srivastava, “Selecting the Right Interestingness Measure for Associating Patterns,” Proc. ACM-SIGKDD Conf. Knowledge Discovery and Data Mining, pp. 491-502, 2002.
[37]
T. Landauer and S. Dumais, “A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge,” Psychology Rev., vol. 104, pp. 211-240, 1997.
[38]
“Corpus Collosal: How Well Does the World Wide Web Represent Human Language?” The Economist, 20 Jan. 2005, http://www. economist.com/science/displayStory.cfm?story_id=3576374.
[39]
E. Keogh, S. Lonardi, and C.A. Rtanamahatana, “Toward Parameter-Free Data Mining,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 206-215, 2004.
[40]
C. Costa Santos, J. Bernardes, P.M.B. Vitanyi, L. Antunes, “Clustering Fetal Heart Rate Tracings by Compression,” Proc. 19th IEEE Symp. Computer-Based Medical Systems, pp. 685-690, 2006.

Cited By

View all
  • (2024)Image Similarity Using an Ensemble of Context-Sensitive ModelsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3672004(1758-1769)Online publication date: 25-Aug-2024
  • (2023)When Automatic Filtering Comes to the Rescue: Pre-Computing Company Competitor Pairs in OwlerProceedings of the ACM on Management of Data10.1145/35897871:2(1-23)Online publication date: 20-Jun-2023
  • (2023)Neurofuzzy semantic similarity measurementData & Knowledge Engineering10.1016/j.datak.2023.102155145:COnline publication date: 1-May-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 19 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Image Similarity Using an Ensemble of Context-Sensitive ModelsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3672004(1758-1769)Online publication date: 25-Aug-2024
  • (2023)When Automatic Filtering Comes to the Rescue: Pre-Computing Company Competitor Pairs in OwlerProceedings of the ACM on Management of Data10.1145/35897871:2(1-23)Online publication date: 20-Jun-2023
  • (2023)Neurofuzzy semantic similarity measurementData & Knowledge Engineering10.1016/j.datak.2023.102155145:COnline publication date: 1-May-2023
  • (2023)Semantic Attack on Disassociated Transaction DataSN Computer Science10.1007/s42979-023-01781-64:4Online publication date: 20-Apr-2023
  • (2023)New information search model for online reviews with the perspective of user requirementsMultimedia Tools and Applications10.1007/s11042-023-14847-782:18(28165-28185)Online publication date: 20-Feb-2023
  • (2022)A multi-strategy approach for the merging of multiple taxonomiesJournal of Information Science10.1177/016555152095234048:3(283-303)Online publication date: 1-Jun-2022
  • (2022)Understanding the evolution of a scientific field by clustering and visualizing knowledge graphsJournal of Information Science10.1177/016555152093791548:1(71-89)Online publication date: 1-Feb-2022
  • (2022)Judging Instinct Exploitation in Statistical Data Explanations Based on Word EmbeddingProceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society10.1145/3514094.3534171(867-879)Online publication date: 26-Jul-2022
  • (2022)Automatic development of requirement linking matrix based on semantic similarity for robust software developmentJournal of Systems and Software10.1016/j.jss.2021.111211186:COnline publication date: 1-Apr-2022
  • (2022)Smart objects recommendation based on pre-training with attention and the thing–thing​ relationship in social Internet of thingsFuture Generation Computer Systems10.1016/j.future.2021.11.006129:C(347-357)Online publication date: 1-Apr-2022
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media