research-article

Semantic hashing

Authors: Ruslan Salakhutdinov, Geoffrey HintonAuthors Info & Claims

Volume 50, Issue 7

Pages 969 - 978

https://doi.org/10.1016/j.ijar.2008.11.006

Published: 01 July 2009 Publication History

Abstract

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

References

[1]

A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, in: FOCS, IEEE Computer Society, 2006, pp. 459–468.

[2]

Y. Bengio, Y. LeCun, Scaling learning algorithms towards AI, Large-Scale Kernel Machines (2007).

[3]

D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

Digital Library

[4]

S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: IEEE Computer Vision and Pattern Recognition or CVPR, 2005, pp. 539–546.

[5]

Datar, Immorlica, Indyk, Mirrokni, Locality-Sensitive Hashing scheme based on p-stable distributions, in: COMPGEOM: Annual ACM Symposium on Computational Geometry, 2004.

[6]

S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman, Indexing by latent semantic analysis, Journal of the American Society of Information Science 41 (6) (1990) 391–407.

[7]

J.H. Friedman, J.L. Bentley, R.A. Finkel, An algorithm for finding best matches in logarithmic time, Technical Report 75-482, Stanford CS, 1975.

[8]

P. Gehler, A. Holub, M. Welling, The rate adapting poisson (RAP) model for information retrieval and object recognition, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.

[9]

G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (8) (2002) 1711–1800.

[10]

G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.

Digital Library

[11]

G.E. Hinton, S.T. Roweis, Stochastic neighbor embedding, in: Advances in Neural Information Processing Systems, MIT Press, 2002, pp. 833–840.

[12]

G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.

[13]

T. Hofmann, Probabilistic latent semantic analysis, in: Proceedings of the 15th Conference on Uncertainty in AI, Morgan Kaufmann, San Fransisco, California, 1999, pp. 289–296.

[14]

D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, Journal of Machine Learning Research 5 (2004) 361–397.

Digital Library

[15]

T. Liu, A. Moore, A. Gray, K. Yang, An investigation of practical approximate nearest neighbor algorithms, Advances in Neural Information Processing Systems (2004).

[16]

R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse and other variants, in: M.I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press, 1998, pp. 355–368.

[17]

R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, AI and Statistics (2007).

[18]

Salton, Developments in automatic text retrieval, Science (1991) 253.

[19]

G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management 24 (5) (1988) 513–523.

Digital Library

[20]

A. Torralba, R. Fergus, Y. Weiss, Small codes and large image databases for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[21]

M. Welling, M. Rosen-Zvi, G. Hinton, Exponential family harmoniums with an application to information retrieval, in: Advances in Neural Information Processing Systems, vol. 4, MIT Press, Cambridge, MA, 2005, pp. 1481–1488.

[22]

E. Xing, R. Yan, A.G. Hauptmann, Mining associated text and images with dual-wing harmoniums, in: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005), 2005.

Cited By

Li PTuzhilin A(2024)When Variety Seeking Meets UnexpectednessInformation Systems Research10.1287/isre.2021.005335:3(1257-1273)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1287/isre.2021.0053
Huang RCui CLian JTowey DSun WChen H(2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: 21-Mar-2024
https://dl.acm.org/doi/10.1109/TSE.2024.3379592
Akdemir EBarışçı N(2024)A review on deep learning applications with semanticsExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124029251:COnline publication date: 24-Jul-2024
https://dl.acm.org/doi/10.1016/j.eswa.2024.124029
Show More Cited By

Recommendations

LSISOM – A Latent Semantic Indexing Approach to Self-Organizing Maps of Document Collections

The Self Organizing Map (SOM) algorithm has been utilized, with much success, in a variety of applications for the automatic organization of full-text document collections. A great advantage of the SOM method is that document collections can be ordered ...
Extracting and modeling the semantic information content of web documents to support semantic document retrieval
APCCM '09: Proceedings of the Sixth Asia-Pacific Conference on Conceptual Modeling - Volume 96

Existing HTML mark-up is used only to indicate the structure and lay-out of documents, but not the document semantics. As a result web documents are difficult to be semantically processed, retrieved and explored by computer applications. Existing ...
Multilingual extraction of semantic indexes
SADPI '07: Proceedings of the 2007 international workshop on Semantically aware document processing and indexing

This article deals with multilingual document indexing. We propose an indexing method based on several stages. First of all the most important terms of the document are extracted using general characteristics of languages and statistical methods. Thus, ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of Approximate Reasoning

International Journal of Approximate Reasoning Volume 50, Issue 7

Jul 2009

220 pages

ISSN:0888-613X

Issue’s Table of Contents

Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 July 2009

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

371
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Li PTuzhilin A(2024)When Variety Seeking Meets UnexpectednessInformation Systems Research10.1287/isre.2021.005335:3(1257-1273)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1287/isre.2021.0053
Huang RCui CLian JTowey DSun WChen H(2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: 21-Mar-2024
https://dl.acm.org/doi/10.1109/TSE.2024.3379592
Akdemir EBarışçı N(2024)A review on deep learning applications with semanticsExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124029251:COnline publication date: 24-Jul-2024
https://dl.acm.org/doi/10.1016/j.eswa.2024.124029
Grycuk RDe Magistris GNapoli CScherer R(2024)Toward Real-Time Solar Content-Based Image RetrievalComputational Science – ICCS 202410.1007/978-3-031-63749-0_8(107-120)Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1007/978-3-031-63749-0_8
Rege AKusupati AS SFan ACao QKakade SJain PFarhadi AOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)AdANNSProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3669458(76311-76335)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3669458
Zhang SFeng YOh ANaumann TGloberson ASaenko KHardt MLevine S(2023)Unified segment-to-segment framework for simultaneous sequence generationProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668082(45235-45258)Online publication date: 10-Dec-2023
https://dl.acm.org/doi/10.5555/3666122.3668082
Hong SHuu QViet DThi Thuy QQuoc T(2023)Learning binary codes for fast image retrieval with sparse discriminant analysis and deep autoencodersIntelligent Data Analysis10.3233/IDA-22668727:3(809-831)Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.3233/IDA-226687
Liu RDou QLi HZhang JLiu J(2023)DeepLSH: A Deep Learning-based Locality Sensitive Hash MethodProceedings of the 15th International Conference on Digital Image Processing10.1145/3604078.3604136(1-9)Online publication date: 19-May-2023
https://dl.acm.org/doi/10.1145/3604078.3604136
Yu JShen YZhang HEl Saddik AMei TCucchiara RBertini MTobon Vallejo DAtrey PHossain M(2023)Hashing One With AllProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3611977(6420-6431)Online publication date: 26-Oct-2023
https://dl.acm.org/doi/10.1145/3581783.3611977
Tian ZLi CZuo ZWen ZSun LHu XZhang WHuang HWang SDeng WXie XZhang QSingh ASun YAkoglu LGunopulos DYan XKumar ROzcan FYe J(2023)PASS: Personalized Advertiser-aware Sponsored SearchProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599882(4924-4936)Online publication date: 6-Aug-2023
https://dl.acm.org/doi/10.1145/3580305.3599882
Show More Cited By

View Options

View options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents