Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Semantic hashing

Published: 01 July 2009 Publication History

Abstract

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs “semantic hashing”: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

References

[1]
A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, in: FOCS, IEEE Computer Society, 2006, pp. 459–468.
[2]
Y. Bengio, Y. LeCun, Scaling learning algorithms towards AI, Large-Scale Kernel Machines (2007).
[3]
D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[4]
S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: IEEE Computer Vision and Pattern Recognition or CVPR, 2005, pp. 539–546.
[5]
Datar, Immorlica, Indyk, Mirrokni, Locality-Sensitive Hashing scheme based on p-stable distributions, in: COMPGEOM: Annual ACM Symposium on Computational Geometry, 2004.
[6]
S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman, Indexing by latent semantic analysis, Journal of the American Society of Information Science 41 (6) (1990) 391–407.
[7]
J.H. Friedman, J.L. Bentley, R.A. Finkel, An algorithm for finding best matches in logarithmic time, Technical Report 75-482, Stanford CS, 1975.
[8]
P. Gehler, A. Holub, M. Welling, The rate adapting poisson (RAP) model for information retrieval and object recognition, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.
[9]
G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (8) (2002) 1711–1800.
[10]
G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
[11]
G.E. Hinton, S.T. Roweis, Stochastic neighbor embedding, in: Advances in Neural Information Processing Systems, MIT Press, 2002, pp. 833–840.
[12]
G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[13]
T. Hofmann, Probabilistic latent semantic analysis, in: Proceedings of the 15th Conference on Uncertainty in AI, Morgan Kaufmann, San Fransisco, California, 1999, pp. 289–296.
[14]
D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, Journal of Machine Learning Research 5 (2004) 361–397.
[15]
T. Liu, A. Moore, A. Gray, K. Yang, An investigation of practical approximate nearest neighbor algorithms, Advances in Neural Information Processing Systems (2004).
[16]
R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse and other variants, in: M.I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press, 1998, pp. 355–368.
[17]
R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, AI and Statistics (2007).
[18]
Salton, Developments in automatic text retrieval, Science (1991) 253.
[19]
G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management 24 (5) (1988) 513–523.
[20]
A. Torralba, R. Fergus, Y. Weiss, Small codes and large image databases for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[21]
M. Welling, M. Rosen-Zvi, G. Hinton, Exponential family harmoniums with an application to information retrieval, in: Advances in Neural Information Processing Systems, vol. 4, MIT Press, Cambridge, MA, 2005, pp. 1481–1488.
[22]
E. Xing, R. Yan, A.G. Hauptmann, Mining associated text and images with dual-wing harmoniums, in: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005), 2005.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of Approximate Reasoning
International Journal of Approximate Reasoning  Volume 50, Issue 7
Jul 2009
220 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 July 2009

Author Tags

  1. Information retrieval
  2. Graphical models
  3. Unsupervised learning

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)When Variety Seeking Meets UnexpectednessInformation Systems Research10.1287/isre.2021.005335:3(1257-1273)Online publication date: 1-Sep-2024
  • (2024)Toward Cost-Effective Adaptive Random Testing: An Approximate Nearest Neighbor ApproachIEEE Transactions on Software Engineering10.1109/TSE.2024.337959250:5(1182-1214)Online publication date: 21-Mar-2024
  • (2024)A review on deep learning applications with semanticsExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124029251:COnline publication date: 24-Jul-2024
  • (2024)Toward Real-Time Solar Content-Based Image RetrievalComputational Science – ICCS 202410.1007/978-3-031-63749-0_8(107-120)Online publication date: 2-Jul-2024
  • (2023)AdANNSProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3669458(76311-76335)Online publication date: 10-Dec-2023
  • (2023)Unified segment-to-segment framework for simultaneous sequence generationProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3668082(45235-45258)Online publication date: 10-Dec-2023
  • (2023)Learning binary codes for fast image retrieval with sparse discriminant analysis and deep autoencodersIntelligent Data Analysis10.3233/IDA-22668727:3(809-831)Online publication date: 1-Jan-2023
  • (2023)DeepLSH: A Deep Learning-based Locality Sensitive Hash MethodProceedings of the 15th International Conference on Digital Image Processing10.1145/3604078.3604136(1-9)Online publication date: 19-May-2023
  • (2023)Hashing One With AllProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3611977(6420-6431)Online publication date: 26-Oct-2023
  • (2023)PASS: Personalized Advertiser-aware Sponsored SearchProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599882(4924-4936)Online publication date: 6-Aug-2023
  • Show More Cited By

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media