Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2981780.2981899guideproceedingsArticle/Chapter ViewAbstractPublication PagesnipsConference Proceedingsconference-collections
Article

One sketch for all: theory and application of Conditional Random Sampling

Published: 08 December 2008 Publication History

Abstract

Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise (l2, l1) distances, in static, large-scale, and sparse data. This study modifies the original CRS and extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is "one-sketch-for-all." In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances.
We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval.

References

[1]
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. On demand classification of data streams. In KDD, 503-508, 2004.
[2]
Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors. Large-Scale Kernel Machines. The MIT Press, 2007.
[3]
Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055-1064, 1999.
[4]
Graham Cormode, Mayur Datar, Piotr Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Transactions on Knowledge and Data Engineering, 15(3):529-540, 2003.
[5]
Carlotta Domeniconi and Dimitrios Gunopulos. Incremental support vector machine construction. In ICDM, pages 589-592, 2001.
[6]
Sudipto Guha, Piotr Indyk, and Andrew McGregor. Sketching infomration divergence. In COLT, pages 424-438, 2007.
[7]
M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136-143, 2005.
[8]
Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. of ACM, 53(3):307-323, 2006.
[9]
Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494-501, 2007.
[10]
Ping Li. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. In SODA, 2008.
[11]
Ping Li. Compressed Counting. In SODA, 2009.
[12]
Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way Associations. Computational Linguistics, 33(3):305-354, 2007. Preliminary results appeared in HLT/EMNLP, 2005.
[13]
Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873-880, 2007.
[14]
Ping Li. Computationally efficient estimators for dimension reductions using stable random projections. In ICDM, 2008.
[15]
S. Muthukrishnan. Data streams: Algorithms and applications. Found. and Trends in Theoretical Computer Science, 1:117-236, 2 2005.
[16]
John C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In NIPS, pages 557-563, 1998.
[17]
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, 2002.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
NIPS'08: Proceedings of the 21st International Conference on Neural Information Processing Systems
December 2008
2000 pages
ISBN:9781605609492

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Publication History

Published: 08 December 2008

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2018)An accurate estimation algorithm for big data streamsDistributed and Parallel Databases10.1007/s10619-018-7225-536:3(461-483)Online publication date: 1-Sep-2018
  • (2017)Binary Vectors for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-017-9914-x53:1(138-156)Online publication date: 1-Jan-2017
  • (2016)Real-Valued Embeddings and Sketches for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-016-9899-x52:6(967-988)Online publication date: 1-Nov-2016
  • (2014)Estimation for monotone samplingProceedings of the 2014 ACM symposium on Principles of distributed computing10.1145/2611462.2611485(124-133)Online publication date: 15-Jul-2014
  • (2014)All-distances sketches, revisitedProceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems10.1145/2594538.2594546(88-99)Online publication date: 18-Jun-2014
  • (2010)Finding planted partitions in nearly linear time using arrested spectral clusteringProceedings of the 27th International Conference on International Conference on Machine Learning10.5555/3104322.3104341(135-142)Online publication date: 21-Jun-2010
  • (2010)b-Bit minwise hashing for estimating three-way similaritiesProceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 210.5555/2997046.2997051(1387-1395)Online publication date: 6-Dec-2010

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media