Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3584372.3588679acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article
Public Access

Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

Published: 18 June 2023 Publication History

Abstract

We present a new approach for independently computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Specifically, while our method exactly matches linear sketching for dense vectors, it yields significantly lower error for sparse vectors with limited overlap between non-zero entries. Such vectors arise in many applications involving sparse data, as well as in increasingly popular dataset search applications, where inner products are used to estimate data covariance, conditional means, and other quantities involving columns in unjoined tables. We complement our theoretical results by showing that our approach empirically outperforms existing linear sketches and unweighted hashing-based sketches for sparse vectors.

References

[1]
Dimitris Achlioptas. 2003. Database-friendly Random Projections: Johnson-Lindenstrauss with Binary Coins. J. Comput. Syst. Sci., Vol. 66, 4 (2003), 671--687.
[2]
Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy. 1999 a. Tracking Join and Self-Join Sizes in Limited Storage. In ¶ODS1999.
[3]
Noga Alon and Bo'az Klartag. 2017. Optimal Compression of Approximate Inner Products and Dimension Reduction. In FOCS2017. 639--650.
[4]
Noga Alon, Yossi Matias, and Mario Szegedy. 1999 b. The Space Complexity of Approximating the Frequency Moments. J. Comput. System Sci., Vol. 58, 1 (1999).
[5]
Rosa I. Arriaga and Santosh Vempala. 2006. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, Vol. 63, 2 (2006), 161--182.
[6]
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. 2007. On Synopses for Distinct-Value Estimation under Multiset Operations. In SIGMOD2007. 199--210.
[7]
Avrim Blum, John Hopcroft, and Ravindran Kannan. 2020. Foundations of Data Science. Cambridge University Press.
[8]
A.Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997. 21--29.
[9]
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 1998. Min-Wise Independent Permutations (Extended Abstract). In STOC1998. 327--336.
[10]
J. Lawrence Carter and Mark N. Wegman. 1979. Universal classes of hash functions. J. Comput. System Sci., Vol. 18, 2 (1979), 143--154.
[11]
Moses Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In STOC2002. 380--388.
[12]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding Frequent Items in Data Streams. In ICALP2002. 693--703.
[13]
Lianhua Chi and Xingquan Zhu. 2017. Hashing Techniques: A Survey and Taxonomy. ACM Comput. Surv., Vol. 50, 1 (2017).
[14]
Tobias Christiani. 2020. DartMinHash: Fast Sketching for Weighted Sets. arXiv2005.11547 (2020).
[15]
City of New York. 2022. Open Data NYC. https://opendata.cityofnewyork.us/.
[16]
Edith Cohen. 2016. Min-Hash Sketches. Springer New York, New York, NY, 1282--1287.
[17]
Edith Cohen and Haim Kaplan. 2007. Summarizing Data Using Bottom-k Sketches. In ¶ODC2007. 225--234.
[18]
Edith Cohen and Haim Kaplan. 2013. What You Can Do with Coordinated Samples. In APPROX2013. 452--467.
[19]
Graham Cormode, Minos Garofalakis, Peter Haas, and Chris Jermaine. 2011. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. NOW publishers.
[20]
Sanjoy Dasgupta and Anupam Gupta. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, Vol. 22, 1 (2003), 60--65.
[21]
Otmar Ertl. 2018. BagMinHash - Minwise Hashing Algorithm for Weighted Sets. In KDD2018. 1368--1377.
[22]
Raul Castro Fernandez, Jisoo Min, Demitri Nava, and Samuel Madden. 2019. Lazo: A cardinality-based method for coupled estimation of Jaccard similarity and containment. In ICDE2019. 1190--1201.
[23]
Philippe Flajolet and G. Nigel Martin. 1985. Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci., Vol. 31, 2 (1985), 182--209.
[24]
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases. 518--529.
[25]
Sreenivas Gollapudi and Rina Panigrahy. 2006. Exploiting Asymmetry in Hierarchical Topic Extraction. In CIKM2006. 475--482.
[26]
Bernhard Haeupler, Mark Manasse, and Kunal Talwar. 2014. Consistent Weighted Sampling Made Fast, Small, and Easy. arXiv1410.4266 (2014).
[27]
Nevin Heintze. 1996. Scalable Document Fingerprinting. In USENIX Workshop on Electronic Commerce.
[28]
Sergey Ioffe. 2010. Improved Consistent Sampling, Weighted Minhash and L1 Sketching. In ICDM2010. 246--255.
[29]
Laurent Jacques. 2015. A Quantized Johnson--Lindenstrauss Lemma: The Finding of Buffon's Needle. IEEE Trans. Inf., Vol. 61, 9 (2015), 5012--5027.
[30]
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An Optimal Algorithm for the Distinct Elements Problem. In ¶ODS2010. 41--52.
[31]
James Max Kanter and Kalyan Veeramachaneni. 2015. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 1--10.
[32]
Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss Lemma. In FOCS2017. 633--638.
[33]
Kasper Green Larsen, Rasmus Pagh, and Jakub Tve tek. 2021. CountSketches, Feature Hashing and the Median of Three. In ICML2021. 6011--6020.
[34]
Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer. 2015. The Mannheim Search Join Engine. Journal of Web Semantics, Vol. 35 (2015), 159 -- 166.
[35]
Ping Li. 2017. Linearized GMM Kernels and Normalized Random Fourier Features. In KDD2017. 315--324.
[36]
Ping Li, Kenneth Church, and Trevor Hastie. 2006. Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data. In NIPS2006, Vol. 19.
[37]
Ping Li and Arnd Christian König. 2010. b-Bit Minwise Hashing. In WWW2010.
[38]
Ping Li, Michael Mitzenmacher, and Martin Slawski. 2016. Quantized Random Projections and Non-Linear Estimation of Cosine Similarity. In NIPS2016, Vol. 29.
[39]
Ping Li, Art Owen, and Cun-hui Zhang. 2012. One Permutation Hashing. In NIPS2012.
[40]
Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd König. 2011. Hashing algorithms for large-scale learning. NIPS2011, Vol. 24 (2011).
[41]
Mark Manasse, Frank McSherry, and Kunal Talwar. 2010. Consistent Weighted Sampling. Technical Report MSR-TR-2010--73. https://www.microsoft.com/en-us/research/publication/consistent-weighted-sampling/
[42]
Udi Manber. 1994. Finding Similar Files in a Large File System. In USENIX Winter 1994 Technical Conference.
[43]
Tom Mitchell. 1997. 20 Newsgroups Dataset. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html.
[44]
Rasmus Pagh, Morten Stöckel, and David P. Woodruff. 2014. Is Min-Wise Hashing Optimal for Summarizing Set Intersection?. In ¶ODS2014. 109--120.
[45]
Florin Rusu and Alin Dobra. 2008. Sketches for size of join estimation. ACM Transactions on Database Systems (TODS), Vol. 33, 3 (2008), 1--46.
[46]
Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM, Vol. 18, 11 (1975), 613--620.
[47]
Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, and Juliana Freire. 2021. Correlation Sketches for Approximate Join-Correlation Queries. In SIGMOD2021. 199--210.
[48]
Aécio Santos, Aline Bessa, Christopher Musco, and Juliana Freire. 2022. A sketch-based index for correlated dataset search. In ICDE2022. IEEE, 2928--2941.
[49]
Anshumali Shrivastava. 2016. Simple and Efficient Weighted Minwise Hashing. In NIPS2016. 1506--1514.
[50]
Anshumali Shrivastava and Ping Li. 2014. In Defense of Minhash over Simhash. In AISTATS2014.
[51]
World Bank. 2022. World Bank Group Finances. https://finances.worldbank.org/
[52]
Wei Wu, Bin Li, Ling Chen, Junbin Gao, and Chengqi Zhang. 2020. A Review for Weighted MinHash Algorithms. IEEE Trans. Knowl. Data Eng. (2020), 1--1.
[53]
Wei Wu, Bin Li, Ling Chen, Chengqi Zhang, and Philip S. Yu. 2019. Improved Consistent Weighted Sampling Revisited. IEEE Trans. Knowl. Data Eng., Vol. 31, 12 (2019), 2332--2345.
[54]
Yang Yang, Ying Zhang, Wenjie Zhang, and Zengfeng Huang. 2019. Gb-kmv: An augmented kmv sketch for approximate containment similarity search. In ICDE2019. IEEE, 458--469.
[55]
Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In SIGMOD2019. 847--864.
[56]
Erkang Zhu, Fatemeh Nargesian, Ken Q Pu, and Renée J. Miller. 2016. LSH Ensemble: Internet-Scale Domain Search. Proceedings of the VLDB Endowment, Vol. 9, 12 (2016). io

Cited By

View all
  • (2024)Sketches-Based Join Size Estimation Under Local Differential Privacy2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00140(1726-1738)Online publication date: 13-May-2024
  • (2024)Efficiently Estimating Mutual Information Between Attributes Across Tables2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00022(193-206)Online publication date: 13-May-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '23: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2023
392 pages
ISBN:9798400701276
DOI:10.1145/3584372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2023

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. inner product estimation
  2. join-size estimation
  3. vector sketching

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '23
Sponsor:

Acceptance Rates

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)161
  • Downloads (Last 6 weeks)37
Reflects downloads up to 01 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Sketches-Based Join Size Estimation Under Local Differential Privacy2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00140(1726-1738)Online publication date: 13-May-2024
  • (2024)Efficiently Estimating Mutual Information Between Attributes Across Tables2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00022(193-206)Online publication date: 13-May-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media