research-article

QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams

Authors:

Rui XingAuthors Info & Claims

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 2432 - 2443

https://doi.org/10.1145/3637528.3671695

Published: 24 August 2024 Publication History

Abstract

Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element carries a positive weight. Unlike traditional cardinality estimation, limited research exists on weighted cardinality, with current methods requiring substantial memory and computational resources, challenging for devices with limited capabilities and real-time applications like anomaly detection. To address these issues, we propose QSketch, a memory-efficient sketch method for estimating weighted cardinality in streams. QSketch uses a quantization technique to condense continuous variables into a compact set of integer variables, with each variable requiring only 8 bits, making it 8 times smaller than previous methods. Furthermore, we leverage dynamic properties during QSketch generation to significantly enhance estimation accuracy and achieve a lower time complexity of O(1) for updating estimations upon encountering a new element. Experimental results on synthetic and real-world datasets show that QSketch is approximately 30% more accurate and two orders of magnitude faster than the state-of-the-art, using only 1/8 of the memory.

Supplemental Material

MP4 File - 227_qsketch_promo_vedio

Promotional Video of QSketch in KDD 2024

Download
18.64 MB

References

[1]

2010. Twitter. https://anlab-kaist.github.io/traces/

[2]

2018. CAIDA. https://www.caida.org/

[3]

2022. https://github.com/mkarppa/hyperlogloglog/blob/master/hyperlogloglog/ PackedVector.hpp#L130

[4]

2024. https://github.com/Rezerolird/QSketch

[5]

Saba Akram and Quarrat Ul Ann. 2015. Newton raphson method. International Journal of Scientific & Engineering Research 6, 7 (2015), 1748--1752.

[6]

Aiyou Chen, Jin Cao, Larry Shepp, and Tuan Nguyen. 2011. Distinct counting with a self-learning bitmap. JASA 106, 495 (2011), 879--890.

[7]

Wenji Chen, Yang Liu, and Yong Guan. 2013. Cardinality change-based early detection of large-scale cyber-attacks. In IEEE INFOCOM. 1788--1796.

[8]

Edith Cohen and Haim Kaplan. 2007. Summarizing data using bottom-k sketches. In PODC. 225--234.

[9]

Edith Cohen and Haim Kaplan. 2008. Tighter estimation using bottom k sketches. PVLDB 1, 1 (2008), 213--224.

Digital Library

[10]

Reuven Cohen, Liran Katzir, and Aviv Yehezkel. 2017. A minimal variance estimator for the cardinality of big data set intersection. In ACM SIGKDD. 95-- 103.

[11]

Jeffrey Considine, Feifei Li, George Kollios, and John Byers. 2004. Approximate aggregation techniques for sensor databases. In IEEE ICDE. 449--460.

[12]

Harald Cramér. 1999. Mathematical methods of statistics. Vol. 26. Princeton university press.

[13]

Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In ESA. 605--617.

[14]

Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In SIGCOMM. 153--166.

[15]

Ronald A Fisher and Frank Yates. 1938. Statistical tables: For biological, agricultural and medical research. Oliver and Boyd.

[16]

P. Flajolet. 1990. On Adaptive Sampling. Computing 43, 4 (1990), 391--400.

Digital Library

[17]

Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings (2007).

[18]

Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. JCSS 31, 2 (1985), 182--209.

Digital Library

[19]

Phillip B Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, Vol. 1. 541--550.

[20]

Frédéric Giroire. 2009. Order statistics and estimating cardinalities of massive data sets. DAM 157, 2 (2009), 406--427.

Digital Library

[21]

Hazar Harmouch and Felix Naumann. 2017. Cardinality estimation: An experimental survey. PVLDB 11, 4 (2017), 499--512.

[22]

Daniel M Kane, Jelani Nelson, and David PWoodruff. 2010. An optimal algorithm for the distinct elements problem. In PODS. 41--52.

[23]

Matti Karppa and Rasmus Pagh. 2022. HyperLogLogLog: Cardinality Estimation With One Log More. In ACM SIGKDD. 753--761.

[24]

S Sathiya Keerthi, Dennis DeCoste, and Thorsten Joachims. 2005. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research 6, 3 (2005).

[25]

Jérôme Kunegis, Gerd Gröner, and Thomas Gottron. 2012. Online dating recommender systems: The split-complex number approach. In Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web. 37--44.

Digital Library

[26]

Jakub Lemiesz. 2021. On the algebra of data sketches. PVLDB 14, 9 (2021), 1655--1667.

Digital Library

[27]

Jakub Lemiesz. 2023. Efficient framework for operating on data sketches. PVLDB 16, 8 (2023), 1967--1978.

Digital Library

[28]

David D Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. JMLR 5, Apr (2004), 361--397.

[29]

Ping Li and Christian König. 2010. b-Bit minwise hashing. In WWW. 671--680.

[30]

Xiaohui Long and Torsten Suel. 2005. Three-level caching for efficient query processing in large web search engines. In WebConf. 257--266.

[31]

Rifat Ozcan, Ismail Sengor Altingovde, and Özgür Ulusoy. 2011. Cost-aware strategies for query result caching in web search engines. ACM TWEB 5, 2 (2011), 1--25.

Digital Library

[32]

Christopher R Palmer, Georgos Siganos, Michalis Faloutsos, Christos Faloutsos, and Phillip B Gibbons. 2001. The connectivity and fault-tolerance of the Internet topology. (2001).

[33]

Yiyan Qi, Pinghui Wang, Yuanming Zhang, Junzhou Zhao, Guangjian Tian, and Xiaohong Guan. 2020. Fast generating a large number of gumbel-max variables. In WebConf. 796--807.

[34]

Daniel Ting. 2014. Streamed Approximate Counting of Distinct Elements: Beating Optimal Batch Methods. In ACM SIGKDD. 442--451.

[35]

Daniel Ting. 2016. Towards optimal cardinality estimation of unions and intersections with sketches. In ACM SIGKDD. 1195--1204.

[36]

De Wang, Danesh Irani, and Calton Pu. 2012. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In IEEE CollaborateCom. 40--49.

Digital Library

[37]

Pinghui Wang, Xiaohong Guan, Tao Qin, and Qiuzhen Huang. 2011. A data streaming method for monitoring host connection degrees of high-speed links. IEEE TIFS 6, 3 (2011), 1086--1098.

[38]

Pinghui Wang, Dongdong Xie, Junzhou Zhao, Jinsong Li, Zhicheng Li, Rundong Li, and Yang Ren. 2024. Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables. IEEE Transactions on Knowledge and Data Engineering (2024).

[39]

Wu Wei, Bin Li, Chen Ling, and Chengqi Zhang. 2017. Consistent Weighted Sampling Made More Practical. In WebConf. 1035--1043.

[40]

Kyu-Young Whang, Brad T Vander-Zanden, and Howard M Taylor. 1990. A linear-time probabilistic counting algorithm for database applications. ACM TODS 15, 2 (1990), 208--229.

Digital Library

[41]

Viktor Witkovsky. 2001. Computing the distribution of a linear combination of inverted gamma variables. Science Direct Working Paper S1574-0358 (2001), 04.

[42]

Qingjun Xiao, Shigang Chen, Min Chen, and Yibei Ling. 2015. Hyper-compact virtual estimators for big network data based on register sharing. In ACM SIGMETRICS. 417--428.

[43]

Qingjun Xiao, You Zhou, and Shigang Chen. 2017. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In IEEE INFOCOM. 1--9.

[44]

M Yoon, Tao Li, Shigang Chen, and J-K Peir. 2009. Fit a spread estimator in small memory. In IEEE INFOCOM. 504--512.

[45]

Yuanming Zhang, Pinghui Wang, Yiyan Qi, Kuankuan Cheng, Junzhou Zhao, Guangjian Tian, and Xiaohong Guan. 2023. Fast Gumbel-Max Sketch and its Applications. IEEE TKDE (2023).

Digital Library

[46]

Qi Zhao, Abhishek Kumar, and Jun Xu. 2005. Joint Data Streaming and Sampling Techniques for Detection of Super Sources and Destinations. In ACM SIGCOMM. 77--90.

Index Terms

QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams
1. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
  2. Models of computation

Index terms have been assigned to the content through auto-classification.

Recommendations

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been ...
Weighted Similarity Estimation in Data Streams
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

Similarity computation between pairs of objects is often a bottleneck in many applications that have to deal with massive volumes of data. Motivated by applications such as collaborative filtering in large-scale recommender systems, and influence ...
Virtual self-adaptive bitmap for online cardinality estimation
Abstract
Cardinality estimation is the task of obtaining the number of distinct items in a data stream, which plays an important role in many application domains. However, when dealing with high-speed data streams, it remains a significant ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2024

6901 pages

ISBN:9798400704901

DOI:10.1145/3637528

General Chairs:
Ricardo Baeza-Yates
Northeastern University, USA
,
Francesco Bonchi
CENTAI / Eurecat, Italy

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2024

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

KDD '24

Sponsor:

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
83
Total Downloads

Downloads (Last 12 months)83
Downloads (Last 6 weeks)22

Reflects downloads up to 28 Nov 2024

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents