Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3637528.3671695acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams

Published: 24 August 2024 Publication History

Abstract

Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element carries a positive weight. Unlike traditional cardinality estimation, limited research exists on weighted cardinality, with current methods requiring substantial memory and computational resources, challenging for devices with limited capabilities and real-time applications like anomaly detection. To address these issues, we propose QSketch, a memory-efficient sketch method for estimating weighted cardinality in streams. QSketch uses a quantization technique to condense continuous variables into a compact set of integer variables, with each variable requiring only 8 bits, making it 8 times smaller than previous methods. Furthermore, we leverage dynamic properties during QSketch generation to significantly enhance estimation accuracy and achieve a lower time complexity of O(1) for updating estimations upon encountering a new element. Experimental results on synthetic and real-world datasets show that QSketch is approximately 30% more accurate and two orders of magnitude faster than the state-of-the-art, using only 1/8 of the memory.

Supplemental Material

MP4 File - 227_qsketch_promo_vedio
Promotional Video of QSketch in KDD 2024

References

[1]
2010. Twitter. https://anlab-kaist.github.io/traces/
[2]
2018. CAIDA. https://www.caida.org/
[3]
2022. https://github.com/mkarppa/hyperlogloglog/blob/master/hyperlogloglog/ PackedVector.hpp#L130
[4]
2024. https://github.com/Rezerolird/QSketch
[5]
Saba Akram and Quarrat Ul Ann. 2015. Newton raphson method. International Journal of Scientific & Engineering Research 6, 7 (2015), 1748--1752.
[6]
Aiyou Chen, Jin Cao, Larry Shepp, and Tuan Nguyen. 2011. Distinct counting with a self-learning bitmap. JASA 106, 495 (2011), 879--890.
[7]
Wenji Chen, Yang Liu, and Yong Guan. 2013. Cardinality change-based early detection of large-scale cyber-attacks. In IEEE INFOCOM. 1788--1796.
[8]
Edith Cohen and Haim Kaplan. 2007. Summarizing data using bottom-k sketches. In PODC. 225--234.
[9]
Edith Cohen and Haim Kaplan. 2008. Tighter estimation using bottom k sketches. PVLDB 1, 1 (2008), 213--224.
[10]
Reuven Cohen, Liran Katzir, and Aviv Yehezkel. 2017. A minimal variance estimator for the cardinality of big data set intersection. In ACM SIGKDD. 95-- 103.
[11]
Jeffrey Considine, Feifei Li, George Kollios, and John Byers. 2004. Approximate aggregation techniques for sensor databases. In IEEE ICDE. 449--460.
[12]
Harald Cramér. 1999. Mathematical methods of statistics. Vol. 26. Princeton university press.
[13]
Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In ESA. 605--617.
[14]
Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In SIGCOMM. 153--166.
[15]
Ronald A Fisher and Frank Yates. 1938. Statistical tables: For biological, agricultural and medical research. Oliver and Boyd.
[16]
P. Flajolet. 1990. On Adaptive Sampling. Computing 43, 4 (1990), 391--400.
[17]
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings (2007).
[18]
Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. JCSS 31, 2 (1985), 182--209.
[19]
Phillip B Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, Vol. 1. 541--550.
[20]
Frédéric Giroire. 2009. Order statistics and estimating cardinalities of massive data sets. DAM 157, 2 (2009), 406--427.
[21]
Hazar Harmouch and Felix Naumann. 2017. Cardinality estimation: An experimental survey. PVLDB 11, 4 (2017), 499--512.
[22]
Daniel M Kane, Jelani Nelson, and David PWoodruff. 2010. An optimal algorithm for the distinct elements problem. In PODS. 41--52.
[23]
Matti Karppa and Rasmus Pagh. 2022. HyperLogLogLog: Cardinality Estimation With One Log More. In ACM SIGKDD. 753--761.
[24]
S Sathiya Keerthi, Dennis DeCoste, and Thorsten Joachims. 2005. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research 6, 3 (2005).
[25]
Jérôme Kunegis, Gerd Gröner, and Thomas Gottron. 2012. Online dating recommender systems: The split-complex number approach. In Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web. 37--44.
[26]
Jakub Lemiesz. 2021. On the algebra of data sketches. PVLDB 14, 9 (2021), 1655--1667.
[27]
Jakub Lemiesz. 2023. Efficient framework for operating on data sketches. PVLDB 16, 8 (2023), 1967--1978.
[28]
David D Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. JMLR 5, Apr (2004), 361--397.
[29]
Ping Li and Christian König. 2010. b-Bit minwise hashing. In WWW. 671--680.
[30]
Xiaohui Long and Torsten Suel. 2005. Three-level caching for efficient query processing in large web search engines. In WebConf. 257--266.
[31]
Rifat Ozcan, Ismail Sengor Altingovde, and Özgür Ulusoy. 2011. Cost-aware strategies for query result caching in web search engines. ACM TWEB 5, 2 (2011), 1--25.
[32]
Christopher R Palmer, Georgos Siganos, Michalis Faloutsos, Christos Faloutsos, and Phillip B Gibbons. 2001. The connectivity and fault-tolerance of the Internet topology. (2001).
[33]
Yiyan Qi, Pinghui Wang, Yuanming Zhang, Junzhou Zhao, Guangjian Tian, and Xiaohong Guan. 2020. Fast generating a large number of gumbel-max variables. In WebConf. 796--807.
[34]
Daniel Ting. 2014. Streamed Approximate Counting of Distinct Elements: Beating Optimal Batch Methods. In ACM SIGKDD. 442--451.
[35]
Daniel Ting. 2016. Towards optimal cardinality estimation of unions and intersections with sketches. In ACM SIGKDD. 1195--1204.
[36]
De Wang, Danesh Irani, and Calton Pu. 2012. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In IEEE CollaborateCom. 40--49.
[37]
Pinghui Wang, Xiaohong Guan, Tao Qin, and Qiuzhen Huang. 2011. A data streaming method for monitoring host connection degrees of high-speed links. IEEE TIFS 6, 3 (2011), 1086--1098.
[38]
Pinghui Wang, Dongdong Xie, Junzhou Zhao, Jinsong Li, Zhicheng Li, Rundong Li, and Yang Ren. 2024. Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables. IEEE Transactions on Knowledge and Data Engineering (2024).
[39]
Wu Wei, Bin Li, Chen Ling, and Chengqi Zhang. 2017. Consistent Weighted Sampling Made More Practical. In WebConf. 1035--1043.
[40]
Kyu-Young Whang, Brad T Vander-Zanden, and Howard M Taylor. 1990. A linear-time probabilistic counting algorithm for database applications. ACM TODS 15, 2 (1990), 208--229.
[41]
Viktor Witkovsky. 2001. Computing the distribution of a linear combination of inverted gamma variables. Science Direct Working Paper S1574-0358 (2001), 04.
[42]
Qingjun Xiao, Shigang Chen, Min Chen, and Yibei Ling. 2015. Hyper-compact virtual estimators for big network data based on register sharing. In ACM SIGMETRICS. 417--428.
[43]
Qingjun Xiao, You Zhou, and Shigang Chen. 2017. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In IEEE INFOCOM. 1--9.
[44]
M Yoon, Tao Li, Shigang Chen, and J-K Peir. 2009. Fit a spread estimator in small memory. In IEEE INFOCOM. 504--512.
[45]
Yuanming Zhang, Pinghui Wang, Yiyan Qi, Kuankuan Cheng, Junzhou Zhao, Guangjian Tian, and Xiaohong Guan. 2023. Fast Gumbel-Max Sketch and its Applications. IEEE TKDE (2023).
[46]
Qi Zhao, Abhishek Kumar, and Jun Xu. 2005. Joint Data Streaming and Sampling Techniques for Detection of Super Sources and Destinations. In ACM SIGCOMM. 77--90.

Index Terms

  1. QSketch: An Efficient Sketch for Weighted Cardinality Estimation in Streams
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
      August 2024
      6901 pages
      ISBN:9798400704901
      DOI:10.1145/3637528
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 24 August 2024

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. sketch
      2. streaming algorithms
      3. weighted cardinality estimation

      Qualifiers

      • Research-article

      Funding Sources

      Conference

      KDD '24
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 83
        Total Downloads
      • Downloads (Last 12 months)83
      • Downloads (Last 6 weeks)22
      Reflects downloads up to 28 Nov 2024

      Other Metrics

      Citations

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media