
Cardinality estimation: an experimental survey

Published: 01 December 2017

Abstract

Data preparation and data profiling comprise many tasks, both basic and complex, to analyze a dataset at hand and extract metadata, such as data distributions, key candidates, and functional dependencies. Among the most important types of metadata is the number of distinct values in a column, also known as the zeroth frequency moment. Cardinality estimation itself has been an active research topic in the past decades due to its many applications. The aim of this paper is to review the literature on cardinality estimation and to present a detailed experimental study of twelve algorithms, scaling far beyond the original experiments.
First, we outline and classify approaches to solving the problem of cardinality estimation: we describe their main ideas, error guarantees, advantages, and disadvantages. Our experimental survey then compares the performance of all twelve cardinality estimation algorithms. We evaluate the algorithms' accuracy, runtime, and memory consumption using synthetic and real-world datasets. Our results show that different algorithms excel in different categories, and we highlight their trade-offs.
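
To make the quantity being estimated concrete, the following is a minimal sketch, not taken from the paper's implementation. It contrasts an exact distinct count with a simple bitmap-based estimate in the style of linear counting. The function names, the bitmap size `m`, and the use of SHA-256 as the hash function are illustrative assumptions.

```python
import hashlib
import math


def exact_distinct(values):
    """Exact number of distinct values; memory grows with the cardinality."""
    return len(set(values))


def linear_counting_estimate(values, m=4096):
    """Estimate the number of distinct values with an m-slot bitmap.

    Each value is hashed to one of m slots. With z slots still empty,
    the estimate is -m * ln(z / m). The slot count m and the SHA-256
    hash are assumptions chosen for this illustration.
    """
    bitmap = bytearray(m)  # one byte per slot keeps the sketch simple
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % m
        bitmap[slot] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap is saturated; use a larger m")
    return -m * math.log(zeros / m)


# A column with 10,000 rows but only 1,000 distinct values.
column = [i % 1000 for i in range(10_000)]
print(exact_distinct(column))
print(round(linear_counting_estimate(column)))
```

The exact count needs memory proportional to the number of distinct values, whereas the bitmap has a fixed size and pays for that with an approximation error. This is one instance of the accuracy, runtime, and memory trade-off that the survey evaluates.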

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 4
December 2017
133 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
