Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

A linear-time probabilistic counting algorithm for database applications

Published: 01 June 1990 Publication History

Abstract

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.

References

[1]
ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K.-Y. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1 (1987), 11-15.
[2]
BITTON, D., AND DEWITT, D.J. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June 1983), 255-265.
[3]
CONTE, S. D., AND DE BOOR, C. Elementary Numerical Analysis: An Algorithmic Approach. 3rd ed., McGraw-Hill, New York, 1980.
[4]
DEMOLOMBE, R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language. In Proceedings of the 6th International Conference on Very Large Data Bases, 1980, pp. 55-63.
[5]
FELLER, W. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed., Wiley, New York, 1968.
[6]
GELENBE, R., AND GARDY, D. The size of projections of relations satisfying a functional dependency. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City, 1982), pp. 325-333.
[7]
JOHNSON, N. L., AND KOTZ, S. Urn Models and Their Application. Wiley, New York, 1977.
[8]
MOOD, A. M., GRAYBILL, F. A., AND BOES, D.C. Introduction to the Theory of Statistics. 3rd ed., McGraw-Hill, New York, 1974.
[9]
OLKEN, F., AND ROTEM, D. Simple random sampling from relational databases. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 160-169.
[10]
ROWE, N.C. Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. SE-11, 10 (Oct. 1985), 1081-1091.
[11]
SELINGER, P., ET AL. Access path selection in a relational database management system. In Proceedings of the International Conference on Management of Data, A CM SIGMOD, 1979, pp. 23-34.
[12]
SHERMAN, B. A random variable related to the spacing of sample values. Ann. Math. Statistics 21 (1950), 339-361.
[13]
VANDER-ZANDEN, B. W., TAYLOR, H. M., AND BITTON, D. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 119-127.
[14]
WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Estimating block accesses in database organizations: A closed noniterative formula. Commum. ACM, 26, 11 (Nov. 1983), 940-944.
[15]
WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Separability--An approach to physical database design. IEEE Trans. Comput. C-33, 3 (Mar. 1984), 209-222.
[16]
WIEDERHOLD, G., AND EL-MARS1, R. The structural model for database design, in Proceedings of the International Conference on Entity Relationship Approach (Los Angeles, Dec. 1979), pp. 247-267.
[17]
W|EDERHOLD, G. Database Design. 2nd ed., McGraw-Hill, New York, 1983.
[18]
YOUSSEFI, K., AND WONG, E. Query processing in a relational database management system. In Proceedings of the 5th International Conference on Very Large Data Bases, 1979, pp. 409-417.
[19]
Yu, C. T., LAM, K., SIu, M. K., AND OZSOYOGLU, M. Performance analysis of three related assignment problems. In Proceedings of the International Conference on Management of Data, A CM SIGMOD, 1979, pp. 82-92.

Cited By

View all
  • (2024)ComPipe: A Novel Flow Placement and Measurement Algorithm for Programmable Composite PipelinesElectronics10.3390/electronics1306102213:6(1022)Online publication date: 8-Mar-2024
  • (2024)An Accurate and Invertible Sketch for Super Spread DetectionElectronics10.3390/electronics1301022213:1(222)Online publication date: 3-Jan-2024
  • (2024)Enhancing Accuracy for Super Spreader Identification in High-Speed Data StreamsProceedings of the VLDB Endowment10.14778/3681954.368198817:11(3124-3137)Online publication date: 30-Aug-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 15, Issue 2
June 1990
183 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/78922
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1990
Published in TODS Volume 15, Issue 2

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)598
  • Downloads (Last 6 weeks)68
Reflects downloads up to 27 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)ComPipe: A Novel Flow Placement and Measurement Algorithm for Programmable Composite PipelinesElectronics10.3390/electronics1306102213:6(1022)Online publication date: 8-Mar-2024
  • (2024)An Accurate and Invertible Sketch for Super Spread DetectionElectronics10.3390/electronics1301022213:1(222)Online publication date: 3-Jan-2024
  • (2024)Enhancing Accuracy for Super Spreader Identification in High-Speed Data StreamsProceedings of the VLDB Endowment10.14778/3681954.368198817:11(3124-3137)Online publication date: 30-Aug-2024
  • (2024)UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct CountingProceedings of the VLDB Endowment10.14778/3654621.365463217:7(1655-1668)Online publication date: 1-Mar-2024
  • (2024)Eagle: Toward Scalable and Near-Optimal Network-Wide Sketch Deployment in Network MeasurementProceedings of the ACM SIGCOMM 2024 Conference10.1145/3651890.3672244(291-310)Online publication date: 4-Aug-2024
  • (2024)Raising the Level of Abstraction for Sketch-Based Network Telemetry with SketchPlanProceedings of the 2024 ACM on Internet Measurement Conference10.1145/3646547.3689016(651-658)Online publication date: 4-Nov-2024
  • (2024)An LDP Compatible Sketch for Securely Approximating Set Intersection CardinalitiesProceedings of the ACM on Management of Data10.1145/36392812:1(1-27)Online publication date: 26-Mar-2024
  • (2024)QSketch: An Efficient Sketch for Weighted Cardinality Estimation in StreamsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671695(2432-2443)Online publication date: 25-Aug-2024
  • (2024)From CountMin to Super kJoin Sketches for Flow Spread EstimationIEEE Transactions on Network Science and Engineering10.1109/TNSE.2023.327966511:3(2353-2370)Online publication date: May-2024
  • (2024)Multi-Resolution Odd Sketch for Mining Extended Jaccard Similarity of Dynamic Streaming setsIEEE Transactions on Network Science and Engineering10.1109/TNSE.2023.327580911:3(2399-2414)Online publication date: May-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media