Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

A Closer Look at Variance Implementations In Modern Database Systems

Published: 11 May 2017 Publication History

Abstract

Variance is a popular and often necessary component of aggregation queries. It is typically used as a secondary measure to ascertain statistical properties of the result such as its error. Yet, it is more expensive to compute than primary measures such as SUM, MEAN, and COUNT.
There exist numerous techniques to compute variance. While the definition of variance implies two passes overthe data, other mathematical formulations lead to a singlepass computation. Some single-pass formulations, however, can suffer from severe precision loss, especially for large datasets.
In this paper, we study variance implementations in various real-world systems and find that major database systems such as PostgreSQL and most likely System X, a major commercial closed-source database, use a representation that is efficient, but suffers from floating point precision loss resulting from catastrophic cancellation. We review literature over the past five decades on variance calculation in both the statistics and database communities, and summarize recommendations on implementing variance functions in various settings, such as approximate query processing and large-scale distributed aggregation. Interestingly, we recommend using the mathematical formula for computing variance if two passes over the data are acceptable due to its precision, parallelizability, and surprisingly computation speed.

References

[1]
T. Chan et al. Algorithms for Computing the Sample Variance: Analysis and Recommendations. Am. Stat., 1983.
[2]
R. J. Hanson. Stably Updating Mean and Standard Deviation of Data. ACM, 1975.
[3]
J. M. Hellerstein, C. Ré, F. Schoppmann, et al. The MADlib Analytics Library: or MAD Skills, the SQL. VLDB, 2012.
[4]
N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
[5]
W. Kahan. Further Remarks on Reducing Truncation Errors. ACM, 8(1):40, 1965.
[6]
N. Kamat and A. Nandi. A Closer Look at Variance Implementations In Modern Database Systems. Arxiv TR, 2016.
[7]
S. Kandel et al. Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment. AVI, 2012.
[8]
D. E. Knuth. Art of Computer Programming, Volume 2: Seminumerical Algorithms. 2014.
[9]
J. Rogers, J. Filliben, et al. StRD: Statistical Reference Datasets for Testing the Numerical Accuracy of Statistical Software, 1998.
[10]
Y. Tian, S. Tatikonda, and B. Reinwald. Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE, 2012.
[11]
B. Welford. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 1962.
[12]
D. West. Updating Mean and Variance Estimates: An Improved Method. 1979.
[13]
E. A. Youngs et al. Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms. Technometrics, 1971.

Cited By

View all
  • (2019)Wilkinson's Tests and SQL PackagesACM SIGMOD Record10.1145/3377391.337739548:3(17-22)Online publication date: 20-Dec-2019
  • (2019)Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets2019 IEEE International Congress on Big Data (BigDataCongress)10.1109/BigDataCongress.2019.00023(77-84)Online publication date: Jul-2019
  • (2018)Numerically stable parallel computation of (co-)varianceProceedings of the 30th International Conference on Scientific and Statistical Database Management10.1145/3221269.3223036(1-12)Online publication date: 9-Jul-2018
  • Show More Cited By
  1. A Closer Look at Variance Implementations In Modern Database Systems

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM SIGMOD Record
    ACM SIGMOD Record  Volume 45, Issue 4
    December 2016
    48 pages
    ISSN:0163-5808
    DOI:10.1145/3092931
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 May 2017
    Published in SIGMOD Volume 45, Issue 4

    Check for updates

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 09 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)Wilkinson's Tests and SQL PackagesACM SIGMOD Record10.1145/3377391.337739548:3(17-22)Online publication date: 20-Dec-2019
    • (2019)Distributed, Numerically Stable Distance and Covariance Computation with MPI for Extremely Large Datasets2019 IEEE International Congress on Big Data (BigDataCongress)10.1109/BigDataCongress.2019.00023(77-84)Online publication date: Jul-2019
    • (2018)Numerically stable parallel computation of (co-)varianceProceedings of the 30th International Conference on Scientific and Statistical Database Management10.1145/3221269.3223036(1-12)Online publication date: 9-Jul-2018
    • (2017)A Unified Correlation-based Approach to Sampling Over JoinsProceedings of the 29th International Conference on Scientific and Statistical Database Management10.1145/3085504.3085524(1-12)Online publication date: 27-Jun-2017

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media