Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/375663.375685acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Independence is good: dependency-based histogram synopses for high-dimensional data

Published: 01 May 2001 Publication History

Abstract

Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.

References

[1]
Y. M.M. Bishop, S. E. Fienberg, and P. W. Holland. "Discrete Multivariate Analysis". The MIT Press, 1975.]]
[2]
J. R.S. Blair and B. Peyton. "An Introduction to Chordal Graphs and Clique Trees". In "Graph Theory and Sparse Matrix Computation". Springer-Verlag NY, Inc., 1993.]]
[3]
K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. "Approximate Query Processing Using Wavelets". In Proc. of the 26th Intl. Conf. on Very Large Data Bases, 2000.]]
[4]
R. Christensen. "Log-Linear Models and Logistic Regression". Springer-Verlag NY, Inc. (Springer Series in Statistics), 1997.]]
[5]
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. "Introduction to Algorithms". MIT Press, 1990.]]
[6]
J.N. Darroch, S.L. Lauritzen, and T.P. Speed. "Markov Fields and Log-Linear Interaction Models for Contigency Tables". The Annals of Statistics, 8(3):522-539, 1980.]]
[7]
A. Deshpande, M. Garofalakis, and R. Rastogi. "Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data". Bell Labs Tech. Report, 2001.]]
[8]
D. Edwards. Personal Communication, September 2000.]]
[9]
C. Faloutsos and I. Kamel. "Relaxing the Uniformity and Independence Assumptions Using the Concept of Fractal Dimension". Journal of Computer and Systems Sciences, 55(2):229-240, 1997.]]
[10]
B. Fox. "Discrete Optimization Via Marginal Analysis". Management Science, 13(3):211-216, 1966.]]
[11]
D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. "Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes". In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, 2000.]]
[12]
T. Ibaraki and N. Katoh. "Resource Allocation Problems - Algorithmic Approaches". MIT Press, 1988.]]
[13]
Y. E. Ioannidis and S. Christodoulakis. "Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results". ACM Trans. on Database Systems, 18(4):709-748, 1993.]]
[14]
Finn V. Jensen and Frank Jensen. "Optimal Junction Trees". In Proc. of the 10th Annual Conf. on Uncertainty in AI, 1994.]]
[15]
Francesco M. Malvestuto. "Approximating Discrete Probability Distributions with Decomposable Models". IEEE Trans. on Systems, Man, and Cybernetics, 21(5):1287-1294, 1991.]]
[16]
S. Muthukrishnan, V. Poosala, and T. Suel. "On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications". In Proc. of the Intl. Conf. on Database Theory, 1999.]]
[17]
J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference". Morgan Kaufmann Publishers, Inc., 1988.]]
[18]
V. Poosala and Y. E. Ioannidis. "Selectivity Estimation Without the Attribute Value Independence Assumption". In Proc. of the 23rd Intl. Conf. on Very Large Data Bases, 1997.]]
[19]
V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates". In Proc. of the 1996 ACM SIGMOD Intl. Conf. on Management of Data, 1996.]]
[20]
J. Whittaker. "Graphical Models in Applied Multivariate Statistics". John Wiley & Sons, Inc., 1990.]]

Cited By

View all
  • (2024)A Cause-Focused Query Optimizer Alert SystemProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679771(2981-2990)Online publication date: 21-Oct-2024
  • (2023)dbET: Execution Time Distribution-based Plan SelectionProceedings of the ACM on Management of Data10.1145/35887111:1(1-26)Online publication date: 30-May-2023
  • (2022)QueryFormerProceedings of the VLDB Endowment10.14778/3529337.352934915:8(1658-1670)Online publication date: 22-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN:1581133324
DOI:10.1145/375663
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2001

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS01
Sponsor:

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate 44 of 293 submissions, 15%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)28
  • Downloads (Last 6 weeks)3
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Cause-Focused Query Optimizer Alert SystemProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679771(2981-2990)Online publication date: 21-Oct-2024
  • (2023)dbET: Execution Time Distribution-based Plan SelectionProceedings of the ACM on Management of Data10.1145/35887111:1(1-26)Online publication date: 30-May-2023
  • (2022)QueryFormerProceedings of the VLDB Endowment10.14778/3529337.352934915:8(1658-1670)Online publication date: 22-Jun-2022
  • (2022)Mobile-app privacy nutrition labels missing key ingredients for successCommunications of the ACM10.1145/356396765:11(26-28)Online publication date: 20-Oct-2022
  • (2022)Building the SHAKTI microprocessorCommunications of the ACM10.1145/355663265:11(48-51)Online publication date: 20-Oct-2022
  • (2022)Cybersecurity in IndiaCommunications of the ACM10.1145/355491065:11(98-102)Online publication date: 20-Oct-2022
  • (2022)Impactful research and tooling for program correctnessCommunications of the ACM10.1145/355166565:11(52-53)Online publication date: 20-Oct-2022
  • (2022)Computing and assistive technology solutions for the visually impairedCommunications of the ACM10.1145/355163465:11(44-47)Online publication date: 20-Oct-2022
  • (2022)Theory research in IndiaCommunications of the ACM10.1145/355049665:11(88-93)Online publication date: 20-Oct-2022
  • (2022)Toward explainable deep learningCommunications of the ACM10.1145/355049165:11(68-69)Online publication date: 20-Oct-2022
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media