Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Indexing for summary queries: Theory and practice

Published: 06 January 2014 Publication History

Abstract

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index.
However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this article, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, and various sketches, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost. We illustrate the efficiency and usefulness of our designs through extensive experiments and a system demonstration.

References

[1]
P. Afshani, G. S. Brodal, and N. Zeh. 2011. Ordered and unordered top-k range reporting in large data sets. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
[2]
P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. 2012. Mergeable summaries. In Proceedings of the ACM Symposium on Principles of Database Systems.
[3]
P. K. Agarwal and J. Erickson. 1999. Geometric range searching and its relatives. In Advances in Discrete and Computational Geometry. American Mathematical Society, 1--56.
[4]
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. 2002. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64, 3, 719--747.
[5]
N. Alon, Y. Matias, and M. Szegedy. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.
[6]
A. Arasu and G. Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems.
[7]
L. Arge and J. S. Vitter. 2003. Optimal external memory interval management. SIAM J. Comput. 32, 6, 1488--1508.
[8]
R. A. Baeza-Yates and G. H. Gonnet. 1991. Handbook of Algorithms and Data Structures. Addison-Wesley.
[9]
Z. Bar-Yossef. 2002. The complexity of massive data set computations. Ph.D. Dissertation, University of California at Berkeley.
[10]
K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[11]
G. S. Brodal, B. Gfeller, A. G. Jørgensen, and P. Sanders. 2011. Towards optimal range medians. Theor. Comput. Sci. 412, 1, 2588--2601.
[12]
F. Buccafurri, G. Lax, D. Sacca, L. Pontieri, and D. Rosaci. 2008. Enhancing histograms by tree-like bucket indices. VLDB J. 17, 5, 1041--1061.
[13]
S. Chaudhuri, G. Das, and U. Srivastava. 2004. Effective use of block-level sampling in statistics estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[14]
S. Chaudhuri, R. Motwani, and V. Narasayya. 1998. Random sampling for histogram construction: How much is enough? In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[15]
G. Cormode and M. Hadjieleftheriou. 2008. Finding frequent items in data streams. In Proceedings of the International Conference on Very Large Data Bases.
[16]
G. Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55, 1, 58--75.
[17]
M. Datar, A. Gionis, P. Indyk, and R. Motwani. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
[18]
J. Gehrke, F. Korn, and D. Srivastava. 2001. On computing correlated aggregates over continual data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[19]
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases.
[20]
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining Knowl. Discov. 1, 1, 29--53.
[21]
M. Greenwald and S. Khanna. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[22]
J. Hellerstein, P. Haas, and H. Wang. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[23]
Z. Huang, L. Wang, K. Yi, and Y. Liu. 2011. Sampling based algorithms for quantile computation in sensor networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[24]
C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. 2008. Scalable approximate query processing with the dbo engine. ACM Trans. Datab. Syst. 33, 4.
[25]
A. Jørgensen and K. Larsen. 2011. Range selection and median: Tight cell probe lower bounds and adaptive data structures. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
[26]
X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. 2007. Selecting stars: The k most representative skyline operator. In Proceedings of the IEEE International Conference on Data Engineering.
[27]
A. McGregor, A. Pavan, S. Tirthapura, and D. P. Woodruff. 2012. Space-efficient estimation of statistics over sub-sampled streams. In Proceedings of the ACM Symposium on Principles of Database Systems.
[28]
A. Metwally, D. Agrawal, and A. Abbadi. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Datab. Syst. 31, 3, 1095--1133.
[29]
J. Misra and D. Gries. 1982. Finding repeated elements. Sci. Comput. Program. 2, 2, 143--152.
[30]
J. I. Munro and M. S. Paterson. 1980. Selection and sorting with limited storage. Theor. Comput. Sci. 12, 3, 315--323.
[31]
S. Muthukrishnan. 2005. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. Now Publishers.
[32]
H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
[33]
N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. 2004. Medians and beyond: New aggregation techniques for sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys'04). ACM.
[34]
Y. Tao, L. Ding, X. Lin, and J. Pei. 2009. Distance-based representative skyline. In Proceedings of the IEEE International Conference on Data Engineering.
[35]
Y. Tao, G. Kollios, J. Considine, F. Li, and Papadias. 2004. Spatio-temporal aggregation using sketches. In Proceedings of the IEEE International Conference on Data Engineering.
[36]
V. N. Vapnik and A. Y. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 2, 264--280.
[37]
J. S. Vitter. 2008. Algorithms and Data Structures for External Memory. Now Publishers.
[38]
Z. Wei and K. Yi. 2011. Beyond simple aggregates: Indexing for summary queries. In Proceedings of the ACM Symposium on Principles of Database Systems.

Cited By

View all
  • (2021)Data-Independent Space Partitionings for SummariesProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458316(285-298)Online publication date: 20-Jun-2021
  • (2020)CoopStoreProceedings of the VLDB Endowment10.14778/3407790.340781713:12(2174-2187)Online publication date: 14-Sep-2020
  • (2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 39, Issue 1
January 2014
317 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2576988
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 January 2014
Accepted: 01 July 2013
Revised: 01 March 2013
Received: 01 July 2012
Published in TODS Volume 39, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Indexing
  2. summary queries

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)13
  • Downloads (Last 6 weeks)2
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Data-Independent Space Partitionings for SummariesProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458316(285-298)Online publication date: 20-Jun-2021
  • (2020)CoopStoreProceedings of the VLDB Endowment10.14778/3407790.340781713:12(2174-2187)Online publication date: 14-Sep-2020
  • (2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
  • (2017)Fast counting the cardinality of flows for big traffic over sliding windowsFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6053-x11:1(119-129)Online publication date: 1-Feb-2017
  • (2016)Dynamic range majority data structuresTheoretical Computer Science10.1016/j.tcs.2016.07.039647:C(59-73)Online publication date: 27-Sep-2016
  • (2015)Spatial online sampling and aggregationProceedings of the VLDB Endowment10.14778/2850583.28505849:3(84-95)Online publication date: 1-Nov-2015

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media