research-article

Indexing for summary queries: Theory and practice

Authors:

Zhewei WeiAuthors Info & Claims

ACM Transactions on Database Systems (TODS), Volume 39, Issue 1

Article No.: 2, Pages 1 - 39

https://doi.org/10.1145/2508702

Published: 06 January 2014 Publication History

Abstract

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records, but some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index.

However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this article, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, and various sketches, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost. We illustrate the efficiency and usefulness of our designs through extensive experiments and a system demonstration.

References

[1]

P. Afshani, G. S. Brodal, and N. Zeh. 2011. Ordered and unordered top-k range reporting in large data sets. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

Digital Library

[2]

P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. 2012. Mergeable summaries. In Proceedings of the ACM Symposium on Principles of Database Systems.

Digital Library

[3]

P. K. Agarwal and J. Erickson. 1999. Geometric range searching and its relatives. In Advances in Discrete and Computational Geometry. American Mathematical Society, 1--56.

[4]

N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. 2002. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64, 3, 719--747.

Digital Library

[5]

N. Alon, Y. Matias, and M. Szegedy. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.

Digital Library

[6]

A. Arasu and G. Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the ACM Symposium on Principles of Database Systems.

Digital Library

[7]

L. Arge and J. S. Vitter. 2003. Optimal external memory interval management. SIAM J. Comput. 32, 6, 1488--1508.

Digital Library

[8]

R. A. Baeza-Yates and G. H. Gonnet. 1991. Handbook of Algorithms and Data Structures. Addison-Wesley.

Digital Library

[9]

Z. Bar-Yossef. 2002. The complexity of massive data set computations. Ph.D. Dissertation, University of California at Berkeley.

Digital Library

[10]

K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[11]

G. S. Brodal, B. Gfeller, A. G. Jørgensen, and P. Sanders. 2011. Towards optimal range medians. Theor. Comput. Sci. 412, 1, 2588--2601.

Digital Library

[12]

F. Buccafurri, G. Lax, D. Sacca, L. Pontieri, and D. Rosaci. 2008. Enhancing histograms by tree-like bucket indices. VLDB J. 17, 5, 1041--1061.

Digital Library

[13]

S. Chaudhuri, G. Das, and U. Srivastava. 2004. Effective use of block-level sampling in statistics estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[14]

S. Chaudhuri, R. Motwani, and V. Narasayya. 1998. Random sampling for histogram construction: How much is enough&quest; In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[15]

G. Cormode and M. Hadjieleftheriou. 2008. Finding frequent items in data streams. In Proceedings of the International Conference on Very Large Data Bases.

Digital Library

[16]

G. Cormode and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55, 1, 58--75.

Digital Library

[17]

M. Datar, A. Gionis, P. Indyk, and R. Motwani. 2002. Maintaining stream statistics over sliding windows. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

Digital Library

[18]

J. Gehrke, F. Korn, and D. Srivastava. 2001. On computing correlated aggregates over continual data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[19]

A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases.

Digital Library

[20]

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining Knowl. Discov. 1, 1, 29--53.

Digital Library

[21]

M. Greenwald and S. Khanna. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[22]

J. Hellerstein, P. Haas, and H. Wang. 1997. Online aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[23]

Z. Huang, L. Wang, K. Yi, and Y. Liu. 2011. Sampling based algorithms for quantile computation in sensor networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[24]

C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. 2008. Scalable approximate query processing with the dbo engine. ACM Trans. Datab. Syst. 33, 4.

Digital Library

[25]

A. Jørgensen and K. Larsen. 2011. Range selection and median: Tight cell probe lower bounds and adaptive data structures. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

Digital Library

[26]

X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. 2007. Selecting stars: The k most representative skyline operator. In Proceedings of the IEEE International Conference on Data Engineering.

[27]

A. McGregor, A. Pavan, S. Tirthapura, and D. P. Woodruff. 2012. Space-efficient estimation of statistics over sub-sampled streams. In Proceedings of the ACM Symposium on Principles of Database Systems.

Digital Library

[28]

A. Metwally, D. Agrawal, and A. Abbadi. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Datab. Syst. 31, 3, 1095--1133.

Digital Library

[29]

J. Misra and D. Gries. 1982. Finding repeated elements. Sci. Comput. Program. 2, 2, 143--152.

[30]

J. I. Munro and M. S. Paterson. 1980. Selection and sorting with limited storage. Theor. Comput. Sci. 12, 3, 315--323.

[31]

S. Muthukrishnan. 2005. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. Now Publishers.

Digital Library

[32]

H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.

Digital Library

[33]

N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. 2004. Medians and beyond: New aggregation techniques for sensor networks. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys'04). ACM.

Digital Library

[34]

Y. Tao, L. Ding, X. Lin, and J. Pei. 2009. Distance-based representative skyline. In Proceedings of the IEEE International Conference on Data Engineering.

Digital Library

[35]

Y. Tao, G. Kollios, J. Considine, F. Li, and Papadias. 2004. Spatio-temporal aggregation using sketches. In Proceedings of the IEEE International Conference on Data Engineering.

Digital Library

[36]

V. N. Vapnik and A. Y. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 2, 264--280.

[37]

J. S. Vitter. 2008. Algorithms and Data Structures for External Memory. Now Publishers.

Digital Library

[38]

Z. Wei and K. Yi. 2011. Beyond simple aggregates: Indexing for summary queries. In Proceedings of the ACM Symposium on Principles of Database Systems.

Digital Library

Cited By

Cormode GGarofalakis MShekelyan MLibkin LPichler RGuagliardo P(2021)Data-Independent Space Partitionings for SummariesProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458316(285-298)Online publication date: 20-Jun-2021
https://dl.acm.org/doi/10.1145/3452021.3458316
Gan EBailis PCharikar M(2020)CoopStoreProceedings of the VLDB Endowment10.14778/3407790.340781713:12(2174-2187)Online publication date: 14-Sep-2020
https://dl.acm.org/doi/10.14778/3407790.3407817
Agarwal PSintos SSteiger ASuciu DTao YWei Z(2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
https://dl.acm.org/doi/10.1145/3375395.3387667
Show More Cited By

Index Terms

Indexing for summary queries: Theory and practice

Recommendations

Beyond simple aggregates: indexing for summary queries
PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as ...
Structure-aware indexing for keyword search in databases
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

Most of existing methods of keyword search over relational databases find the Steiner trees composed of relevant tuples as the answers. They identify the Steiner trees by discovering the rich structural relationships between tuples, and neglect the fact ...
Top-k spatial keyword queries on road networks
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

With the popularization of GPS-enabled devices there is an increasing interest for location-based queries. In this context, one interesting problem is processing top-k spatial keyword queries. Given a set of objects with a textual description (e.g., ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems

ACM Transactions on Database Systems Volume 39, Issue 1

January 2014

317 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/2576988

Issue’s Table of Contents

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 January 2014

Accepted: 01 July 2013

Revised: 01 March 2013

Received: 01 July 2012

Published in TODS Volume 39, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Research Grants Council, University Grants Committee, Hong Kong

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
745
Total Downloads

Downloads (Last 12 months)13
Downloads (Last 6 weeks)2

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Cormode GGarofalakis MShekelyan MLibkin LPichler RGuagliardo P(2021)Data-Independent Space Partitionings for SummariesProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458316(285-298)Online publication date: 20-Jun-2021
https://dl.acm.org/doi/10.1145/3452021.3458316
Gan EBailis PCharikar M(2020)CoopStoreProceedings of the VLDB Endowment10.14778/3407790.340781713:12(2174-2187)Online publication date: 14-Sep-2020
https://dl.acm.org/doi/10.14778/3407790.3407817
Agarwal PSintos SSteiger ASuciu DTao YWei Z(2020)Efficient Indexes for Diverse Top-k Range QueriesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387667(213-227)Online publication date: 14-Jun-2020
https://dl.acm.org/doi/10.1145/3375395.3387667
Shan JFu YNi GLuo JWu Z(2017)Fast counting the cardinality of flows for big traffic over sliding windowsFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6053-x11:1(119-129)Online publication date: 1-Feb-2017
https://dl.acm.org/doi/10.1007/s11704-016-6053-x
Elmasry AHe MMunro JNicholson P(2016)Dynamic range majority data structuresTheoretical Computer Science10.1016/j.tcs.2016.07.039647:C(59-73)Online publication date: 27-Sep-2016
https://dl.acm.org/doi/10.1016/j.tcs.2016.07.039
Wang LChristensen RLi FYi K(2015)Spatial online sampling and aggregationProceedings of the VLDB Endowment10.14778/2850583.28505849:3(84-95)Online publication date: 1-Nov-2015
https://dl.acm.org/doi/10.14778/2850583.2850584

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents