research-article

Bi-Level Online Aggregation on Raw Data

Authors:

Florin RusuAuthors Info & Claims

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management

Article No.: 10, Pages 1 - 12

https://doi.org/10.1145/3085504.3085514

Published: 27 June 2017 Publication History

Abstract

In-situ processing has been proposed as a novel data exploration solution in many domains generating massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. The performance of in-situ processing across a query workload is, however, limited by the speed of full scan, tokenizing, and parsing of the entire data. Online aggregation (OLA) has been introduced as an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing---the computation can be stopped as early as the estimate is accurate enough to be deemed uninteresting. However, existing OLA solutions have a high upfront cost of randomly shuffling and/or sampling the data.

In this paper, we present OLA-RAW, a bi-level sampling scheme for parallel online aggregation over raw data. Sampling in OLA-RAW is query-driven and performed exclusively in-situ during the runtime query execution, without data reorganization. This is realized by a novel resource-aware bi-level sampling algorithm that processes data in random chunks concurrently and determines adaptively the number of sampled tuples inside a chunk. In order to avoid the cost of repetitive conversion from raw data, OLA-RAW builds and maintains a memory-resident bi-level sample synopsis incrementally. We implement OLA-RAW inside a modern in-situ data processing system and evaluate its performance across several real and synthetic datasets and file formats. Our results show that OLA-RAW chooses the sampling plan that minimizes the execution time and guarantees the required accuracy for each query in a given workload. The end result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.

References

[1]

A. Abouzied et al. Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems. In EDBT/ICDT 2013.

Digital Library

[2]

S. Agarwal et al. Blink and It's Done: Interactive Queries on Very Large Data. PVLDB 5, 12 (2012).

Digital Library

[3]

S. Agarwal et al. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD 2014.

Digital Library

[4]

I. Alagiannis et al. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD 2012.

Digital Library

[5]

R. Avnur et al. CONTROL: Continuous Output and Navigation Technology with Refinement On-Line. In SIGMOD 1998.

Digital Library

[6]

S. Blanas, K. Wu, S. Byna, B. Dong, and A. Shoshani. Parallel Data Analysis Directly on Scientific File Formats. In SIGMOD 2014.

Digital Library

[7]

S. Chaudhuri, G. Das, and U. Srivastava. Effective Use of Block-Level Sampling in Statistics Estimation. In SIGMOD 2004.

Digital Library

[8]

S. Chen et al. PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees. In SIGMOD 2010.

Digital Library

[9]

Y. Cheng and F. Rusu. Parallel In-Situ Data Processing with Speculative Loading. In SIGMOD 2014.

Digital Library

[10]

Y. Cheng and F. Rusu. SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading. ACM TODS 40, 3 (2015).

Digital Library

[11]

Y. Cheng, W. Zhao, and F. Rusu. OLA-RAW: Scalable Exploration over Raw Data. arXiv 1702.00358. (2017).

[12]

W. Cochran. 1977. Sampling Techniques. Wiley.

[13]

T. Condie et al. MapReduce Online. In NSDI 2010.

Digital Library

[14]

G. Cormode, M. Garofalakis, P. Haas, and C. Jermaine. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1--3 (2012).

Digital Library

[15]

B. Ding et al. Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee. In SIGMOD 2016.

Digital Library

[16]

A. Dobra, C. Jermaine, F. Rusu, and F. Xu. Turbo-Charging Estimate Convergence in DBO. PVLDB 2, 1 (2009).

Digital Library

[17]

M. Garofalakis and P. Gibbon. Approximate Query Processing: Taming the TeraBytes. In VLDB 2001.

Digital Library

[18]

P. Haas. Large-Sample and Deterministic Confidence Intervals for Online Aggregation. In SSDBM 1997.

Digital Library

[19]

P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. In SIGMOD 1999.

Digital Library

[20]

P. Haas and C. König. A Bi-Level Bernoulli Scheme for Database Sampling. In SIGMOD 2004.

Digital Library

[21]

J. Hellerstein, P. Haas, and H. Wang. Online Aggregation. In SIGMOD 1997.

Digital Library

[22]

S. Idreos et al. Here Are My Data Files. Here Are My Queries. Where Are My Results? In CIDR 2011.

[23]

M. Ivanova, M. Kersten, and S. Manegold. Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories. In SSDBM 2012.

Digital Library

[24]

C. Jermaine et al. Online Estimation for Subset-Based SQL Queries. In VLDB 2005.

Digital Library

[25]

C. Jermaine et al. Scalable Approximate Query Processing with the DBO Engine. In SIGMOD 2007.

Digital Library

[26]

C. Jermaine et al. The Sort-Merge-Shrink Join. ACM TODS 31, 4 (2006).

Digital Library

[27]

S. Kandula et al. Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters. In SIGMOD 2016.

Digital Library

[28]

M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries Over Heterogeneous Data Through Engine Customization. PVLDB 9, 12 (2016).

Digital Library

[29]

M. Karpathiotakis et al. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.

[30]

M. Karpathiotakis et al. Adaptive Query Processing on RAW Data. PVLDB 7, 12 (2014).

Digital Library

[31]

M. Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR 2015.

[32]

N. Laptev, K. Zeng, and C. Zaniolo. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB 5, 10 (2012).

Digital Library

[33]

N. Law et al. The Palomar Transient Factory: System Overview, Performance and First Results. arXiv 0906.5350 (2009).

[34]

F. Li, B. Wu, K. Yi, and Z. Zhao. Wander Join: Online Aggregation via Random Walks. In SIGMOD 2016.

Digital Library

[35]

G. Luo, C. Ellmann, P. Haas, and J. Naughton. A Scalable Hash Ripple Join Algorithm. In SIGMOD 2002.

Digital Library

[36]

T. Muhlbauer, W. Rodiger, R. Seilbeck et al. Instant Loading for Main Memory Databases. PVLDB 6, 14 (2013).

Digital Library

[37]

F. Olken. Random Sampling from Databases. (1993). UC Berkeley.

[38]

N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. PVLDB 4, 11 (2011).

[39]

C. Qin and F. Rusu. PF-OLA: A High-Performance Framework for Parallel Online Aggregation. DAPD 32, 3 (2014).

Digital Library

[40]

F. Rusu, F. Xu, L. Perez, M. Wu, R. Jampani, C. Jermaine, and A. Dobra. The DBO Database System. In SIGMOD 2008.

Digital Library

[41]

S. Thompson. 2012. Sampling. Wiley.

[42]

J. Vitter. Random Sampling with a Reservoir. ACM TOMS. 11, 1 (1985).

Digital Library

[43]

L. Wang, R. Christensen, F. Li, and K. Yi. Spatial Online Sampling and Aggregation. PVLDB 9, 3 (2016).

Digital Library

[44]

M. Wu and C. Jermaine. A Bayesian Method for Guessing the Extreme Values in a Data Set. In VLDB 2007.

Digital Library

[45]

S. Wu, S. Jiang, B. C. Ooi, and K.-L. Tan. Distributed Online Aggregation. PVLDB 2, 1 (2009).

Digital Library

[46]

S. Wu et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD 2010.

Digital Library

[47]

K. Zeng, S. Agarwal, and I. Stoica. iOLAP: Managing Uncertainty for Efficient Incremental OLAP. In SIGMOD 2016.

Digital Library

[48]

K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD 2014.

Digital Library

[49]

W. Zhao, Y. Cheng, and F. Rusu. Vertical Partitioning for Query Processing over Raw Data. In SSDBM 2015.

Digital Library

Cited By

Rodriges Zalipynis R(2024)Quantum Tensor DBMS and Quantum Gantt Charts: Towards Exponentially Faster Earth Data EngineeringEarth10.3390/earth50300275:3(491-547)Online publication date: 14-Sep-2024
https://doi.org/10.3390/earth5030027
Tang XZhang FZhang SLiu YHe BHe BDu XDu X(2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3677134
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
https://doi.org/10.1007/s00778-024-00845-0
Show More Cited By

Index Terms

Bi-Level Online Aggregation on Raw Data
1. Information systems
  1. Data management systems
    1. Database management system engines

Recommendations

AQapprox: Aggregation Queries Approximation with Distribution-Aware Online Sampling
Web Information Systems Engineering – WISE 2020
Abstract
Approximate query processing (AQP) is an effective way to provide approximate results for SQL queries, which relaxing accuracy in exchange for higher processing speed. In sampling-based AQP techniques, random sampling works well for uniformly ...
A bi-level Bernoulli scheme for database sampling
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods ...
Adaptive query processing on RAW data

Database systems deliver impressive performance for large classes of workloads as the result of decades of research into optimizing database engines. High performance, however, is achieved at the cost of versatility. In particular, database systems only ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management

June 2017

373 pages

ISBN:9781450352826

DOI:10.1145/3085504

General Chair:
Alok Choudhary,
Program Chair:
Kesheng Wu,
Publications Chair:
Bin Dong

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Northwestern University: Northwestern University

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2017

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article
Research
Refereed limited

Conference

SSDBM '17

SSDBM '17: 29th International Conference on Scientific and Statistical Database Management

June 27 - 29, 2017

IL, Chicago, USA

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

11
Total Citations
View Citations
111
Total Downloads

Downloads (Last 12 months)9
Downloads (Last 6 weeks)2

Reflects downloads up to 28 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Rodriges Zalipynis R(2024)Quantum Tensor DBMS and Quantum Gantt Charts: Towards Exponentially Faster Earth Data EngineeringEarth10.3390/earth50300275:3(491-547)Online publication date: 14-Sep-2024
https://doi.org/10.3390/earth5030027
Tang XZhang FZhang SLiu YHe BHe BDu XDu X(2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3677134
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
https://doi.org/10.1007/s00778-024-00845-0
Patil ARicher GJermaine CMoritz DFekete J(2022)Studying Early Decision Making with Progressive Bar ChartsIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.3209426(1-11)Online publication date: 2022
https://doi.org/10.1109/TVCG.2022.3209426
Zalipynis R(2021)Array DBMSProceedings of the VLDB Endowment10.14778/3476311.347640414:12(3186-3189)Online publication date: 28-Oct-2021
https://dl.acm.org/doi/10.14778/3476311.3476404
Rong KLu YBailis PKandula SLevis P(2020)Approximate partition selection for big-data workloads using summary statisticsProceedings of the VLDB Endowment10.14778/3407790.340784813:12(2606-2619)Online publication date: 14-Sep-2020
https://dl.acm.org/doi/10.14778/3407790.3407848
Huang DYoon DPettie SMozafari B(2019)Joins on samplesProceedings of the VLDB Endowment10.14778/3372716.337272613:4(547-560)Online publication date: 9-Dec-2019
https://dl.acm.org/doi/10.14778/3372716.3372726
Alvarez-Ayllon APalomo-Duarte MDodero J(2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2018.2882244
Zalipynis R(2018)ChronosDBProceedings of the VLDB Endowment10.14778/3231751.323175411:10(1247-1261)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.14778/3231751.3231754
Zhao WRusu FDong BWu KHo ANugent PSacharidis DGamper JBöhlen M(2018)Distributed caching for processing raw arraysProceedings of the 30th International Conference on Scientific and Statistical Database Management10.1145/3221269.3221295(1-12)Online publication date: 9-Jul-2018
https://dl.acm.org/doi/10.1145/3221269.3221295
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents