Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3085504.3085514acmotherconferencesArticle/Chapter ViewAbstractPublication PagesssdbmConference Proceedingsconference-collections
research-article

Bi-Level Online Aggregation on Raw Data

Published: 27 June 2017 Publication History

Abstract

In-situ processing has been proposed as a novel data exploration solution in many domains generating massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. The performance of in-situ processing across a query workload is, however, limited by the speed of full scan, tokenizing, and parsing of the entire data. Online aggregation (OLA) has been introduced as an efficient method for data exploration that identifies uninteresting patterns faster by continuously estimating the result of a computation during the actual processing---the computation can be stopped as early as the estimate is accurate enough to be deemed uninteresting. However, existing OLA solutions have a high upfront cost of randomly shuffling and/or sampling the data.
In this paper, we present OLA-RAW, a bi-level sampling scheme for parallel online aggregation over raw data. Sampling in OLA-RAW is query-driven and performed exclusively in-situ during the runtime query execution, without data reorganization. This is realized by a novel resource-aware bi-level sampling algorithm that processes data in random chunks concurrently and determines adaptively the number of sampled tuples inside a chunk. In order to avoid the cost of repetitive conversion from raw data, OLA-RAW builds and maintains a memory-resident bi-level sample synopsis incrementally. We implement OLA-RAW inside a modern in-situ data processing system and evaluate its performance across several real and synthetic datasets and file formats. Our results show that OLA-RAW chooses the sampling plan that minimizes the execution time and guarantees the required accuracy for each query in a given workload. The end result is a focused data exploration process that avoids unnecessary work and discards uninteresting data.

References

[1]
A. Abouzied et al. Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems. In EDBT/ICDT 2013.
[2]
S. Agarwal et al. Blink and It's Done: Interactive Queries on Very Large Data. PVLDB 5, 12 (2012).
[3]
S. Agarwal et al. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD 2014.
[4]
I. Alagiannis et al. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD 2012.
[5]
R. Avnur et al. CONTROL: Continuous Output and Navigation Technology with Refinement On-Line. In SIGMOD 1998.
[6]
S. Blanas, K. Wu, S. Byna, B. Dong, and A. Shoshani. Parallel Data Analysis Directly on Scientific File Formats. In SIGMOD 2014.
[7]
S. Chaudhuri, G. Das, and U. Srivastava. Effective Use of Block-Level Sampling in Statistics Estimation. In SIGMOD 2004.
[8]
S. Chen et al. PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees. In SIGMOD 2010.
[9]
Y. Cheng and F. Rusu. Parallel In-Situ Data Processing with Speculative Loading. In SIGMOD 2014.
[10]
Y. Cheng and F. Rusu. SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading. ACM TODS 40, 3 (2015).
[11]
Y. Cheng, W. Zhao, and F. Rusu. OLA-RAW: Scalable Exploration over Raw Data. arXiv 1702.00358. (2017).
[12]
W. Cochran. 1977. Sampling Techniques. Wiley.
[13]
T. Condie et al. MapReduce Online. In NSDI 2010.
[14]
G. Cormode, M. Garofalakis, P. Haas, and C. Jermaine. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1--3 (2012).
[15]
B. Ding et al. Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee. In SIGMOD 2016.
[16]
A. Dobra, C. Jermaine, F. Rusu, and F. Xu. Turbo-Charging Estimate Convergence in DBO. PVLDB 2, 1 (2009).
[17]
M. Garofalakis and P. Gibbon. Approximate Query Processing: Taming the TeraBytes. In VLDB 2001.
[18]
P. Haas. Large-Sample and Deterministic Confidence Intervals for Online Aggregation. In SSDBM 1997.
[19]
P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. In SIGMOD 1999.
[20]
P. Haas and C. König. A Bi-Level Bernoulli Scheme for Database Sampling. In SIGMOD 2004.
[21]
J. Hellerstein, P. Haas, and H. Wang. Online Aggregation. In SIGMOD 1997.
[22]
S. Idreos et al. Here Are My Data Files. Here Are My Queries. Where Are My Results? In CIDR 2011.
[23]
M. Ivanova, M. Kersten, and S. Manegold. Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories. In SSDBM 2012.
[24]
C. Jermaine et al. Online Estimation for Subset-Based SQL Queries. In VLDB 2005.
[25]
C. Jermaine et al. Scalable Approximate Query Processing with the DBO Engine. In SIGMOD 2007.
[26]
C. Jermaine et al. The Sort-Merge-Shrink Join. ACM TODS 31, 4 (2006).
[27]
S. Kandula et al. Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters. In SIGMOD 2016.
[28]
M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries Over Heterogeneous Data Through Engine Customization. PVLDB 9, 12 (2016).
[29]
M. Karpathiotakis et al. Just-In-Time Data Virtualization: Lightweight Data Management with ViDa. In CIDR 2015.
[30]
M. Karpathiotakis et al. Adaptive Query Processing on RAW Data. PVLDB 7, 12 (2014).
[31]
M. Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR 2015.
[32]
N. Laptev, K. Zeng, and C. Zaniolo. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB 5, 10 (2012).
[33]
N. Law et al. The Palomar Transient Factory: System Overview, Performance and First Results. arXiv 0906.5350 (2009).
[34]
F. Li, B. Wu, K. Yi, and Z. Zhao. Wander Join: Online Aggregation via Random Walks. In SIGMOD 2016.
[35]
G. Luo, C. Ellmann, P. Haas, and J. Naughton. A Scalable Hash Ripple Join Algorithm. In SIGMOD 2002.
[36]
T. Muhlbauer, W. Rodiger, R. Seilbeck et al. Instant Loading for Main Memory Databases. PVLDB 6, 14 (2013).
[37]
F. Olken. Random Sampling from Databases. (1993). UC Berkeley.
[38]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. PVLDB 4, 11 (2011).
[39]
C. Qin and F. Rusu. PF-OLA: A High-Performance Framework for Parallel Online Aggregation. DAPD 32, 3 (2014).
[40]
F. Rusu, F. Xu, L. Perez, M. Wu, R. Jampani, C. Jermaine, and A. Dobra. The DBO Database System. In SIGMOD 2008.
[41]
S. Thompson. 2012. Sampling. Wiley.
[42]
J. Vitter. Random Sampling with a Reservoir. ACM TOMS. 11, 1 (1985).
[43]
L. Wang, R. Christensen, F. Li, and K. Yi. Spatial Online Sampling and Aggregation. PVLDB 9, 3 (2016).
[44]
M. Wu and C. Jermaine. A Bayesian Method for Guessing the Extreme Values in a Data Set. In VLDB 2007.
[45]
S. Wu, S. Jiang, B. C. Ooi, and K.-L. Tan. Distributed Online Aggregation. PVLDB 2, 1 (2009).
[46]
S. Wu et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD 2010.
[47]
K. Zeng, S. Agarwal, and I. Stoica. iOLAP: Managing Uncertainty for Efficient Incremental OLAP. In SIGMOD 2016.
[48]
K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing. In SIGMOD 2014.
[49]
W. Zhao, Y. Cheng, and F. Rusu. Vertical Partitioning for Query Processing over Raw Data. In SSDBM 2015.

Cited By

View all
  • (2024)Quantum Tensor DBMS and Quantum Gantt Charts: Towards Exponentially Faster Earth Data EngineeringEarth10.3390/earth50300275:3(491-547)Online publication date: 14-Sep-2024
  • (2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
  • (2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management
June 2017
373 pages
ISBN:9781450352826
DOI:10.1145/3085504
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Northwestern University: Northwestern University

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2017

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM '17

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)2
Reflects downloads up to 28 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Quantum Tensor DBMS and Quantum Gantt Charts: Towards Exponentially Faster Earth Data EngineeringEarth10.3390/earth50300275:3(491-547)Online publication date: 14-Sep-2024
  • (2024)Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and QualityProceedings of the ACM on Management of Data10.1145/36771342:4(1-31)Online publication date: 30-Sep-2024
  • (2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 12-Apr-2024
  • (2022)Studying Early Decision Making with Progressive Bar ChartsIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.3209426(1-11)Online publication date: 2022
  • (2021)Array DBMSProceedings of the VLDB Endowment10.14778/3476311.347640414:12(3186-3189)Online publication date: 28-Oct-2021
  • (2020)Approximate partition selection for big-data workloads using summary statisticsProceedings of the VLDB Endowment10.14778/3407790.340784813:12(2606-2619)Online publication date: 14-Sep-2020
  • (2019)Joins on samplesProceedings of the VLDB Endowment10.14778/3372716.337272613:4(547-560)Online publication date: 9-Dec-2019
  • (2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
  • (2018)ChronosDBProceedings of the VLDB Endowment10.14778/3231751.323175411:10(1247-1261)Online publication date: 1-Jun-2018
  • (2018)Distributed caching for processing raw arraysProceedings of the 30th International Conference on Scientific and Statistical Database Management10.1145/3221269.3221295(1-12)Online publication date: 9-Jul-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media