research-article

Parallel in-situ data processing with speculative loading

Authors:

Florin Rusu

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 1287 - 1298

https://doi.org/10.1145/2588555.2593673

Published: 18 June 2014

Abstract

Traditional databases incur a significant data-to-query delay due to the requirement to load data inside the system before querying. Since this is not acceptable in many domains generating massive amounts of raw data, e.g., genomics, databases are entirely discarded. External tables, on the other hand, provide instant SQL querying over raw files. Their performance across a query workload, however, is limited by the speed of repeated full scans, tokenizing, and parsing of the entire file. In this paper, we propose SCANRAW, a novel database physical operator for in-situ processing over raw files that integrates data loading and external tables seamlessly while preserving their advantages: optimal performance across a query workload and zero time-to-query. Our major contribution is a parallel super-scalar pipeline implementation that allows SCANRAW to take advantage of the current many- and multi-core processors by overlapping the execution of independent stages. Moreover, SCANRAW overlaps query processing with loading by speculatively using the additional I/O bandwidth arising during the conversion process for storing data into the database such that subsequent queries execute faster. As a result, SCANRAW makes optimal use of the available system resources -- CPU cycles and I/O bandwidth -- by switching dynamically between tasks to ensure that optimal performance is achieved. We implement SCANRAW in a state-of-the-art database system and evaluate its performance across a variety of synthetic and real-world datasets. Our results show that SCANRAW with speculative loading achieves optimal performance for a query sequence at any point in the processing. Moreover, SCANRAW maximizes resource utilization for the entire workload execution while speculatively loading data and without interfering with normal query processing.
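
The idea in the abstract can be illustrated with a toy pipeline. The following Python sketch is not the SCANRAW implementation; it only assumes a three-stage READ, TOKENIZE, PARSE pipeline over a comma-separated file of integers. All names (`scan_raw`, `run_stage`, `CHUNK_LINES`, the `loaded` dictionary standing in for the database) are hypothetical, and the loading policy is an assumption: a converted chunk is stored only while the reader is blocked by a full buffer, which is treated here as a proxy for spare I/O bandwidth.

```python
import queue
import threading

CHUNK_LINES = 2  # hypothetical chunk size, in lines


def read_chunks(lines):
    """READ: split the raw file into (chunk_id, lines) pieces."""
    for start in range(0, len(lines), CHUNK_LINES):
        yield start // CHUNK_LINES, lines[start:start + CHUNK_LINES]


def tokenize(chunk):
    """TOKENIZE: split each comma-separated line into string fields."""
    return [line.split(",") for line in chunk]


def parse(rows):
    """PARSE: convert string fields into integers."""
    return [[int(field) for field in row] for row in rows]


def run_stage(fn, inbox, outbox):
    """Apply fn to every item from inbox until the None sentinel arrives."""
    while (item := inbox.get()) is not None:
        chunk_id, payload = item
        outbox.put((chunk_id, fn(payload)))
    outbox.put(None)


def scan_raw(lines, loaded, column):
    """Sum `column` over the file; `loaded` maps chunk_id -> stored rows."""
    to_tokenize = queue.Queue(maxsize=1)  # small buffer so back-pressure shows
    to_parse, parsed = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(tokenize, to_tokenize, to_parse)),
        threading.Thread(target=run_stage, args=(parse, to_parse, parsed)),
    ]
    for worker in workers:
        worker.start()

    total = 0
    for chunk_id, chunk in read_chunks(lines):
        if chunk_id in loaded:  # already in the database: skip conversion
            total += sum(row[column] for row in loaded[chunk_id])
            continue
        while True:
            try:
                to_tokenize.put_nowait((chunk_id, chunk))
                break
            except queue.Full:
                # Conversion is the bottleneck, so the reader would idle.
                # Assumption: treat that idle time as spare I/O bandwidth
                # and store one converted chunk for later queries.
                try:
                    done_id, rows = parsed.get(timeout=0.01)
                except queue.Empty:
                    continue
                total += sum(row[column] for row in rows)
                loaded[done_id] = rows
    to_tokenize.put(None)

    # Drain what is left; these chunks are used but not stored in this sketch.
    while (item := parsed.get()) is not None:
        total += sum(row[column] for row in item[1])
    for worker in workers:
        worker.join()
    return total


raw_file = ["1,10", "2,20", "3,30", "4,40", "5,50"]
database = {}
first = scan_raw(raw_file, database, column=1)
second = scan_raw(raw_file, database, column=1)
```

Both calls return the same sum. Which chunks end up in `database` depends on thread timing, so the second scan may read some chunks from the store and convert the rest again. That is the intended trade-off in this sketch: the first query starts immediately, and loading happens only as a side effect when it does not compete with reading.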

Index Terms

Parallel in-situ data processing with speculative loading
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
    2. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Utah, Snowbird, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 45

Total Downloads: 433

Downloads (Last 12 months): 19

Downloads (Last 6 weeks): 3

Reflects downloads up to 28 Nov 2024

Cited By

Nguyen T, Rahman M, Di S, Becchi M (2024). Significantly Improving Fixed-Ratio Compression Framework for Resource-limited Applications. Proceedings of the 53rd International Conference on Parallel Processing, pp. 845-855. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673092
Fathollahzadeh S, Boehm M (2023). GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example. Proceedings of the ACM on Management of Data, 1:2, pp. 1-26. Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3589265
Gavriilidis H, Henze F, Tzirita Zacharatou E, Markl V (2023). SheetReader. Information Systems, 115:C. Online publication date: 1-May-2023.
https://dl.acm.org/doi/10.1016/j.is.2023.102183
Jiang L, Zhao Z, Falsafi B, Ferdman M, Lu S, Wenisch T (2022). JSONSki: streaming semi-structured data with bit-parallel fast-forwarding. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 200-211. Online publication date: 28-Feb-2022.
https://dl.acm.org/doi/10.1145/3503222.3507719
Zhao K, Di S, Perez D, Liang X, Chen Z, Cappello F (2022). MDZ: An Efficient Error-bounded Lossy Compressor for Molecular Dynamics. 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 27-40. Online publication date: May-2022.
https://doi.org/10.1109/ICDE53745.2022.00007
Jiang L, Qiu J, Zhao Z (2021). Scalable structural index construction for JSON analytics. Proceedings of the VLDB Endowment, 14:4, pp. 694-707. Online publication date: 22-Feb-2021.
https://dl.acm.org/doi/10.14778/3436905.3436926
Zhao K, Di S, Dmitriev M, Tonellot T, Chen Z, Cappello F (2021). Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation. 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 1643-1654. Online publication date: Apr-2021.
https://doi.org/10.1109/ICDE51399.2021.00145
Watson A, Das S, Ray S (2021). DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing. 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-10. Online publication date: 6-Oct-2021.
https://doi.org/10.1109/DSAA53316.2021.9564218
Dong B, Wu K, Byna S (2021). Introduction. User-Defined Tensor Data Analysis, pp. 1-8. Online publication date: 22-Feb-2021.
https://doi.org/10.1007/978-3-030-70750-7_1
Stehle E, Jacobsen H (2020). ParPaRaw. Proceedings of the VLDB Endowment, 13:5, pp. 616-628. Online publication date: 19-Feb-2020.
https://dl.acm.org/doi/10.14778/3377369.3377372
