research-article

Skipping-oriented partitioning for columnar layouts

Published: 01 November 2016

Abstract

As data volumes continue to grow, modern database systems increasingly rely on data skipping mechanisms to improve performance by avoiding access to irrelevant data. Recent work [39] proposed a fine-grained partitioning scheme that was shown to improve the opportunities for data skipping in row-oriented systems. Modern analytics and big data systems increasingly adopt columnar storage schemes, and in such systems, a row-based approach misses important opportunities for further improving data skipping. The flexibility of column-oriented organizations, however, comes with the additional cost of tuple reconstruction. In this paper, we develop Generalized Skipping-Oriented Partitioning (GSOP), a novel hybrid data skipping framework that takes into account these row-based and column-based tradeoffs. In contrast to previous column-oriented physical design work, GSOP considers the tradeoffs between horizontal data skipping and vertical partitioning jointly. Our experiments using two public benchmarks and a real-world workload show that GSOP can significantly reduce the amount of data scanned and improve end-to-end query response times over the state-of-the-art techniques.
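
The general idea of data skipping mentioned in the abstract can be illustrated with a small sketch. This is not GSOP itself, only a minimal example of min/max-based block skipping on a single column; the `Block` class, the `scan_with_skipping` function, and the sample values are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """One horizontal partition of a column, with min/max metadata (illustrative)."""
    values: List[int]
    lo: int = field(init=False)
    hi: int = field(init=False)

    def __post_init__(self) -> None:
        # Metadata is computed once, when the block is created.
        self.lo = min(self.values)
        self.hi = max(self.values)


def scan_with_skipping(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Evaluate low <= v <= high; return the matches and the number of blocks read."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # A block whose [lo, hi] range cannot overlap the predicate is skipped
        # without reading its values.
        if block.hi < low or block.lo > high:
            continue
        scanned += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, scanned


blocks = [Block([1, 3, 5]), Block([10, 12, 14]), Block([20, 25, 30])]
result, scanned = scan_with_skipping(blocks, 11, 13)
print(result, scanned)
```

In this sketch only the middle block overlaps the predicate range, so the other two are never read. How well skipping works depends on how tuples are assigned to blocks, which is the partitioning question the paper addresses.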

References

[1]
Apache Drill. https://drill.apache.org.
[2]
Apache Parquet. http://parquet.apache.org.
[3]
Big Data Benchmark. amplab.cs.berkeley.edu/benchmark.
[4]
CasJobs. http://skyserver.sdss.org/casjobs/.
[5]
Sloan Digital Sky Surveys. http://www.sdss.org.
[6]
TPC-H. http://www.tpc.org/tpch.
[7]
A. Ailamaki et al. Data page layouts for relational databases on deep memory hierarchies. VLDB Journal, 11(3):198--215, Nov. 2002.
[8]
A. Gupta et al. Amazon Redshift and the case for simpler data warehouses. In SIGMOD, pages 1917--1923, 2015.
[9]
A. Hall et al. Processing a trillion cells per mouse click. PVLDB, 5(11):1436--1446, 2012.
[10]
A. Jindal et al. Trojan data layouts: Right shoes for a running elephant. In SOCC, pages 21:1--21:14, New York, NY, USA, 2011.
[11]
A. Lamb et al. The Vertica analytic database: C-Store 7 years later. PVLDB, 5(12):1790--1801, 2012.
[12]
D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466--475, April 2007.
[13]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994.
[14]
I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, pages 1103--1114, New York, NY, USA, 2014. ACM.
[15]
B. Bhattacharjee et al. Efficient query processing for multi-dimensionally clustered tables in DB2. In VLDB, pages 963--974, 2003.
[16]
B. Dageville et al. The Snowflake elastic data warehouse. In SIGMOD, pages 215--226, 2016.
[17]
P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.
[18]
C. Curino, E. Jones, Y. Zhang, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3:48--57, 2010.
[19]
D. Abadi et al. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671--682, 2006.
[20]
D. Abadi et al. The design and implementation of modern column-oriented database systems. Foundations and Trends in Databases, 5(3), 2013.
[21]
D. Ślȩzak et al. Brighthouse: An analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008.
[22]
A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for MapReduce. PVLDB, 4(7), 2011.
[23]
S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, pages 68--78, 2007.
[24]
S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD, pages 297--308, 2009.
[25]
A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A comparison of knives for bread slicing. PVLDB, 6(6):361--372, 2013.
[26]
M. Armbrust et al. Spark SQL: relational data processing in Spark. In SIGMOD, pages 1383--1394, 2015.
[27]
M. Grund et al. Hyrise: A main memory hybrid storage engine. PVLDB, 4(2):105--116, Nov. 2010.
[28]
M. Stonebraker et al. C-store: A column-oriented DBMS. In VLDB, pages 553--564, 2005.
[29]
M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2, 2012.
[30]
M. Zukowski et al. DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing. In DaMoN, pages 47--54, 2008.
[31]
G. Moerkotte. Small materialized aggregates: A light weight index for data warehousing. In VLDB, pages 476--487, 1998.
[32]
R. Hankins et al. Data morphing: An adaptive, cache-conscious storage technique. In VLDB, pages 417--428. VLDB Endowment, 2003.
[33]
J. Rao, C. Zhang, N. Megiddo, and G. Lohman. Automating physical database design in a parallel database. In SIGMOD, pages 558--569, 2002.
[34]
S. Agrawal et al. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496--505, 2000.
[35]
S. Agrawal et al. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359--370, 2004.
[36]
S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.
[37]
S. Papadomanolakis et al. AutoPart: Automating schema design for large scientific databases using data partitioning. In SSDBM, pages 383--392, 2004.
[38]
F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97--108, Oct. 2013.
[39]
L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115--1126, 2014.
[40]
V. Raman et al. DB2 with BLU acceleration: So much more than just a column store. PVLDB, 6(11):1080--1091, 2013.
[41]
Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In ICDE, pages 1199--1208, 2011.
[42]
Y. Huai et al. Understanding insights into the basic structure and essential issues of table placement methods in clusters. PVLDB, 6(14), 2013.
[43]
Y. Huai et al. Major technical advancements in Apache Hive. In SIGMOD, pages 1235--1246, 2014.
[44]
Z. Liu et al. JSON data management: Supporting schema-less development in RDBMS. In SIGMOD, pages 1247--1258, New York, NY, USA, 2014.
[45]
J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. In SIGMOD, pages 13--24, 2012.
[46]
J. Zhou and K. Ross. A multi-resolution block storage model for database design. In IDEAS, pages 22--31, July 2003.

Published In

Proceedings of the VLDB Endowment, Volume 10, Issue 4
November 2016
180 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 November 2016
Published in PVLDB Volume 10, Issue 4

Cited By

  • (2024) Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion Pipelines. Proceedings of the VLDB Endowment, 17:11 (3456-3469). DOI: 10.14778/3681954.3682013. Online publication date: 1-Jul-2024
  • (2024) RTScan: Efficient Scan with Ray Tracing Cores. Proceedings of the VLDB Endowment, 17:6 (1460-1472). DOI: 10.14778/3648160.3648183. Online publication date: 3-May-2024
  • (2024) Exploiting Data-pattern-aware Vertical Partitioning to Achieve Fast and Low-cost Cloud Log Storage. ACM Transactions on Storage, 20:2 (1-35). DOI: 10.1145/3643641. Online publication date: 19-Feb-2024
  • (2024) Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques. Journal of Computer Science and Technology, 39:2 (346-368). DOI: 10.1007/s11390-024-3538-1. Online publication date: 1-Mar-2024
  • (2024) Towards flexibility and robustness of LSM trees. The VLDB Journal, 33:4 (1105-1128). DOI: 10.1007/s00778-023-00826-9. Online publication date: 1-Jul-2024
  • (2023) Sieve: A Learned Data-Skipping Index for Data Analytics. Proceedings of the VLDB Endowment, 16:11 (3214-3226). DOI: 10.14778/3611479.3611520. Online publication date: 24-Aug-2023
  • (2023) Pando: Enhanced Data Skipping with Logical Data Partitioning. Proceedings of the VLDB Endowment, 16:9 (2316-2329). DOI: 10.14778/3598581.3598601. Online publication date: 1-May-2023
  • (2023) SH2O: Efficient Data Access for Work-Sharing Databases. Proceedings of the ACM on Management of Data, 1:3 (1-26). DOI: 10.1145/3617340. Online publication date: 13-Nov-2023
  • (2023) LogGrep: Fast and Cheap Cloud Log Storage by Exploiting both Static and Runtime Patterns. Proceedings of the Eighteenth European Conference on Computer Systems (452-468). DOI: 10.1145/3552326.3567484. Online publication date: 8-May-2023
  • (2023) SAT: sampling acceleration tree for adaptive database repartition. World Wide Web, 26:5 (3503-3533). DOI: 10.1007/s11280-023-01199-3. Online publication date: 3-Aug-2023
