Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2723372.2723718acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Locality-aware Partitioning in Parallel Database Systems

Published: 27 May 2015 Publication History

Abstract

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated.
In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows to co-partition sets of tables based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.

References

[1]
Cloudera Impala. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.
[2]
TPC-DS. http://www.tpc.org/tpcds/.
[3]
TPC-H. http://www.tpc.org/tpch/.
[4]
F. Akal, K. Böhm, and H.-J. Schek. OLAP Query Evaluation in a Database Cluster: A Performance Study on Intra-Query Parallelism. In ADBIS, pages 218--231, 2002.
[5]
C. Binnig, N. May, and T. Mindnich. SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA. In BTW, pages 363--382, 2013.
[6]
C. Binnig, A. Salama, A. C. Müller, E. Zamanian, H. Kornmayer, and S. Lising. XDB: a novel database architecture for data analytics as a service. In SoCC, page 39, 2013.
[7]
C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: a Workload-Driven Approach to Database Replication and Partitioning. PVLDB, 3(1):48--57, 2010.
[8]
D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE Trans. Knowl. Data Eng., 2(1):44--62, 1990.
[9]
G. Eadon, E. I. Chong, S. Shankar, A. Raghavan, J. Srinivasan, and S. Das. Supporting table partitioning by reference in Oracle. In SIGMOD Conference, pages 1111--1122, 2008.
[10]
S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE. In VLDB, pages 209--219, 1986.
[11]
R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1994.
[12]
H. Herodotou, N. Borisov, and S. Babu. Query optimization techniques for partitioned tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12--16, 2011, pages 49--60, 2011.
[13]
A. A. B. Lima, M. Mattoso, and P. Valduriez. Adaptive Virtual Partitioning for OLAP Query Processing in a Database Cluster. JIDM, 1(1):75--88, 2010.
[14]
R. V. Nehme and N. Bruno. Automated partitioning design in parallel database systems. In SIGMOD Conference, pages 1137--1148, 2011.
[15]
M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[16]
A. Pavlo, C. Curino, and S. B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD Conference, pages 61--72, 2012.
[17]
A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In EDBT, pages 430--441, 2013.
[18]
J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical database design in a parallel database. In SIGMOD Conference, pages 558--569, 2002.
[19]
W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-Sensitive Operators for Parallel Main-Memory Database Clusters. In ICDE, 2014.
[20]
T. Stöhr, H. Märtens, and E. Rahm. Multi-Dimensional Database Allocation for Parallel Data Warehouses. In VLDB, pages 273--284, 2000.
[21]
F. M. Waas. Beyond Conventional Data Warehousing - Massively Parallel Data Processing with Greenplum Database. In BIRTE (Informal Proceedings), 2008.
[22]
T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
[23]
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, pages 13--24, 2013.

Cited By

View all
  • (2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
  • (2024)Accelerating skewed workloads with performance multipliers in the TurboDB distributed databaseProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691892(1213-1228)Online publication date: 16-Apr-2024
  • (2024)Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning TechniquesJournal of Computer Science and Technology10.1007/s11390-024-3538-139:2(346-368)Online publication date: 1-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Check for updates

Author Tag

  1. partitioning schemes

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)91
  • Downloads (Last 6 weeks)22
Reflects downloads up to 18 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
  • (2024)Accelerating skewed workloads with performance multipliers in the TurboDB distributed databaseProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691892(1213-1228)Online publication date: 16-Apr-2024
  • (2024)Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning TechniquesJournal of Computer Science and Technology10.1007/s11390-024-3538-139:2(346-368)Online publication date: 1-Mar-2024
  • (2023)Compilation of SQL Queries for Efficient Distributed In-Memory ProcessingProceedings of the 33rd Annual International Conference on Computer Science and Software Engineering10.5555/3615924.3615940(149-154)Online publication date: 11-Sep-2023
  • (2023)Efficient Distributed Transaction Processing in Heterogeneous NetworksProceedings of the VLDB Endowment10.14778/3583140.358315316:6(1372-1385)Online publication date: 20-Apr-2023
  • (2023)Grep: A Graph Learning Based Database Partitioning SystemProceedings of the ACM on Management of Data10.1145/35889481:1(1-24)Online publication date: 30-May-2023
  • (2023)RCBench: an RDMA-enabled transaction framework for analyzing concurrency control algorithmsThe VLDB Journal10.1007/s00778-023-00821-033:2(543-567)Online publication date: 14-Dec-2023
  • (2022)Cross-Domain Transfer Learning for Demand Forecasting: Using Social Media Sentiment from Related IndustriesJournal for Research in Applied Sciences and Biotechnology10.55544/jrasb.1.2.121:2(101-106)Online publication date: 30-Jun-2022
  • (2022)Improving lookup and query execution performance in distributed Big Data systems using Cuckoo FilterJournal of Big Data10.1186/s40537-022-00563-w9:1Online publication date: 26-Jan-2022
  • (2022)P4DB - The Case for In-Network OLTPProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517825(1375-1389)Online publication date: 10-Jun-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media