research-article

Locality-aware Partitioning in Parallel Database Systems

Authors:

Erfan Zamanian,

Carsten Binnig,

Abdallah SalamaAuthors Info & Claims

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 17 - 30

https://doi.org/10.1145/2723372.2723718

Published: 27 May 2015 Publication History

Abstract

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated.

In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows to co-partition sets of tables based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.

References

[1]

Cloudera Impala. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.

[2]

TPC-DS. http://www.tpc.org/tpcds/.

[3]

TPC-H. http://www.tpc.org/tpch/.

[4]

F. Akal, K. Böhm, and H.-J. Schek. OLAP Query Evaluation in a Database Cluster: A Performance Study on Intra-Query Parallelism. In ADBIS, pages 218--231, 2002.

Digital Library

[5]

C. Binnig, N. May, and T. Mindnich. SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA. In BTW, pages 363--382, 2013.

[6]

C. Binnig, A. Salama, A. C. Müller, E. Zamanian, H. Kornmayer, and S. Lising. XDB: a novel database architecture for data analytics as a service. In SoCC, page 39, 2013.

Digital Library

[7]

C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: a Workload-Driven Approach to Database Replication and Partitioning. PVLDB, 3(1):48--57, 2010.

Digital Library

[8]

D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE Trans. Knowl. Data Eng., 2(1):44--62, 1990.

Digital Library

[9]

G. Eadon, E. I. Chong, S. Shankar, A. Raghavan, J. Srinivasan, and S. Das. Supporting table partitioning by reference in Oracle. In SIGMOD Conference, pages 1111--1122, 2008.

Digital Library

[10]

S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE. In VLDB, pages 209--219, 1986.

Digital Library

[11]

R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1994.

Digital Library

[12]

H. Herodotou, N. Borisov, and S. Babu. Query optimization techniques for partitioned tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12--16, 2011, pages 49--60, 2011.

Digital Library

[13]

A. A. B. Lima, M. Mattoso, and P. Valduriez. Adaptive Virtual Partitioning for OLAP Query Processing in a Database Cluster. JIDM, 1(1):75--88, 2010.

[14]

R. V. Nehme and N. Bruno. Automated partitioning design in parallel database systems. In SIGMOD Conference, pages 1137--1148, 2011.

Digital Library

[15]

M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.

Digital Library

[16]

A. Pavlo, C. Curino, and S. B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD Conference, pages 61--72, 2012.

Digital Library

[17]

A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: scalable workload-aware data placement for transactional workloads. In EDBT, pages 430--441, 2013.

Digital Library

[18]

J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical database design in a parallel database. In SIGMOD Conference, pages 558--569, 2002.

Digital Library

[19]

W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-Sensitive Operators for Parallel Main-Memory Database Clusters. In ICDE, 2014.

[20]

T. Stöhr, H. Märtens, and E. Rahm. Multi-Dimensional Database Allocation for Parallel Data Warehouses. In VLDB, pages 273--284, 2000.

Digital Library

[21]

F. M. Waas. Beyond Conventional Data Warehousing - Massively Parallel Data Processing with Greenplum Database. In BIRTE (Informal Proceedings), 2008.

[22]

T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.

Digital Library

[23]

R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, pages 13--24, 2013.

Digital Library

Cited By

Koukosias AAnagnostopoulos CKolomvatsos K(2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
https://doi.org/10.1109/TKDE.2024.3485531
Lam JHelt JLloyd WLu HVanbever LZhang I(2024)Accelerating skewed workloads with performance multipliers in the TurboDB distributed databaseProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691892(1213-1228)Online publication date: 16-Apr-2024
https://dl.acm.org/doi/10.5555/3691825.3691892
Liu PLi CChen H(2024)Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning TechniquesJournal of Computer Science and Technology10.1007/s11390-024-3538-139:2(346-368)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1007/s11390-024-3538-1
Show More Cited By

Index Terms

Locality-aware Partitioning in Parallel Database Systems
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

An Optimal Balanced Partitioning of a Set of 1D Intervals

Given a set of 1D intervals and a desired partition number, in this paper, the author examines how to make an optimal partitioning of these intervals, such that the number of intervals between the largest partition and smallest partition is minimal ...
Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing
ICPP '15: Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP)

In modern multi-core processors, last-level caches (LLCs) are typically shared among multiple cores. Previous works have shown that such sharing is beneficial as different workloads have different needs for cache capacity, and logical partitioning of ...
Horizontal partitioning method for test verification in parallel database systems

In parallel database systems the partitioning methods considered in current researches are static. This research paper presents a partitioning method to divide the database relations into dynamic horizontal partitions. Every partition contains some ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tag

partitioning schemes

Qualifiers

Research-article

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

53
Total Citations
View Citations
2,041
Total Downloads

Downloads (Last 12 months)91
Downloads (Last 6 weeks)22

Reflects downloads up to 18 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Koukosias AAnagnostopoulos CKolomvatsos K(2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
https://doi.org/10.1109/TKDE.2024.3485531
Lam JHelt JLloyd WLu HVanbever LZhang I(2024)Accelerating skewed workloads with performance multipliers in the TurboDB distributed databaseProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691892(1213-1228)Online publication date: 16-Apr-2024
https://dl.acm.org/doi/10.5555/3691825.3691892
Liu PLi CChen H(2024)Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning TechniquesJournal of Computer Science and Technology10.1007/s11390-024-3538-139:2(346-368)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1007/s11390-024-3538-1
Chatterjee SSharma SIvan NVerma SRay SStoodley MZuzarte CFinlay IOnut IShirani POnut IOnut IBranco P(2023)Compilation of SQL Queries for Efficient Distributed In-Memory ProcessingProceedings of the 33rd Annual International Conference on Computer Science and Software Engineering10.5555/3615924.3615940(149-154)Online publication date: 11-Sep-2023
https://dl.acm.org/doi/10.5555/3615924.3615940
Zhang QLi JZhao HXu QLu WXiao JHan FYang CDu X(2023)Efficient Distributed Transaction Processing in Heterogeneous NetworksProceedings of the VLDB Endowment10.14778/3583140.358315316:6(1372-1385)Online publication date: 20-Apr-2023
https://dl.acm.org/doi/10.14778/3583140.3583153
Zhou XLi GFeng JLiu LGuo W(2023)Grep: A Graph Learning Based Database Partitioning SystemProceedings of the ACM on Management of Data10.1145/35889481:1(1-24)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588948
Zhao HLi JLu WZhang QYang WZhong JZhang MLi HDu XPan A(2023)RCBench: an RDMA-enabled transaction framework for analyzing concurrency control algorithmsThe VLDB Journal10.1007/s00778-023-00821-033:2(543-567)Online publication date: 14-Dec-2023
https://doi.org/10.1007/s00778-023-00821-0
Kumari S(2022)Cross-Domain Transfer Learning for Demand Forecasting: Using Social Media Sentiment from Related IndustriesJournal for Research in Applied Sciences and Biotechnology10.55544/jrasb.1.2.121:2(101-106)Online publication date: 30-Jun-2022
https://doi.org/10.55544/jrasb.1.2.12
Mosharraf SAdnan M(2022)Improving lookup and query execution performance in distributed Big Data systems using Cuckoo FilterJournal of Big Data10.1186/s40537-022-00563-w9:1Online publication date: 26-Jan-2022
https://doi.org/10.1186/s40537-022-00563-w
Jasny MThostrup LZiegler TBinnig CIves ZBonifati AEl Abbadi A(2022)P4DB - The Case for In-Network OLTPProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517825(1375-1389)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517825
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten