To Partition, or Not to Partition, That is the Join Question in a Real System

Authors:

Maximilian Bandle,

Jana Giceva,

Thomas Neumann

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 168 - 180

https://doi.org/10.1145/3448016.3452831

Published: 18 June 2021

Abstract

An efficient implementation of a hash join has been a highly researched problem for decades. Recently, the radix join has been shown to have superior performance over the alternatives (e.g., the non-partitioned hash join), albeit on synthetic microbenchmarks. Therefore, it is unclear whether one can simply replace the hash join in an RDBMS or use the radix join as a performance booster for selected queries. If the latter, it is still unknown when one should rely on the radix join to improve performance.

In this paper, we address these questions, show how to integrate the radix join in Umbra, a code-generating DBMS, and make it competitive for selective queries by introducing a Bloom-filter-based semi-join reducer. We have evaluated how well it runs when used in queries from more representative workloads like TPC-H. Surprisingly, the radix join brings a noticeable improvement in only one out of all 59 joins in TPC-H. Thus, with an extensive range of microbenchmarks, we have isolated the effects of the most important workload factors and synthesized the range of values where partitioning the data for the radix join pays off. Our analysis shows that the benefit of data partitioning quickly diminishes as soon as we deviate from the optimal parameters, and even late materialization rarely helps in real workloads. We thus conclude that integrating the radix join within a code-generating database rarely justifies the increase in code and optimizer complexity and advise against it for processing real-world workloads.

Supplementary Material

MP4 File (3448016.3452831.mp4)

An efficient implementation of a hash join has been a highly researched problem for decades. Recently, the radix join has been shown to have superior performance over the alternatives (e.g., the non-partitioned hash join), albeit on synthetic microbenchmarks. So, it is not clear whether one can simply replace the hash join in an RDBMS or use the radix join as a performance booster for selected queries. If the latter, it is still unknown when one should rely on the radix join to improve performance.

In this paper, we address these questions, show how to integrate the radix join in a code-generating DBMS, and make it competitive for selective queries by introducing a Bloom-filter-based semi-join reducer. We then evaluate how well it runs when used in queries from more representative workloads like TPC-H. Surprisingly, the radix join brings a noticeable improvement in only one out of all 59 joins in TPC-H. Thus, with an extensive range of microbenchmarks, we isolate the effects of the most important workload factors and synthesize the range of values where partitioning the data for the radix join pays off. Our analysis shows that the benefit of data partitioning quickly diminishes as soon as we deviate from the optimal parameters, and does not compensate for the added materialization overhead. We thus conclude that integrating the radix join within a database rarely justifies the increase in code and optimizer complexity and advise against it.
