
Blockjoin: efficient matrix partitioning through joins

Published: 01 September 2017

Abstract

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets, in which case they are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups of up to 6× over state-of-the-art pipelines implemented in Spark, as well as BlockJoin's resistance to data skew.
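To make the abstract's core idea concrete, the following is a minimal single-machine sketch (not the paper's implementation, and without any dataflow engine): first a key-only "index join" determines which rows match and where each lands in the result matrix, and only then are the payloads materialized directly into fixed-size blocks, skipping the row-partitioned intermediate that a naive pipeline would build and re-partition. The function name `block_join` and the dict-based inputs are illustrative assumptions.

```python
def block_join(left, right, block_size):
    """Join two keyed feature tables into a block-partitioned matrix.

    left, right: dict mapping join key -> list of numeric features.
    Returns: dict mapping (block_row, block_col) -> list of row slices.
    """
    # 1. Index-join: match keys only; no feature payloads are touched yet.
    matches = sorted(set(left) & set(right))
    # 2. Assign each matched key its global row index in the result matrix.
    row_of = {k: i for i, k in enumerate(matches)}
    # 3. Late materialization: fetch payloads and write them straight into
    #    their destination blocks, instead of first building a
    #    row-partitioned joined table and re-blocking it afterwards.
    blocks = {}
    for k in matches:
        row = left[k] + right[k]  # concatenated feature vector
        r = row_of[k]
        for c0 in range(0, len(row), block_size):
            bid = (r // block_size, c0 // block_size)
            blocks.setdefault(bid, []).append(row[c0:c0 + block_size])
    return blocks
```

In a distributed setting, step 1 corresponds to shuffling only (key, row-index) pairs, so the expensive wide payloads cross the network exactly once, already routed to their target blocks.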



    Published In

    Proceedings of the VLDB Endowment, Volume 10, Issue 13
    Proceedings of the 43rd International Conference on Very Large Data Bases, Munich, Germany
    September 2017
    72 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2017
    Published in PVLDB Volume 10, Issue 13

    Qualifiers

    • Research-article

    Cited By
    • (2020) Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink. Complexity, 2020. DOI: 10.1155/2020/6617149. Online publication date: 1-Jan-2020.
    • (2020) Data Preparation for Duplicate Detection. Journal of Data and Information Quality, 12(3):1-24. DOI: 10.1145/3377878. Online publication date: 13-Jun-2020.
    • (2020) On the Expressive Power of Linear Algebra on Graphs. Theory of Computing Systems, 65(1):179-239. DOI: 10.1007/s00224-020-09990-9. Online publication date: 4-Oct-2020.
    • (2019) An Intermediate Representation for Optimizing Machine Learning Pipelines. Proceedings of the VLDB Endowment, 12(11):1553-1567. DOI: 10.14778/3342263.3342633. Online publication date: 1-Jul-2019.
    • (2019) Data Management Systems Research at TU Berlin. ACM SIGMOD Record, 47(4):23-28. DOI: 10.1145/3335409.3335415. Online publication date: 17-May-2019.
    • (2019) Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators. Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pages 1-5. DOI: 10.1145/3329486.3329498. Online publication date: 30-Jun-2019.
    • (2018) Mosaics in Big Data. Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, pages 7-13. DOI: 10.1145/3210284.3214344. Online publication date: 25-Jun-2018.
