
Blockjoin: efficient matrix partitioning through joins

Published: 01 September 2017

Abstract

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets, in which case they are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups of up to 6× over state-of-the-art pipelines implemented in Spark, as well as BlockJoin's resistance to data skew.
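To make the abstract's core idea concrete, the following is a minimal single-machine sketch (not the paper's implementation, and without any dataflow engine): first a key-only "index join" determines which rows match and where each lands in the result matrix, and only then are the payloads materialized directly into fixed-size blocks, skipping the row-partitioned intermediate that a naive pipeline would build and re-partition. The function name `block_join` and the dict-based inputs are illustrative assumptions.

```python
def block_join(left, right, block_size):
    """Join two keyed feature tables into a block-partitioned matrix.

    left, right: dict mapping join key -> list of numeric features.
    Returns: dict mapping (block_row, block_col) -> list of row slices.
    """
    # 1. Index-join: match keys only; no feature payloads are touched yet.
    matches = sorted(set(left) & set(right))
    # 2. Assign each matched key its global row index in the result matrix.
    row_of = {k: i for i, k in enumerate(matches)}
    # 3. Late materialization: fetch payloads and write them straight into
    #    their destination blocks, instead of first building a
    #    row-partitioned joined table and re-blocking it afterwards.
    blocks = {}
    for k in matches:
        row = left[k] + right[k]  # concatenated feature vector
        r = row_of[k]
        for c0 in range(0, len(row), block_size):
            bid = (r // block_size, c0 // block_size)
            blocks.setdefault(bid, []).append(row[c0:c0 + block_size])
    return blocks
```

In a distributed setting, step 1 corresponds to shuffling only (key, row-index) pairs, so the expensive wide payloads cross the network exactly once, already routed to their target blocks.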



    Published In

    Proceedings of the VLDB Endowment, Volume 10, Issue 13
    Proceedings of the 43rd International Conference on Very Large Data Bases, Munich, Germany
    September 2017
    72 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2017
    Published in PVLDB Volume 10, Issue 13

    Qualifiers

    • Research-article

    Cited By
    • (2020) Exploiting Sharing Join Opportunities in Big Data Multiquery Optimization with Flink. Complexity, 2020. DOI: 10.1155/2020/6617149. Online publication date: 1-Jan-2020.
    • (2020) Data Preparation for Duplicate Detection. Journal of Data and Information Quality, 12(3):1-24. DOI: 10.1145/3377878. Online publication date: 13-Jun-2020.
    • (2020) On the Expressive Power of Linear Algebra on Graphs. Theory of Computing Systems, 65(1):179-239. DOI: 10.1007/s00224-020-09990-9. Online publication date: 4-Oct-2020.
    • (2019) An Intermediate Representation for Optimizing Machine Learning Pipelines. Proceedings of the VLDB Endowment, 12(11):1553-1567. DOI: 10.14778/3342263.3342633. Online publication date: 1-Jul-2019.
    • (2019) Data Management Systems Research at TU Berlin. ACM SIGMOD Record, 47(4):23-28. DOI: 10.1145/3335409.3335415. Online publication date: 17-May-2019.
    • (2019) Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators. Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pages 1-5. DOI: 10.1145/3329486.3329498. Online publication date: 30-Jun-2019.
    • (2018) Mosaics in Big Data. Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, pages 7-13. DOI: 10.1145/3210284.3214344. Online publication date: 25-Jun-2018.
