Meta-Dataflows: Efficient Exploratory Dataflow Jobs

Authors:

Raul Castro Fernandez,

William Culhane,

Pijika Watcharapichat,

Matthias Weidlich,

Victoria Lopez Morales,

Peter Pietzuch

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 1157 - 1172

https://doi.org/10.1145/3183713.3183760

Published: 27 May 2018

Abstract

Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g., in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. Submitting jobs one at a time, however, misses optimisation opportunities for exploratory workflows, in terms of both scheduling and memory allocation.

We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator, which automatically considers choices in a dataflow; and (b) a choose operator, which assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results, discarding results from underperforming branches, and pruning unnecessary branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential job execution.
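The two primitives can be illustrated with a small single-machine sketch. This is not the paper's API or implementation: the function names `explore`, `choose`, `clean`, `clip`, and `score`, the toy data, and the quality metric are all hypothetical and chosen only for illustration. The sketch shows one shared intermediate result being computed once and reused by every explored branch, after which only the best-scoring branch is kept.

```python
from statistics import mean

# Hypothetical counter used only to show that the shared stage runs once.
calls = {"clean": 0}


def clean(records):
    """Shared upstream stage: drop missing values."""
    calls["clean"] += 1
    return [r for r in records if r is not None]


def clip(records, threshold):
    """Branch-specific stage: cap every value at `threshold`."""
    return [min(r, threshold) for r in records]


def score(records):
    """Toy quality metric (an assumption): closer mean to 6.0 is better."""
    return -abs(mean(records) - 6.0)


def explore(shared_input, stage, choices):
    """Run `stage` once per choice, all on the same precomputed input."""
    return {choice: stage(shared_input, choice) for choice in choices}


def choose(branches, score_fn, k=1):
    """Keep the k branches whose results score highest; discard the rest."""
    ranked = sorted(branches.items(), key=lambda kv: score_fn(kv[1]), reverse=True)
    return dict(ranked[:k])


raw = [1, None, 4, 9, 30, None, 6]
cleaned = clean(raw)                                  # computed once, reused by all branches
branches = explore(cleaned, clip, choices=[5, 10, 20])
best = choose(branches, score, k=1)
print(best)
```

Submitting three separate jobs would instead rerun `clean` for each threshold. Sharing `cleaned` across branches is the kind of redundancy the abstract describes avoiding, here in a deliberately simplified form with no distribution, scheduling, or memory management.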

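Cited By

Hilbrich M, Müller S, Kulagina S, Lazik C, De Mecquenem N, Grunske L. (2022). A Consolidated View on Specification Languages for Data Analysis Workflows. Leveraging Applications of Formal Methods, Verification and Validation. Software Engineering (201-215). Online publication date: 17-Oct-2022.
https://doi.org/10.1007/978-3-031-19756-7_12
Park S, Jeong M, Han H. (2021). CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks. Sensors 21:7 (2321). Online publication date: 26-Mar-2021.
https://doi.org/10.3390/s21072321
Battle L, Scheidegger C. (2021). A Structured Review of Data Management Technology for Interactive Visualization and Analysis. IEEE Transactions on Visualization and Computer Graphics 27:2 (1128-1138). Online publication date: Feb-2021.
https://doi.org/10.1109/TVCG.2020.3028891
Wang S, Pi A, Zhou X. (2021). Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling. IEEE Transactions on Parallel and Distributed Systems 33:5 (1128-1143). Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1109/TPDS.2021.3104242
Kunft A, Katsifodimos A, Schelter S, Breß S, Rabl T, Markl V. (2019). An intermediate representation for optimizing machine learning pipelines. Proceedings of the VLDB Endowment 12:11 (1553-1567). Online publication date: 1-Jul-2019.
https://dl.acm.org/doi/10.14778/3342263.3342633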

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications
    1. Decision support systems
      1. Data analytics

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018, 1874 pages

ISBN: 9781450347037

DOI: 10.1145/3183713

General Chairs: Gautam Das (University of Texas at Arlington, USA), Christopher Jermaine (Rice University, USA), Philip Bernstein (Microsoft Research, USA)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Google
BP plc
German Research Foundation (DFG)

Conference

SIGMOD/PODS '18: International Conference on Management of Data

Sponsor: SIGMOD

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 4
Total Downloads: 328
Downloads (last 12 months): 8
Downloads (last 6 weeks): 2

Reflects downloads up to 12 Nov 2024
