DOI: 10.1145/3183713.3183760
research-article

Meta-Dataflows: Efficient Exploratory Dataflow Jobs

Published: 27 May 2018

Abstract

Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation.
We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results, discarding results from underperforming branches, and pruning unnecessary branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential job execution.
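The two primitives and the reuse of intermediate results can be illustrated with a small single-process sketch. This is not the paper's API or implementation: the class `ToyMDF`, its methods, the helper functions and the sample data are all hypothetical names chosen for illustration, and the sketch models neither cluster execution, branch pruning nor memory allocation. It only shows how exploring a family of branches can share upstream results instead of recomputing them per submitted job.

```python
import itertools


class ToyMDF:
    """Hypothetical single-process stand-in for a meta-dataflow."""

    def __init__(self, source):
        self.source = list(source)
        self.stages = []      # one (name, function, choices) entry per explore
        self.cache = {}       # branch prefix -> intermediate result
        self.evaluations = 0  # number of stage executions actually performed

    def explore(self, name, fn, choices):
        """Register a stage whose parameter ranges over `choices`."""
        self.stages.append((name, fn, list(choices)))
        return self

    def _run_prefix(self, prefix):
        """Evaluate the first len(prefix) stages, reusing cached prefixes."""
        if not prefix:
            return self.source
        if prefix in self.cache:
            return self.cache[prefix]
        upstream = self._run_prefix(prefix[:-1])
        _, fn, _ = self.stages[len(prefix) - 1]
        self.evaluations += 1
        result = fn(upstream, prefix[-1])
        self.cache[prefix] = result
        return result

    def choose(self, score, k=1):
        """Score every explored branch and keep the k best (score, branch) pairs."""
        branches = itertools.product(*(choices for _, _, choices in self.stages))
        scored = [(score(self._run_prefix(branch)), branch) for branch in branches]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def drop_above(values, threshold):
    # Toy "data cleaning" choice: discard values above the threshold.
    return [v for v in values if v <= threshold]


def scale(values, factor):
    # Toy "model parameter" choice: multiply every value by the factor.
    return [v * factor for v in values]


def closeness_to_four(values):
    # Toy quality metric: higher is better, best when the mean equals 4.
    return -abs(sum(values) / len(values) - 4)


mdf = (
    ToyMDF([1, 2, 3, 100])
    .explore("clean", drop_above, [10, 1000])
    .explore("scale", scale, [1, 2, 3])
)
best = mdf.choose(closeness_to_four, k=1)
print(best[0][1], mdf.evaluations)
```

In this toy setting the six branches need eight stage executions rather than the twelve that six independently submitted two-stage jobs would perform, because each cleaning result is computed once and shared by the three scaling choices that follow it.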

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataflow
  2. distributed data processing
  3. exploratory workflows
  4. parallel data processing
  5. parameter space exploration

Qualifiers

  • Research-article

Funding Sources

  • Google
  • BP plc
  • German Research Foundation (DFG)

Conference

SIGMOD/PODS '18
Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 2
Reflects downloads up to 12 Nov 2024

Cited By

  • (2022) A Consolidated View on Specification Languages for Data Analysis Workflows. Leveraging Applications of Formal Methods, Verification and Validation. Software Engineering, pp. 201-215. DOI: 10.1007/978-3-031-19756-7_12. Online publication date: 17-Oct-2022
  • (2021) CCA: Cost-Capacity-Aware Caching for In-Memory Data Analytics Frameworks. Sensors 21:7 (2321). DOI: 10.3390/s21072321. Online publication date: 26-Mar-2021
  • (2021) A Structured Review of Data Management Technology for Interactive Visualization and Analysis. IEEE Transactions on Visualization and Computer Graphics 27:2 (1128-1138). DOI: 10.1109/TVCG.2020.3028891. Online publication date: Feb-2021
  • (2021) Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling. IEEE Transactions on Parallel and Distributed Systems 33:5 (1128-1143). DOI: 10.1109/TPDS.2021.3104242. Online publication date: 18-Oct-2021
  • (2019) An intermediate representation for optimizing machine learning pipelines. Proceedings of the VLDB Endowment 12:11 (1553-1567). DOI: 10.14778/3342263.3342633. Online publication date: 1-Jul-2019
