DOI: 10.1145/3415958.3433082
research-article

Scalable Execution of Big Data Workflows using Software Containers

Published: 27 November 2020

Abstract

Big Data processing involves handling large and complex data sets and incorporating different tools, frameworks, and other processes that help organisations make sense of data collected from various sources. These operations, referred to as Big Data workflows, require the elasticity of cloud infrastructures in order to scale. In this paper, we present the design and prototype implementation of a Big Data workflow approach that uses software container technologies and message-oriented middleware (MOM) to enable highly scalable workflow execution. We demonstrate the approach in a use case and in a set of experiments that show its practical applicability to the scalable execution of Big Data workflows. Furthermore, we compare the scalability of our approach with that of Argo Workflows, one of the most prominent tools in the area of Big Data workflows.
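
The general idea of connecting workflow steps through message queues can be sketched in a few lines. The example below is only an illustration under stated assumptions and is not the paper's implementation: in-process `queue.Queue` objects stand in for the message-oriented middleware, threads stand in for containers, and the names `start_step`, `run_workflow`, and `n_workers` are hypothetical. Each step reads from its own inbox and writes to the next step's inbox, so a step can be scaled by adding consumers without changing its neighbours.

```python
import queue
import threading


def start_step(inbox, outbox, transform, n_workers):
    """Start n_workers competing consumers for one workflow step.

    Each worker stands in for one container instance (an assumption of
    this sketch); all workers of a step share the same inbox queue.
    """
    def worker():
        while True:
            msg = inbox.get()
            if msg is None:  # sentinel: this worker should shut down
                return
            outbox.put(transform(msg))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads


def run_workflow(items, steps, n_workers=3):
    """Push items through a linear chain of steps joined by queues.

    Items must not be None, because None is reserved as the sentinel.
    """
    # queues[i] feeds step i; queues[-1] collects the final results.
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    pools = [
        start_step(queues[i], queues[i + 1], transform, n_workers)
        for i, transform in enumerate(steps)
    ]
    for item in items:
        queues[0].put(item)

    # Shut the steps down in order: once step i has drained its inbox
    # and exited, everything it produced is already in step i+1's inbox.
    for i, pool in enumerate(pools):
        for _ in pool:
            queues[i].put(None)
        for t in pool:
            t.join()

    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    # Workers finish in arbitrary order, so sort for a stable result.
    return sorted(results)


if __name__ == "__main__":
    print(run_workflow([1, 2, 3, 4], [lambda x: x * 2, lambda x: x + 1]))
```

Changing `n_workers` changes only how many consumers share each queue; the steps themselves are unaware of it. That decoupling is the property a queue-based design relies on for scaling.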

Published In

MEDES '20: Proceedings of the 12th International Conference on Management of Digital EcoSystems
November 2020
170 pages
ISBN:9781450381154
DOI:10.1145/3415958
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big Data workflows
  2. Domain-specific languages
  3. Software containers

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MEDES '20
MEDES '20: 12th International Conference on Management of Digital EcoSystems
November 2 - 4, 2020
Virtual Event, United Arab Emirates

Acceptance Rates

MEDES '20 Paper Acceptance Rate 19 of 27 submissions, 70%;
Overall Acceptance Rate 267 of 682 submissions, 39%

Article Metrics

  • Downloads (last 12 months): 30
  • Downloads (last 6 weeks): 3
Download counts as of 16 Nov 2024.

Cited By

  • (2024) sAirflow: Adopting Serverless in a Legacy Workflow Scheduler. Euro-Par 2024: Parallel Processing, 254-268. DOI: 10.1007/978-3-031-69577-3_18. Online publication date: 26-Aug-2024
  • (2022) Sustainable Big Data Analytics Process Pipeline Using Apache Ecosystem. Encyclopedia of Data Science and Machine Learning, 1247-1259. DOI: 10.4018/978-1-7998-9220-5.ch073. Online publication date: 14-Oct-2022
  • (2022) Dataclouddsl: Textual and Visual Presentation of Big Data Pipelines. 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 1165-1171. DOI: 10.1109/COMPSAC54236.2022.00183. Online publication date: Jun-2022
  • (2022) Matching-based Scheduling of Asynchronous Data Processing Workflows on the Computing Continuum. 2022 IEEE International Conference on Cluster Computing (CLUSTER), 58-70. DOI: 10.1109/CLUSTER51413.2022.00021. Online publication date: Sep-2022
  • (2022) Supporting Semantic Data Enrichment at Scale. Technologies and Applications for Big Data Value, 19-39. DOI: 10.1007/978-3-030-78307-5_2. Online publication date: 29-Apr-2022
  • (2021) Big Data Workflows: Locality-Aware Orchestration Using Software Containers. Sensors 21:24 (8212). DOI: 10.3390/s21248212. Online publication date: 8-Dec-2021
  • (2021) Locality-Aware Workflow Orchestration for Big Data. Proceedings of the 13th International Conference on Management of Digital EcoSystems, 62-70. DOI: 10.1145/3444757.3485106. Online publication date: 1-Nov-2021
  • (2021) Conceptualization and scalable execution of big data workflows using domain-specific languages and software containers. Internet of Things 16 (100440). DOI: 10.1016/j.iot.2021.100440. Online publication date: Dec-2021
