Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework

Qianwen Ye, Chase Q. Wu, Wuji Liu, Aiqin Hou & Wei Shen

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12454)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Big data processing and analysis increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of big data workflows is now commonly supported on reliable and scalable data storage and computing platforms such as Hadoop. A variety of factors affect workflow performance across multiple layers of big data systems, including the inherent properties of the workflow (such as scale and topology), the parallel computing engine it runs on, the resource manager that orchestrates distributed resources, the file system that stores data, and the parameter settings of each layer. Optimizing workflow performance is challenging because the compound effects of these layers are complex and opaque to end users. Tuning their parameters generally requires an in-depth understanding of big data systems, and the default settings do not always yield optimal performance. We propose a profiling-based cross-layer coupled design framework that determines the best parameter setting for each layer in the entire technology stack to optimize workflow performance. To tackle the large parameter space, we reduce the number of experiments needed for profiling with two approaches: (i) identifying a subset of critical parameters with the most significant influence through feature selection; and (ii) shortening the search within the value range of each critical parameter using stochastic approximation. Experimental results show that the proposed optimization framework provides the most suitable parameter settings for a given workflow to achieve the best performance. This profiling-based method could be used by end users and service providers to configure and execute large-scale workflows in complex big data systems.
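
The two reduction steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: `profile_runtime` is a hypothetical synthetic stand-in for a real profiling run, the influence score (variance of bin-averaged runtime) is a simple substitute for whatever feature-selection criterion the chapter actually uses, and simultaneous perturbation stochastic approximation (SPSA) is assumed as the stochastic approximation variant. All function names, ranges, and gain constants are illustrative.

```python
import random

def profile_runtime(config, rng):
    """Hypothetical stand-in for one profiling run of a workflow.

    Returns a noisy synthetic "execution time". By construction only
    parameters 0 and 2 influence it; parameter 1 is irrelevant.
    """
    x0, _x1, x2 = config
    return (x0 - 3.0) ** 2 + 2.0 * (x2 + 1.0) ** 2 + 10.0 + rng.gauss(0.0, 0.05)

def influence_score(values, runtimes, lo, hi, bins=5):
    """Variance of the mean runtime across equal-width bins of one parameter."""
    sums, counts = [0.0] * bins, [0] * bins
    for v, t in zip(values, runtimes):
        b = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        sums[b] += t
        counts[b] += 1
    means = [s / c for s, c in zip(sums, counts) if c > 0]  # skip empty bins
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / len(means)

def select_critical(samples, runtimes, lo, hi, keep):
    """Rank parameters by influence score and keep the top `keep` indices."""
    dims = len(samples[0])
    scores = [influence_score([s[j] for s in samples], runtimes, lo, hi)
              for j in range(dims)]
    ranked = sorted(range(dims), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

def spsa_minimize(loss, theta, rng, iters=300, a=0.2, c=0.2,
                  A=10.0, alpha=0.602, gamma=0.101):
    """SPSA: estimate the gradient from two noisy loss evaluations per step."""
    theta = list(theta)
    for k in range(iters):
        a_k = a / (k + 1 + A) ** alpha      # decaying step size
        c_k = c / (k + 1) ** gamma          # decaying perturbation size
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        diff = loss(plus) - loss(minus)
        theta = [t - a_k * diff / (2.0 * c_k * d) for t, d in zip(theta, delta)]
    return theta

rng = random.Random(0)
LO, HI = -5.0, 5.0               # assumed common value range for all parameters
defaults = [0.0, 0.0, 0.0]       # assumed default settings

# Step 1: random profiling runs, then keep the two most influential parameters.
samples = [[rng.uniform(LO, HI) for _ in range(3)] for _ in range(300)]
runtimes = [profile_runtime(s, rng) for s in samples]
critical = select_critical(samples, runtimes, LO, HI, keep=2)

# Step 2: tune only the critical parameters; the rest stay at their defaults.
def restricted_loss(values):
    config = list(defaults)
    for idx, v in zip(critical, values):
        config[idx] = v
    return profile_runtime(config, rng)

best = spsa_minimize(restricted_loss, [defaults[i] for i in critical], rng)
print("critical parameters:", critical)
print("tuned values:", [round(v, 2) for v in best])
```

In this sketch, SPSA needs two profiling runs per iteration no matter how many parameters are tuned, which is what makes this family of methods attractive when each run is an expensive workflow execution. A real deployment would replace `profile_runtime` with an actual job submission and would likely clip or round each tuned value to the range the system accepts.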

Notes

1. A module is a processing unit in a workflow, executed serially or in parallel; it is also referred to as a job or subtask in some contexts.


Acknowledgments

This research is sponsored by the U.S. National Science Foundation under Grant No. CNS-1828123 with New Jersey Institute of Technology.

Author information

Authors and Affiliations

Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
Qianwen Ye, Chase Q. Wu & Wuji Liu
School of Information Science and Technology, Northwest University, Xi’an, 710127, Shaanxi, China
Aiqin Hou
School of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou, 310018, Zhejiang, China
Wei Shen

Corresponding author

Correspondence to Chase Q. Wu.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Meikang Qiu

About this paper

Cite this paper

Ye, Q., Wu, C.Q., Liu, W., Hou, A., Shen, W. (2020). Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework. In: Qiu, M. (ed.) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12454. Springer, Cham. https://doi.org/10.1007/978-3-030-60248-2_14

DOI: https://doi.org/10.1007/978-3-030-60248-2_14
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60247-5
Online ISBN: 978-3-030-60248-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
