DOI: 10.1145/2967938.2967957

Auto-tuning Spark Big Data Workloads on POWER8: Prediction-Based Dynamic SMT Threading

Published: 11 September 2016

Abstract

Much research effort is devoted to tuning big data analytics in modern data centers, since even a small percentage of performance improvement immediately translates into huge cost savings at such large scale. Simultaneous multithreading (SMT) has received great interest from the data center community, as it has the potential to boost the performance of big data analytics by increasing processor resource utilization. For example, emerging processor architectures such as POWER8 support up to 8-way multithreading. However, because different big data workloads exhibit disparate architectural characteristics, identifying the SMT configuration that delivers the best performance is challenging, given both complex application behaviors and complex processor architectures. In this paper, we focus specifically on auto-tuning the SMT configuration for Spark-based big data workloads on POWER8; our methodology, however, can be generalized and extended to other software stacks and other architectures.
We propose a prediction-based dynamic SMT threading (PBDST) framework that adjusts the thread count of SMT cores on POWER8 processors using versatile machine learning algorithms. Its innovation lies in adopting online SMT configuration predictions, derived from micro-architecture-level profiling, to regulate thread counts so as to achieve nearly optimal performance. Moreover, it is implemented at the Spark software-stack layer and is transparent to user applications. After evaluating a large set of machine learning algorithms, we choose the most efficient ones to perform online predictions. The experimental results demonstrate that our approach achieves up to 56.3% performance improvement and an average performance gain of 16.2% compared with the default configuration on our system, the maximum SMT configuration (SMT8).
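To make the prediction loop concrete, the sketch below shows one plausible realization of this kind of counter-driven SMT selection, not the paper's actual PBDST implementation: a classifier is trained offline on micro-architecture features gathered with hardware performance counters (e.g., Linux perf), and at runtime each profiling window is mapped to the SMT level predicted to perform best. The feature set, the training labels, the pick_smt_level helper, and the choice of a decision-tree classifier are all illustrative assumptions.

# Illustrative sketch only (assumptions noted above), not the authors' PBDST code.
# Idea: train a model offline that maps counter-derived features to the best SMT
# level, then query it online to pick the thread count for the next interval.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # one of many candidate models

SMT_LEVELS = (1, 2, 4, 8)  # POWER8 supports up to 8 hardware threads per core

# Hypothetical offline training data: each row is one profiling window described
# by [IPC, L2 miss rate, L3 miss rate, branch mispredict rate]; the label is the
# SMT level that performed best for that window in offline sweeps.
X_train = np.array([
    [1.9, 0.010, 0.004, 0.005],   # compute-bound window
    [0.5, 0.120, 0.060, 0.015],   # memory-bound window
    [1.1, 0.050, 0.020, 0.010],   # mixed window
])
y_train = np.array([8, 2, 4])

model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

def pick_smt_level(window_features: np.ndarray) -> int:
    """Predict the SMT level (1/2/4/8) expected to run this window fastest."""
    level = int(model.predict(window_features.reshape(1, -1))[0])
    return level if level in SMT_LEVELS else 8  # fall back to the SMT8 default

# Online use (the profiling and reconfiguration hooks below are hypothetical):
#   features = read_perf_counters()                  # e.g., via perf stat
#   smt = pick_smt_level(features)
#   reconfigure_spark_executor_threads(cores * smt)  # adjust active thread count

In such a design, the classifier itself is cheap to query, so the dominant cost is the periodic counter collection, which keeps the tuning transparent to the Spark application.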




Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN:9781450341219
DOI:10.1145/2967938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. dynamic smt tuning
  2. power8
  3. spark big data

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
Overall Acceptance Rate 121 of 471 submissions, 26%
