DOI: 10.1145/3649169.3649246
Research Article | Open Access

MUPPET: Optimizing Performance in OpenMP via Mutation Testing

Published: 06 March 2024

Abstract

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet's mutations help developers reason about performance defects and missed opportunities to improve performance at the source-code level. In contrast to compiler techniques that optimize code at the intermediate-representation (IR) level, Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that are otherwise not feasible at the IR level. We demonstrate Muppet's concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program's parallelism in various ways, and can run a variety of optimization algorithms, such as Bayesian optimization and delta debugging, to find a subset of mutations which, when applied to the original program, yields the greatest speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds speedup-inducing sets of mutations in 70% of the evaluated programs.
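
To make the idea concrete, here is a minimal sketch of the kind of source-level OpenMP mutation described above. The loop, the specific clauses, and the timing harness are illustrative assumptions, not code from the paper: a mutation rewrites the directive on a hot loop, and the mutated variant is kept only if it stays correct and runs faster.

```c
/* Hypothetical sketch of a source-level OpenMP mutation: the directive on a
 * hot loop is rewritten, changing how iterations map to threads.
 * Not Muppet's code. Compile with: cc -O2 -fopenmp mutation_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000

/* Original variant: default (typically static) scheduling. */
static void daxpy_original(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Mutated variant: the mutation adds SIMD vectorization and a chunked
 * dynamic schedule; the loop's semantics are unchanged, its parallelism
 * is not. */
static void daxpy_mutated(int n, double a, const double *x, double *y) {
    #pragma omp parallel for simd schedule(dynamic, 4096)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double t0 = omp_get_wtime();
    daxpy_original(N, 3.0, x, y);
    double t1 = omp_get_wtime();
    daxpy_mutated(N, 3.0, x, y);
    double t2 = omp_get_wtime();

    printf("original: %.3f s   mutated: %.3f s\n", t1 - t0, t2 - t1);
    free(x);
    free(y);
    return 0;
}
```

The subset search can be sketched the same way. The program below uses a deliberately simplified greedy pass rather than Muppet's Bayesian optimization or delta debugging, and evaluate() is a hypothetical stub standing in for "apply the mutation set to the source, rebuild, validate the output, and time the run".

```c
/* Simplified greedy search over mutation subsets: keep a mutation only if
 * the program stays correct and gets faster. A stand-in for the optimization
 * algorithms named in the abstract, not Muppet's implementation. */
#include <stdbool.h>
#include <stdio.h>

#define NMUT 4

/* Hypothetical oracle: runtime of the program with the given mutations
 * enabled, or -1.0 if the mutated program produces incorrect output.
 * Stub behavior: 10 s baseline; mutations 1 and 3 help, 2 breaks it. */
static double evaluate(const bool enabled[NMUT]) {
    if (enabled[2]) return -1.0;
    double t = 10.0;
    if (enabled[1]) t -= 2.0;
    if (enabled[3]) t -= 1.5;
    return t;
}

int main(void) {
    bool enabled[NMUT] = { false };      /* start from the original program */
    double best = evaluate(enabled);     /* baseline runtime */

    for (int m = 0; m < NMUT; m++) {
        enabled[m] = true;               /* tentatively apply mutation m */
        double t = evaluate(enabled);
        if (t < 0.0 || t >= best)
            enabled[m] = false;          /* incorrect or not faster: revert */
        else
            best = t;                    /* keep it; new best runtime */
    }

    printf("best runtime: %.1f s, mutations kept:", best);
    for (int m = 0; m < NMUT; m++)
        if (enabled[m]) printf(" %d", m);
    printf("\n");
    return 0;
}
```

In practice the space of mutation sets is large and each evaluation requires a rebuild and a run, which is why the paper turns to search strategies such as Bayesian optimization and delta debugging rather than the one-pass greedy reduction sketched here.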


Cited By

  • (2024) Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/SCW63240.2024.00080, 577-587. Online publication date: 17-Nov-2024.




Published In

PMAM '24: Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores
March 2024
65 pages
ISBN: 9798400705991
DOI: 10.1145/3649169
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2024


Author Tags

  1. Bayesian optimization
  2. OpenMP
  3. delta-debugging algorithm
  4. dynamic program analysis
  5. mutation testing
  6. performance optimization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '24

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%


Article Metrics

  • Downloads (last 12 months): 352
  • Downloads (last 6 weeks): 51
Reflects downloads up to 05 Feb 2025

