DOI: 10.1145/3649169.3649246
Research Article | Open Access

MUPPET: Optimizing Performance in OpenMP via Mutation Testing

Published: 06 March 2024

Abstract

Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing their portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet's mutations help developers reason about performance defects and missed opportunities to improve performance at the source-code level. In contrast to compiler techniques that optimize code at the intermediate-representation (IR) level, Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that are otherwise not feasible at the IR level. We demonstrate Muppet's concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program's parallelism in various ways, and can run a variety of optimization algorithms, such as Bayesian optimization and delta debugging, to find a subset of mutations which, when applied to the original program, yields the greatest speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds speedup-inducing sets of mutations in 70% of the evaluated programs.
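
To make the idea concrete, here is a minimal sketch of the kind of source-level OpenMP mutation described above. The loop, the specific clauses, and the timing harness are illustrative assumptions, not code from the paper: a mutation rewrites the directive on a hot loop, and the mutated variant is kept only if it stays correct and runs faster.

```c
/* Hypothetical sketch of a source-level OpenMP mutation: the directive on a
 * hot loop is rewritten, changing how iterations map to threads.
 * Not Muppet's code. Compile with: cc -O2 -fopenmp mutation_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000000

/* Original variant: default (typically static) scheduling. */
static void daxpy_original(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Mutated variant: the mutation adds SIMD vectorization and a chunked
 * dynamic schedule; the loop's semantics are unchanged, its parallelism
 * is not. */
static void daxpy_mutated(int n, double a, const double *x, double *y) {
    #pragma omp parallel for simd schedule(dynamic, 4096)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double t0 = omp_get_wtime();
    daxpy_original(N, 3.0, x, y);
    double t1 = omp_get_wtime();
    daxpy_mutated(N, 3.0, x, y);
    double t2 = omp_get_wtime();

    printf("original: %.3f s   mutated: %.3f s\n", t1 - t0, t2 - t1);
    free(x);
    free(y);
    return 0;
}
```

The subset search can be sketched the same way. The program below uses a deliberately simplified greedy pass rather than Muppet's Bayesian optimization or delta debugging, and evaluate() is a hypothetical stub standing in for "apply the mutation set to the source, rebuild, validate the output, and time the run".

```c
/* Simplified greedy search over mutation subsets: keep a mutation only if
 * the program stays correct and gets faster. A stand-in for the optimization
 * algorithms named in the abstract, not Muppet's implementation. */
#include <stdbool.h>
#include <stdio.h>

#define NMUT 4

/* Hypothetical oracle: runtime of the program with the given mutations
 * enabled, or -1.0 if the mutated program produces incorrect output.
 * Stub behavior: 10 s baseline; mutations 1 and 3 help, 2 breaks it. */
static double evaluate(const bool enabled[NMUT]) {
    if (enabled[2]) return -1.0;
    double t = 10.0;
    if (enabled[1]) t -= 2.0;
    if (enabled[3]) t -= 1.5;
    return t;
}

int main(void) {
    bool enabled[NMUT] = { false };      /* start from the original program */
    double best = evaluate(enabled);     /* baseline runtime */

    for (int m = 0; m < NMUT; m++) {
        enabled[m] = true;               /* tentatively apply mutation m */
        double t = evaluate(enabled);
        if (t < 0.0 || t >= best)
            enabled[m] = false;          /* incorrect or not faster: revert */
        else
            best = t;                    /* keep it; new best runtime */
    }

    printf("best runtime: %.1f s, mutations kept:", best);
    for (int m = 0; m < NMUT; m++)
        if (enabled[m]) printf(" %d", m);
    printf("\n");
    return 0;
}
```

In practice the space of mutation sets is large and each evaluation requires a rebuild and a run, which is why the paper turns to search strategies such as Bayesian optimization and delta debugging rather than the one-pass greedy reduction sketched here.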


Cited By

  • (2024) Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1109/SCW63240.2024.00080, 577-587. Online publication date: 17-Nov-2024.




Published In

PMAM '24: Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores
March 2024
65 pages
ISBN: 9798400705991
DOI: 10.1145/3649169
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2024


Author Tags

  1. Bayesian optimization
  2. OpenMP
  3. delta-debugging algorithm
  4. dynamic program analysis
  5. mutation testing
  6. performance optimization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '24

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%


Article Metrics

  • Downloads (last 12 months): 352
  • Downloads (last 6 weeks): 51
Reflects downloads up to 05 Feb 2025

