Massively Parallel Automated Software Tuning

Authors:

Yaohung M. Tsai,

Ahmad Abdelfattah,

Jack Dongarra

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

Article No.: 92, Pages 1 - 10

https://doi.org/10.1145/3337821.3337908

Published: 05 August 2019

Abstract

This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpeN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows vast autotuning sweeps to be deployed on massively parallel machines. The software implements dynamic work scheduling across distributed-memory resources, uses multithreading for parallel compilation, and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, the challenges encountered are discussed, and future directions are outlined.
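The pipeline described in the abstract (parallel compilation feeding dynamically scheduled kernel launches on several accelerators) can be illustrated with a minimal single-node sketch. This is not the authors' implementation: the function names, the tile/unroll parameters, and the cost model are hypothetical stand-ins, Python threads replace real compiler invocations and GPU launches, and distribution across nodes is omitted entirely.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def compile_variant(config):
    """Hypothetical stand-in for compiling one kernel variant."""
    tile, unroll = config
    return {"config": config, "binary": f"kernel_t{tile}_u{unroll}"}


def run_on_device(device_id, compiled):
    """Hypothetical stand-in for launching and timing a kernel.

    The returned cost is an invented deterministic formula, not a
    measurement; it is smallest at tile=32, unroll=4.
    """
    tile, unroll = compiled["config"]
    return abs(tile - 32) + abs(unroll - 4)


def autotune(configs, num_devices=2, compile_threads=4):
    """Compile variants in parallel and let idle devices pull work."""
    compiled_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def device_worker(device_id):
        # Dynamic scheduling: each device takes the next ready variant
        # as soon as it becomes idle, until it sees the None sentinel.
        while True:
            compiled = compiled_queue.get()
            if compiled is None:
                break
            cost = run_on_device(device_id, compiled)
            with results_lock:
                results.append((cost, compiled["config"], device_id))

    workers = [
        threading.Thread(target=device_worker, args=(d,))
        for d in range(num_devices)
    ]
    for w in workers:
        w.start()

    # Multithreaded "compilation"; finished variants are queued for launch.
    with ThreadPoolExecutor(max_workers=compile_threads) as pool:
        for compiled in pool.map(compile_variant, configs):
            compiled_queue.put(compiled)

    for _ in workers:
        compiled_queue.put(None)  # one stop sentinel per device worker
    for w in workers:
        w.join()

    best = min(results, key=lambda r: r[0])
    return best, results


if __name__ == "__main__":
    sweep = list(product([8, 16, 32, 64], [1, 2, 4, 8]))
    best, all_results = autotune(sweep)
    print("best configuration:", best[1], "cost:", best[0])
```

A shared queue is used here because it lets faster devices naturally take more work than slower ones, which is the essence of dynamic scheduling; a static partition of the sweep would not have that property.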

Index Terms

Massively Parallel Automated Software Tuning
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Software system models
        Massively parallel systems

Information & Contributors

Information

Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

August 2019

1107 pages

ISBN: 9781450362955

DOI: 10.1145/3337821

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

ICPP 2019

ICPP 2019: 48th International Conference on Parallel Processing

August 5 - 8, 2019

Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 406
Downloads (Last 12 months): 79
Downloads (Last 6 weeks): 13

Reflects downloads up to 14 Nov 2024

Citations

Cited By

Li X, Agrawal G. (2021). Shrinking Sample Search Algorithm for Automatic Tuning of GPU Kernels. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 262-271. Online publication date: Dec-2021. https://doi.org/10.1109/HiPC53243.2021.00040

Huang F, Yang H, Tao J, Wang J, Tan X. (2021). Preliminary study on the automatic parallelism optimization model for image enhancement algorithms based on Intel's® Xeon Phi. Concurrency and Computation: Practice and Experience 33:16. Online publication date: 6-May-2021. https://doi.org/10.1002/cpe.6260

Al Farhan M, Abdelfattah A, Tomov S, Gates M, Sukkari D, Haidar A, Rosenberg R, Dongarra J. (2020). MAGMA templates for scalable linear algebra on emerging architectures. The International Journal of High Performance Computing Applications (109434202093842). Online publication date: 10-Jul-2020. https://doi.org/10.1177/1094342020938421

Cabrera A, Chamberlain R. (2020). Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels. 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1-7. Online publication date: 22-Sep-2020. https://doi.org/10.1109/HPEC43674.2020.9286221
