DOI: 10.1145/3337821.3337908

Massively Parallel Automated Software Tuning

Published: 05 August 2019

Abstract

This article presents an implementation of a distributed autotuning engine developed as part of the BEnch-testing OpeN Software Autotuning Infrastructure (BONSAI) project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows vast autotuning sweeps to be deployed to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources, takes advantage of multithreading for parallel compilation, and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, the challenges encountered are discussed, and future directions are outlined.
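The dynamic work scheduling mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: worker threads in a single process stand in for distributed-memory nodes, a toy cost function stands in for compiling and timing a GPU kernel, and the names `run_sweep`, `toy_cost`, `tile`, and `unroll` are hypothetical. The point is only that idle workers pull the next configuration from a shared queue instead of receiving a fixed share up front.

```python
import queue
import threading


def run_sweep(configs, evaluate, num_workers=4):
    """Evaluate every configuration; idle workers pull the next one."""
    tasks = queue.Queue()
    for cfg in configs:
        tasks.put(cfg)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                cfg = tasks.get_nowait()  # dynamic scheduling: take what is left
            except queue.Empty:
                return  # no work remains, so this worker exits
            score = evaluate(cfg)
            with lock:  # protect the shared results dictionary
                results[cfg] = score

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def toy_cost(cfg):
    """Hypothetical stand-in for compiling and timing one kernel variant."""
    tile, unroll = cfg
    return abs(tile - 32) + abs(unroll - 4)


# Hypothetical search space: every combination of tile size and unroll factor.
configs = [(tile, unroll) for tile in (8, 16, 32, 64) for unroll in (1, 2, 4, 8)]
results = run_sweep(configs, toy_cost)
best = min(results, key=results.get)  # lowest cost wins
print(best)
```

Because each worker asks for work only when it is free, configurations that take longer to evaluate do not leave other workers idle, which is the usual motivation for dynamic over static partitioning.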

Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN: 9781450362955
DOI: 10.1145/3337821
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automated software tuning
  2. graphics processing unit

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Cited By

  • (2021) Shrinking Sample Search Algorithm for Automatic Tuning of GPU Kernels. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), 262-271. DOI: 10.1109/HiPC53243.2021.00040. Online publication date: Dec-2021
  • (2021) Preliminary study on the automatic parallelism optimization model for image enhancement algorithms based on Intel's® Xeon Phi. Concurrency and Computation: Practice and Experience 33:16. DOI: 10.1002/cpe.6260. Online publication date: 6-May-2021
  • (2020) MAGMA templates for scalable linear algebra on emerging architectures. The International Journal of High Performance Computing Applications (109434202093842). DOI: 10.1177/1094342020938421. Online publication date: 10-Jul-2020
  • (2020) Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels. 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1-7. DOI: 10.1109/HPEC43674.2020.9286221. Online publication date: 22-Sep-2020
