DOI: 10.1145/1356058.1356084

Program optimization space pruning for a multithreaded GPU

Published: 06 April 2008

Abstract

Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers, many of whom lack the substantial experience and knowledge needed to maximize performance, will be creating highly-parallel applications for these platforms. This creates a need for more structured optimization methods, along with means to estimate their performance effects; furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents a relatively simple methodology for reducing the workload involved in the optimization process.
This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.



Published In

CGO '08: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization
April 2008, 235 pages
ISBN: 9781595939784
DOI: 10.1145/1356058
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpgpu
  2. optimization
  3. parallel computing

Qualifiers

  • Research-article

Conference

CGO '08

Acceptance Rates

CGO '08 paper acceptance rate: 21 of 66 submissions, 32%
Overall acceptance rate: 312 of 1,061 submissions, 29%

