DOI: 10.1145/1356058.1356084

Program optimization space pruning for a multithreaded GPU

Published: 06 April 2008

Abstract

Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers, many of whom lack the substantial experience and knowledge needed to maximize performance, will be creating highly-parallel applications for these platforms. This creates a need for more structured optimization methods, along with means to estimate their performance effects; furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents a relatively simple methodology for reducing the workload involved in the optimization process.
This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.



Published In

CGO '08: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization
April 2008, 235 pages
ISBN: 9781595939784
DOI: 10.1145/1356058
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpgpu
  2. optimization
  3. parallel computing

Qualifiers

  • Research-article

Conference

CGO '08

Acceptance Rates

CGO '08 paper acceptance rate: 21 of 66 submissions, 32%
Overall acceptance rate: 312 of 1,061 submissions, 29%

