article

Program optimization carving for GPU computing

Authors:

Christopher I. Rodrigues,

John A. Stratton,

Sara S. Baghsorkhi,

Wen-mei W. HwuAuthors Info & Claims

Journal of Parallel and Distributed Computing, Volume 68, Issue 10

Pages 1389 - 1401

https://doi.org/10.1016/j.jpdc.2008.05.011

Published: 01 October 2008 Publication History

Abstract

Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations that is likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.

References

[1]

F. Agakov, et al. Using machine learning to focus iterative optimization, in: Proc. 4th Annual International Symposium on Code Generation and Optimization, 2006, pp. 295-305

[2]

F.E. Allen, M. Burke, P. Charles, R. Cytron, J. Ferrante, An overview of the PTRAN analysis system for multiprocessing, in: Proc. 1st International Conference on Supercomputing, 1987, pp. 194-211

Digital Library

[3]

J. Allen, K. Kennedy, Automatic loop interchange, in: Proc. ACM SIGPLAN '84 Symposium on Compiler Construction, 1984, pp. 233-246

Digital Library

[4]

Allen, J.R. and Kennedy, K., PFC: A program to convert Fortran to parallel form. In: Hwang, K. (Ed.), Supercomputers: Design and Applications, IEEE Computer Society Press, Los Alamitos, CA. pp. 186-203.

[5]

M.M. Baskaran, et al. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories, in: Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 1-10

Digital Library

[6]

I. Buck, Brook Specification v0.2, October 2003

[7]

D. Callahan, S. Carr, K. Kennedy, Improving register allocation for subscripted variables, in: Proc. SIGPLAN 1990 Conference on Program Language Design and Implementation, 1990, pp. 53-65

Digital Library

[8]

Y. Chou, B. Fahs, S. Abraham, Microarchitecture optimizations for exploiting memory-level parallelism, in: Proc. 31th Annual International Symposium on Computer Architecture, 2004, pp. 76-88

[9]

K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in: Proc. 2004 ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004, pp. 133-137

[10]

S. Ghosh, M. Martonosi, S. Malik, Precise miss analysis for program transformations with caches of arbitrary associativity, in: Proc. 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 228-239

[11]

N.K. Govindaraju, S. Larsen, J. Gray, D. Manocha, A memory model for scientific algorithms on graphics processors, in: Proc. 2006 ACM/IEEE Conference on Supercomputing, No. 89, 2006, pp. 89-99

[12]

Han, H., Rivera, G. and Tseng, C.-W., Software support for improving locality in scientific codes. In: Workshop on Compilers for Parallel Computers,

[13]

M. Haneda, P.M.W. Knijnenburg, H.A.G. Wijshoff, Automatic selection of compiler options using non-parametric inferential statistics, in: Proc. 14th International Conference on Parallel Architectures and Compilation Techniques, 2005, pp. 123-132

[14]

C. Jiang, M. Snir, Automatic tuning matrix multiplication performance on graphics hardware, in: Proc. 14th International Conference on Parallel Architecture and Compilation Techniques, 2005, pp. 185-196

[15]

D. Jimenez-Gonzalez, X. Martorell, A. Ramirez, Performance analysis of Cell Broadband Engine for high memory bandwidth applications, in: Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2007, pp. 210-219

[16]

Kennedy, K. and Allen, R., Optimizing Compilers for Modern Architectures: A Dependence-based Approach. 2002. Morgan Kaufmann Publishers, San Francisco, CA.

[17]

T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, Combined selection of tile sizes and unroll factors using iterative compilation, in: Proc. 2000 International Conference on Parallel Architectures and Compilation Techniques, 2000, pp. 237-248

[18]

D.J. Kuck, et al. The effects of program restructuring, algorithm change, and architecture choice on program performance, in: Proc. 13th International Conference on Parallel Processing, 1984, pp. 129-138

[19]

P.A. Kulkarni, D.B. Whalley, G.S. Tyson, J.W. Davidson, Evaluation heuristic optimization phase order search algorithms, in: Proc. 2007 International Symposium on Code Generation and Optimization, 2007, pp. 157-169

[20]

J. Nickolls, I. Buck, NVIDIA CUDA software and GPU parallel computing architecture, Microprocessor Forum, May 2007

[21]

Pham, D., The design and implementation of a first-generation CELL processor. In: IEEE International Solid-State Circuits Conference,

[22]

Püschel, M., SPIRAL: A generator for platform-adapted libraries of signal processing algorithms. Journal of High Performance Computing and Applications. v18 i1. 21-45.

[23]

Ryoo, S., Program optimization study on a 128-core GPU. In: The First Workshop on General Purpose Processing on Graphics Processing Units,

[24]

S. Ryoo, Program optimization strategies for many-core processors, Ph.D. thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 2008

[25]

S. Ryoo, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, in: Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82

Digital Library

[26]

J.W. Sias, A systematic approach to delivering instruction-level parallelism in EPIC systems, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005

[27]

D. Tarditi, S. Puri, J. Oglesby, Accelerator: Using data parallelism to program GPUs for general-purpose uses, in: Proc. 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325-335

[28]

S. Triantafyllis, M. Vachharajani, N. Vachharajani, D.I. August, Compiler optimization-space exploration, in: Proc. 2003 International Symposium on Code Generation and Optimization, 2003, pp. 204-215

[29]

D.N. Truong, F. Bodin, A. Seznec, Improving cache behavior of dynamically allocated data structures, in: Proc. Seventh International Conference on Parallel Architectures and Compilation Techniques, 1998, pp. 322+

[30]

M.E. Wolf, D.E. Maydan, D.-K. Chen, Combining loop transformations considering caches and scheduling, in: Proc. 29th Annual ACM/IEEE International Symposium on Microarchitecture, 1996, pp. 274-286

Digital Library

[31]

M. Wolfe, Iteration space tiling for memory hierarchies, in: Proc. Third SIAM Conference on Parallel Processing for Scientific Computing, 1987, pp. 357-361

Digital Library

[32]

Y. Yamada, J. Gyllenhaal, G. Haab, W.W. Hwu, Data relocation and prefetching for large data sets, in: Proc. 27th Annual ACM/IEEE International Symposium on Microarchitecture, 1994, pp. 118-127

[33]

M. Zhao, B. Childers, M.L. Soffa, Predicting the impact of optimizations for embedded systems, in: Proc. 2003 Conference on Languages, Compilers, and Tools for Embedded Systems, 2003, pp. 1-11

[34]

Zima, H. and Chapman, B., Supercompilers for Parallel and Vector Computers. 1991. Addison-Wesley Publishing Company, Reading, MA.

Cited By

He GYoneki EDoerfert JGrosser TLeather HSadayappan P(2025)CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement LearningProceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3696443.3708943(493-506)Online publication date: 1-Mar-2025
https://dl.acm.org/doi/10.1145/3696443.3708943
Accordi GGadioli DVitali ECrisci LCosenza BBeccari APalermo G(2024)Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUsThe Journal of Supercomputing10.1007/s11227-023-05884-y80:8(11798-11815)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11227-023-05884-y
Hijma PHeldens SSclocco Avan Werkhoven BBal H(2023)Optimization Techniques for GPU ProgrammingACM Computing Surveys10.1145/357063855:11(1-81)Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638
Show More Cited By

Index Terms

Program optimization carving for GPU computing

Recommendations

OpenCL implementation of particle swarm optimization: a comparison between multi-core CPU and GPU performances
EvoApplications'12: Proceedings of the 2012t European conference on Applications of Evolutionary Computation

GPU-based parallel implementations of algorithms are usually compared against the corresponding sequential versions compiled for a single-core CPU machine, without taking advantage of the multi-core and SIMD capabilities of modern processors. This leads ...
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory ...
Use of GPU computing for uncertainty quantification in computational mechanics: A case study

Graphics processing units GPUs are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Parallel and Distributed Computing

Journal of Parallel and Distributed Computing Volume 68, Issue 10

October, 2008

99 pages

ISSN:0743-7315

Issue’s Table of Contents

Copyright © Elsevier Inc. © 2008.

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 October 2008

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

35
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

He GYoneki EDoerfert JGrosser TLeather HSadayappan P(2025)CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement LearningProceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3696443.3708943(493-506)Online publication date: 1-Mar-2025
https://dl.acm.org/doi/10.1145/3696443.3708943
Accordi GGadioli DVitali ECrisci LCosenza BBeccari APalermo G(2024)Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUsThe Journal of Supercomputing10.1007/s11227-023-05884-y80:8(11798-11815)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11227-023-05884-y
Hijma PHeldens SSclocco Avan Werkhoven BBal H(2023)Optimization Techniques for GPU ProgrammingACM Computing Surveys10.1145/357063855:11(1-81)Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638
Larhlimi AIlloussamen YMestari M(2018)GPU Neural Mutli Objective Solver for Optical Burst GroomingProcedia Computer Science10.1016/j.procs.2018.01.115127:C(200-208)Online publication date: 1-May-2018
https://dl.acm.org/doi/10.1016/j.procs.2018.01.115
Djenouri YDjenouri DHabbas ZBelhadi A(2018)How to exploit high performance computing in population-based metaheuristics for solving association rule mining problemDistributed and Parallel Databases10.1007/s10619-018-7218-436:2(369-397)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.1007/s10619-018-7218-4
Vijaykumar NHsieh KPekhimenko GKhan SShrestha AGhose SJog AGibbons PMutlu OHsu WYang CLipasti MLee H(2016)ZoruaThe 49th Annual IEEE/ACM International Symposium on Microarchitecture10.5555/3195638.3195656(1-14)Online publication date: 15-Oct-2016
https://dl.acm.org/doi/10.5555/3195638.3195656
Kim JJo GJung JKim JLee J(2016)A distributed OpenCL framework using redundant computation and data replicationACM SIGPLAN Notices10.1145/2980983.290809451:6(553-569)Online publication date: 2-Jun-2016
https://dl.acm.org/doi/10.1145/2980983.2908094
Kim JJo GJung JKim JLee JKrintz CBerger E(2016)A distributed OpenCL framework using redundant computation and data replicationProceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/2908080.2908094(553-569)Online publication date: 2-Jun-2016
https://dl.acm.org/doi/10.1145/2908080.2908094
Cai YWang GLi GWang H(2015)A high performance crashworthiness simulation system based on GPUAdvances in Engineering Software10.1016/j.advengsoft.2015.04.00386:C(29-38)Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1016/j.advengsoft.2015.04.003
Park SCao YWatson LQuek F(2015)Performance analysis of a novel GPU computation-to-core mapping scheme for robust facet image modelingJournal of Real-Time Image Processing10.1007/s11554-012-0272-710:3(485-500)Online publication date: 1-Sep-2015
https://dl.acm.org/doi/10.1007/s11554-012-0272-7
Show More Cited By

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents