Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Program optimization carving for GPU computing

Published: 01 October 2008 Publication History

Abstract

Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations that is likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.

References

[1]
F. Agakov, et al. Using machine learning to focus iterative optimization, in: Proc. 4th Annual International Symposium on Code Generation and Optimization, 2006, pp. 295-305
[2]
F.E. Allen, M. Burke, P. Charles, R. Cytron, J. Ferrante, An overview of the PTRAN analysis system for multiprocessing, in: Proc. 1st International Conference on Supercomputing, 1987, pp. 194-211
[3]
J. Allen, K. Kennedy, Automatic loop interchange, in: Proc. ACM SIGPLAN '84 Symposium on Compiler Construction, 1984, pp. 233-246
[4]
Allen, J.R. and Kennedy, K., PFC: A program to convert Fortran to parallel form. In: Hwang, K. (Ed.), Supercomputers: Design and Applications, IEEE Computer Society Press, Los Alamitos, CA. pp. 186-203.
[5]
M.M. Baskaran, et al. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories, in: Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 1-10
[6]
I. Buck, Brook Specification v0.2, October 2003
[7]
D. Callahan, S. Carr, K. Kennedy, Improving register allocation for subscripted variables, in: Proc. SIGPLAN 1990 Conference on Program Language Design and Implementation, 1990, pp. 53-65
[8]
Y. Chou, B. Fahs, S. Abraham, Microarchitecture optimizations for exploiting memory-level parallelism, in: Proc. 31th Annual International Symposium on Computer Architecture, 2004, pp. 76-88
[9]
K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in: Proc. 2004 ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004, pp. 133-137
[10]
S. Ghosh, M. Martonosi, S. Malik, Precise miss analysis for program transformations with caches of arbitrary associativity, in: Proc. 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 228-239
[11]
N.K. Govindaraju, S. Larsen, J. Gray, D. Manocha, A memory model for scientific algorithms on graphics processors, in: Proc. 2006 ACM/IEEE Conference on Supercomputing, No. 89, 2006, pp. 89-99
[12]
Han, H., Rivera, G. and Tseng, C.-W., Software support for improving locality in scientific codes. In: Workshop on Compilers for Parallel Computers,
[13]
M. Haneda, P.M.W. Knijnenburg, H.A.G. Wijshoff, Automatic selection of compiler options using non-parametric inferential statistics, in: Proc. 14th International Conference on Parallel Architectures and Compilation Techniques, 2005, pp. 123-132
[14]
C. Jiang, M. Snir, Automatic tuning matrix multiplication performance on graphics hardware, in: Proc. 14th International Conference on Parallel Architecture and Compilation Techniques, 2005, pp. 185-196
[15]
D. Jimenez-Gonzalez, X. Martorell, A. Ramirez, Performance analysis of Cell Broadband Engine for high memory bandwidth applications, in: Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2007, pp. 210-219
[16]
Kennedy, K. and Allen, R., Optimizing Compilers for Modern Architectures: A Dependence-based Approach. 2002. Morgan Kaufmann Publishers, San Francisco, CA.
[17]
T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, Combined selection of tile sizes and unroll factors using iterative compilation, in: Proc. 2000 International Conference on Parallel Architectures and Compilation Techniques, 2000, pp. 237-248
[18]
D.J. Kuck, et al. The effects of program restructuring, algorithm change, and architecture choice on program performance, in: Proc. 13th International Conference on Parallel Processing, 1984, pp. 129-138
[19]
P.A. Kulkarni, D.B. Whalley, G.S. Tyson, J.W. Davidson, Evaluation heuristic optimization phase order search algorithms, in: Proc. 2007 International Symposium on Code Generation and Optimization, 2007, pp. 157-169
[20]
J. Nickolls, I. Buck, NVIDIA CUDA software and GPU parallel computing architecture, Microprocessor Forum, May 2007
[21]
Pham, D., The design and implementation of a first-generation CELL processor. In: IEEE International Solid-State Circuits Conference,
[22]
Püschel, M., SPIRAL: A generator for platform-adapted libraries of signal processing algorithms. Journal of High Performance Computing and Applications. v18 i1. 21-45.
[23]
Ryoo, S., Program optimization study on a 128-core GPU. In: The First Workshop on General Purpose Processing on Graphics Processing Units,
[24]
S. Ryoo, Program optimization strategies for many-core processors, Ph.D. thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 2008
[25]
S. Ryoo, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, in: Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82
[26]
J.W. Sias, A systematic approach to delivering instruction-level parallelism in EPIC systems, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005
[27]
D. Tarditi, S. Puri, J. Oglesby, Accelerator: Using data parallelism to program GPUs for general-purpose uses, in: Proc. 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325-335
[28]
S. Triantafyllis, M. Vachharajani, N. Vachharajani, D.I. August, Compiler optimization-space exploration, in: Proc. 2003 International Symposium on Code Generation and Optimization, 2003, pp. 204-215
[29]
D.N. Truong, F. Bodin, A. Seznec, Improving cache behavior of dynamically allocated data structures, in: Proc. Seventh International Conference on Parallel Architectures and Compilation Techniques, 1998, pp. 322+
[30]
M.E. Wolf, D.E. Maydan, D.-K. Chen, Combining loop transformations considering caches and scheduling, in: Proc. 29th Annual ACM/IEEE International Symposium on Microarchitecture, 1996, pp. 274-286
[31]
M. Wolfe, Iteration space tiling for memory hierarchies, in: Proc. Third SIAM Conference on Parallel Processing for Scientific Computing, 1987, pp. 357-361
[32]
Y. Yamada, J. Gyllenhaal, G. Haab, W.W. Hwu, Data relocation and prefetching for large data sets, in: Proc. 27th Annual ACM/IEEE International Symposium on Microarchitecture, 1994, pp. 118-127
[33]
M. Zhao, B. Childers, M.L. Soffa, Predicting the impact of optimizations for embedded systems, in: Proc. 2003 Conference on Languages, Compilers, and Tools for Embedded Systems, 2003, pp. 1-11
[34]
Zima, H. and Chapman, B., Supercompilers for Parallel and Vector Computers. 1991. Addison-Wesley Publishing Company, Reading, MA.

Cited By

View all
  • (2025)CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement LearningProceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3696443.3708943(493-506)Online publication date: 1-Mar-2025
  • (2024)Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUsThe Journal of Supercomputing10.1007/s11227-023-05884-y80:8(11798-11815)Online publication date: 1-May-2024
  • (2023)Optimization Techniques for GPU ProgrammingACM Computing Surveys10.1145/357063855:11(1-81)Online publication date: 16-Mar-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Parallel and Distributed Computing
Journal of Parallel and Distributed Computing  Volume 68, Issue 10
October, 2008
99 pages

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 October 2008

Author Tags

  1. GPU computing
  2. Optimization space exploration
  3. Parallel computing

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement LearningProceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3696443.3708943(493-506)Online publication date: 1-Mar-2025
  • (2024)Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUsThe Journal of Supercomputing10.1007/s11227-023-05884-y80:8(11798-11815)Online publication date: 1-May-2024
  • (2023)Optimization Techniques for GPU ProgrammingACM Computing Surveys10.1145/357063855:11(1-81)Online publication date: 16-Mar-2023
  • (2018)GPU Neural Mutli Objective Solver for Optical Burst GroomingProcedia Computer Science10.1016/j.procs.2018.01.115127:C(200-208)Online publication date: 1-May-2018
  • (2018)How to exploit high performance computing in population-based metaheuristics for solving association rule mining problemDistributed and Parallel Databases10.1007/s10619-018-7218-436:2(369-397)Online publication date: 1-Jun-2018
  • (2016)ZoruaThe 49th Annual IEEE/ACM International Symposium on Microarchitecture10.5555/3195638.3195656(1-14)Online publication date: 15-Oct-2016
  • (2016)A distributed OpenCL framework using redundant computation and data replicationACM SIGPLAN Notices10.1145/2980983.290809451:6(553-569)Online publication date: 2-Jun-2016
  • (2016)A distributed OpenCL framework using redundant computation and data replicationProceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/2908080.2908094(553-569)Online publication date: 2-Jun-2016
  • (2015)A high performance crashworthiness simulation system based on GPUAdvances in Engineering Software10.1016/j.advengsoft.2015.04.00386:C(29-38)Online publication date: 1-Aug-2015
  • (2015)Performance analysis of a novel GPU computation-to-core mapping scheme for robust facet image modelingJournal of Real-Time Image Processing10.1007/s11554-012-0272-710:3(485-500)Online publication date: 1-Sep-2015
  • Show More Cited By

View Options

View options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media