OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Authors:

Rudolf Eigenmann

SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1 - 11

https://doi.org/10.1109/SC.2010.36

Published: 13 November 2010

Abstract

General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high performance computing. The CUDA (Compute Unified Device Architecture) programming model provides improved programmability for general computing on GPGPUs. However, its unique execution model and memory model still pose significant challenges for developers of efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of the hand-coded CUDA programs.
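To make the starting point concrete, the following is a minimal sketch of the kind of OpenMP-annotated loop that a directive-based approach of this sort takes as input. It uses only the standard OpenMP `parallel for` directive. The function name `saxpy` and its parameters are hypothetical and chosen for illustration; the abstract does not list OpenMPC's own directives, so none are shown, and this is not the authors' code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * The pragma below is standard OpenMP. Each iteration is independent,
 * which is the property an OpenMP-to-GPU translator would rely on to
 * map iterations to GPU threads. Compiled without OpenMP support, the
 * pragma is ignored and the loop runs serially with the same result. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Written by hand in CUDA, the same computation would also need explicit device memory allocation, host-to-device and device-to-host copies, and a choice of thread-block size. Those are the kinds of parameters the abstract says OpenMPC exposes through high-level controls and explores during tuning.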




Information

Published In

SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

November 2010

634 pages

ISBN: 9781424475599

Conference Chair:
Barry V. Hess

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Conference

SC '10

SC '10: International Conference for High Performance Computing, Networking, Storage and Analysis

November 13 - 19, 2010

Acceptance Rates

SC '10 Paper Acceptance Rate 51 of 253 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 78
Total Downloads: 730
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 1

Reflects downloads up to 10 Nov 2024

Citations

Cited By

Bastem B, Unat D (2020). Tiling-Based Programming Model for Structured Grids on GPU Clusters. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, pp. 43-51. Online publication date: 15-Jan-2020.
https://dl.acm.org/doi/10.1145/3368474.3368485
Calore E, Gabbana A, Schifano S, Tripiccione R (2019). Optimization of lattice Boltzmann simulations on heterogeneous computers. International Journal of High Performance Computing Applications 33(1), pp. 124-139. Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.1177/1094342017703771
Wu S, Dong X, Zhang X, Zhu Z (2019). NoT. The Journal of Supercomputing 75(7), pp. 3810-3841. Online publication date: 1-Jul-2019.
https://dl.acm.org/doi/10.1007/s11227-019-02749-1
Ashcraft M, Lemon A, Penry D, Snell Q (2019). Compiler Optimization of Accelerator Data Transfers. International Journal of Parallel Programming 47(1), pp. 39-58. Online publication date: 1-Feb-2019.
https://dl.acm.org/doi/10.1007/s10766-017-0549-3
Ramos P, Souza G, Soares D, Araújo G, Pereira F, Evripidou S, Stenström P, O'Boyle M (2018). Automatic annotation of tasks in structured code. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-13. Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1145/3243176.3243200
Mendonça G, Guimarães B, Alves P, Pereira M, Araújo G, Pereira F (2017). DawnCC. ACM Transactions on Architecture and Code Optimization 14(2), pp. 1-25. Online publication date: 26-May-2017.
https://dl.acm.org/doi/10.1145/3084540
Shirako J, Hayashi A, Sarkar V, Wu P, Hack S (2017). Optimized two-level parallelization for GPU accelerators using the polyhedral model. Proceedings of the 26th International Conference on Compiler Construction, pp. 22-33. Online publication date: 5-Feb-2017.
https://dl.acm.org/doi/10.1145/3033019.3033022
Li L, Kessler C (2017). VectorPU. Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, pp. 7-12. Online publication date: 25-Jan-2017.
https://dl.acm.org/doi/10.1145/3029580.3029582
Sourouri M, Baden S, Cai X (2017). Panda. International Journal of Parallel Programming 45(3), pp. 711-729. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1007/s10766-016-0454-1
Hayashi A, Shirako J, Tiotto E, Ho R, Sarkar V, Chandrasekaran S, Juckeland G (2016). Exploring compiler optimization opportunities for the OpenMP 4.x accelerator model on a POWER8+GPU platform. Proceedings of the Third International Workshop on Accelerator Programming Using Directives, pp. 68-78. Online publication date: 13-Nov-2016.
https://dl.acm.org/doi/10.5555/3019120.3019127
