Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Authors:

Rajkishore Barik,

Brian T. Lewis,

Tatiana Shpeisman,

Ali-Reza Adl-Tabatabai

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 33 - 43

https://doi.org/10.1145/2581122.2544165

Published: 15 February 2014

Abstract

There is growing interest in using GPUs to accelerate general-purpose computation since they offer the potential of massive parallelism with reduced energy consumption. This interest has been encouraged by the ubiquity of integrated processors that combine a GPU and CPU on the same die, lowering the cost of offloading work to the GPU. However, while the majority of effort has focused on GPU acceleration of regular applications, relatively little is known about the behavior of irregular applications on GPUs. These applications are expected to perform poorly on GPUs without major software engineering effort. We present a compiler framework with support for C++ features that enables GPU acceleration of a wide range of C++ applications with minimal changes. This framework, Concord, includes a low-cost, software SVM implementation that permits seamless sharing of pointer-containing data structures between the CPU and GPU. It also includes compiler optimizations to improve irregular application performance on GPUs. Using Concord, we ran nine irregular C++ programs on two computer systems containing Intel 4th Generation Core processors. One system is an Ultrabook with an integrated HD Graphics 5000 GPU, and the other system is a desktop with an integrated HD Graphics 4600 GPU. The nine applications are pointer-intensive and operate on irregular data structures such as trees and graphs; they include face detection, BTree, single-source shortest path, soft-body physics simulation, and breadth-first search. Our results show that Concord acceleration using the GPU improves energy efficiency by up to 6.04× on the Ultrabook and 3.52× on the desktop over multicore-CPU execution.
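The abstract does not say how Concord's software SVM is implemented. The sketch below is an illustration only: it shows one generic way that pointer-containing data can be read through a second view of the same region, by rebasing each stored pointer by its byte offset from the region base. All names here (`Node`, `to_device_view`, `sum_list_in_device_view`, `demo_sum`) are hypothetical, and a byte-for-byte copy on the CPU stands in for the GPU's view of shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical node type: `next` holds an ordinary CPU virtual address.
struct Node {
    int value;
    Node* next;
};

// Hypothetical helper (an assumption, not Concord's API): rebase a CPU pointer
// into the other view of the region. Null stays null; any other pointer keeps
// its byte offset from the region base.
inline Node* to_device_view(const Node* cpu_ptr,
                            const unsigned char* cpu_base,
                            unsigned char* dev_base) {
    if (cpu_ptr == nullptr) return nullptr;
    std::ptrdiff_t offset =
        reinterpret_cast<const unsigned char*>(cpu_ptr) - cpu_base;
    assert(offset >= 0);  // the pointer must lie inside the shared region
    return reinterpret_cast<Node*>(dev_base + offset);
}

// Walk the list through the second view. Every stored `next` is still a CPU
// address, so it is translated before it is dereferenced.
inline int sum_list_in_device_view(const Node* cpu_head,
                                   const unsigned char* cpu_base,
                                   unsigned char* dev_base) {
    int sum = 0;
    for (Node* n = to_device_view(cpu_head, cpu_base, dev_base); n != nullptr;
         n = to_device_view(n->next, cpu_base, dev_base)) {
        sum += n->value;
    }
    return sum;
}

// Build a four-node list in a "CPU view", mirror it at a different address,
// and sum it through the mirrored view.
inline int demo_sum() {
    constexpr std::size_t kCount = 4;
    std::vector<Node> cpu_region(kCount);
    for (std::size_t i = 0; i < kCount; ++i) {
        cpu_region[i].value = static_cast<int>(i) + 1;
        cpu_region[i].next = (i + 1 < kCount) ? &cpu_region[i + 1] : nullptr;
    }
    // Stand-in for the device's mapping of the same memory (assumption).
    std::vector<Node> dev_region(kCount);
    std::memcpy(dev_region.data(), cpu_region.data(), kCount * sizeof(Node));
    return sum_list_in_device_view(
        &cpu_region[0],
        reinterpret_cast<const unsigned char*>(cpu_region.data()),
        reinterpret_cast<unsigned char*>(dev_region.data()));
}
```

Calling `demo_sum()` walks the values 1 through 4 in the mirrored region. The point of the design is that the data structure itself is never rewritten: pointers stay in CPU form and are translated only at the moment of use.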

Index Terms

Efficient Mapping of Irregular C++ Applications to Integrated GPUs
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        1. Concurrent programming languages

Information & Contributors

Information

Published In

CGO '14: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 2014

328 pages

ISBN: 9781450326704

DOI: 10.1145/2581122

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

CGO '14

CGO '14: 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 15 - 19, 2014

Orlando, FL, USA

Acceptance Rates

CGO '14 Paper Acceptance Rate 29 of 100 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 21
Total Downloads: 651
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 2

Reflects downloads up to 24 Nov 2024

Citations

Cited By

Zhang M, Alawneh A, Rogers T, Sherwood T, Berger E, Kozyrakis C (2021). Judging a type by its pointer: optimizing GPU virtual functions. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 241-254. DOI: 10.1145/3445814.3446734. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3445814.3446734
Sanzharov V, Frolov V, Pavlov I (2019). Restricted Extensions for GPU Photo-realistic Renderer. GraphiCon'2019 Proceedings. Volume 2, pp. 37-42. DOI: 10.30987/graphicon-2019-2-37-42. Online publication date: 5-Nov-2019.
https://doi.org/10.30987/graphicon-2019-2-37-42
Zhang F, Liu W, Feng N, Zhai J, Du X (2019). Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors. CCF Transactions on High Performance Computing 1:2, pp. 131-143. DOI: 10.1007/s42514-019-00008-6. Online publication date: 12-Jun-2019.
https://doi.org/10.1007/s42514-019-00008-6
Barik R, Shpeisman T, Rong H, Hu C, Lee V, Anderson T, Henry G, Liu H, Wu Y, Petersen P, Lowney G (2019). Mozart: Efficient Composition of Library Functions for Heterogeneous Execution. Languages and Compilers for Parallel Computing, pp. 182-202. DOI: 10.1007/978-3-030-35225-7_13. Online publication date: 15-Nov-2019.
https://doi.org/10.1007/978-3-030-35225-7_13
Farooqui N, Roy I, Chen Y, Talwar V, Barik R, Lewis B, Shpeisman T, Schwan K (2018). Accelerating Data Analytics on Integrated GPU Platforms via Runtime Specialization. International Journal of Parallel Programming 46:2, pp. 336-375. DOI: 10.1007/s10766-016-0482-x. Online publication date: 1-Apr-2018.
https://dl.acm.org/doi/10.1007/s10766-016-0482-x
Zhang F, Wu B, Zhai J, He B, Chen W, Reddi V, Smith A, Tang L (2017). FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp. 27-38. DOI: 10.5555/3049832.3049836. Online publication date: 4-Feb-2017.
https://dl.acm.org/doi/10.5555/3049832.3049836
Dashti M, Fedorova A (2017). Analyzing memory management methods on integrated CPU-GPU systems. ACM SIGPLAN Notices 52:9, pp. 59-69. DOI: 10.1145/3156685.3092256. Online publication date: 18-Jun-2017.
https://dl.acm.org/doi/10.1145/3156685.3092256
El Hajj I, Jablin T, Milojicic D, Hwu W (2017). SAVI objects: sharing and virtuality incorporated. Proceedings of the ACM on Programming Languages 1:OOPSLA, pp. 1-24. DOI: 10.1145/3133869. Online publication date: 12-Oct-2017.
https://dl.acm.org/doi/10.1145/3133869
Dashti M, Fedorova A, Kirsch C, Titzer B (2017). Analyzing memory management methods on integrated CPU-GPU systems. Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, pp. 59-69. DOI: 10.1145/3092255.3092256. Online publication date: 18-Jun-2017.
https://dl.acm.org/doi/10.1145/3092255.3092256
Zhang F, Zhai J, He B, Zhang S, Chen W (2017). Understanding Co-Running Behaviors on Integrated CPU/GPU Architectures. IEEE Transactions on Parallel and Distributed Systems 28:3, pp. 905-918. DOI: 10.1109/TPDS.2016.2586074. Online publication date: 1-Mar-2017.
https://dl.acm.org/doi/10.1109/TPDS.2016.2586074
