Compiler and runtime support for enabling reduction computations on heterogeneous systems

Published: 01 April 2012

Abstract

Heterogeneous computing platforms are a clear trend and have attracted considerable attention: it is now common for a desktop or notebook computer to come equipped with both a multi-core CPU and a graphics processing unit (GPU). Exploiting the full computational power of such architectures, that is, using the multi-core CPU and the GPU simultaneously, starting from a high-level API, is a critical challenge, and a simple way for programmers to realize the full potential of today's heterogeneous machines is highly desirable. This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting from simple C functions with added annotations, we automatically generate the middleware API code for the multi-core CPU as well as CUDA code for the GPU, so that both can be exploited simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, k-means clustering and principal component analysis, show that by effectively harnessing the heterogeneous architecture we can achieve significantly higher performance than with either the GPU or the multi-core CPU alone. For k-means clustering, the heterogeneous version with eight CPU cores and a GPU achieved a speedup of about 32.09x over a single CPU thread, and a performance gain of about 60% over the faster of the CPU-only and GPU-only executions. For principal component analysis, the heterogeneous version achieved a speedup of 10.4x over the single-thread CPU version, and a performance gain of about 63.8% over the faster of the CPU-only and GPU-only versions. Copyright © 2011 John Wiley & Sons, Ltd.
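To make the programming model concrete, the following is a minimal sketch, in plain C, of the kind of generalized-reduction function such a framework could take as input: the assignment-and-accumulation step of k-means clustering. The updates to the reduction object are associative and commutative, which is what lets a runtime hand disjoint chunks of the main loop to CPU threads and to the GPU and merge private copies afterwards. The annotation comment, type names, and merge helper below are illustrative assumptions; the abstract does not specify the paper's actual annotation syntax or API.

    /* Hypothetical sketch: a sequential C generalized reduction (k-means step).
       The annotation below is an assumed placeholder, not the paper's syntax. */
    #include <stdio.h>
    #include <string.h>
    #include <float.h>

    #define N 8   /* number of data points  */
    #define K 2   /* number of clusters     */
    #define D 2   /* dimensions per point   */

    /* Reduction object: per-cluster running sums and counts. Updates are
       associative and commutative, so private copies can be merged later. */
    typedef struct {
        double sum[K][D];
        long   count[K];
    } Reduction;

    /* @reduction_loop -- assumed annotation marking the partitionable loop */
    void kmeans_step(double pts[N][D], double centers[K][D], Reduction *r) {
        for (int i = 0; i < N; i++) {           /* chunks of i go to CPU or GPU */
            int best = 0;
            double bestDist = DBL_MAX;
            for (int k = 0; k < K; k++) {       /* find the nearest center */
                double d2 = 0.0;
                for (int j = 0; j < D; j++) {
                    double diff = pts[i][j] - centers[k][j];
                    d2 += diff * diff;
                }
                if (d2 < bestDist) { bestDist = d2; best = k; }
            }
            for (int j = 0; j < D; j++)         /* accumulate into the object */
                r->sum[best][j] += pts[i][j];
            r->count[best]++;
        }
    }

    /* The combine step a runtime would apply after CPU and GPU chunks finish. */
    void merge(Reduction *into, const Reduction *from) {
        for (int k = 0; k < K; k++) {
            into->count[k] += from->count[k];
            for (int j = 0; j < D; j++)
                into->sum[k][j] += from->sum[k][j];
        }
    }

    int main(void) {
        double pts[N][D] = {{0,0},{0,1},{1,0},{1,1},{8,8},{8,9},{9,8},{9,9}};
        double centers[K][D] = {{0,0},{9,9}};
        Reduction r;
        memset(&r, 0, sizeof r);
        kmeans_step(pts, centers, &r);
        printf("cluster sizes: %ld %ld\n", r.count[0], r.count[1]);
        return 0;
    }

Under this structure, dynamic work partitioning reduces to a chunk-dispatch loop: idle CPU workers and the GPU repeatedly claim the next unprocessed range of points, and the final result is obtained by merging their private Reduction objects, as merge() illustrates.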



Published In

Concurrency and Computation: Practice & Experience, Volume 24, Issue 5 (April 2012), 111 pages
ISSN: 1532-0626; EISSN: 1532-0634
Publisher: John Wiley and Sons Ltd., United Kingdom
