DOI: 10.5555/2523721.2523771
research-article

Generating efficient data movement code for heterogeneous architectures with distributed-memory

Published: 07 October 2013

Abstract

Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster, are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11x to 83x over state-of-the-art, translating into a mean execution time speedup of 1.53x. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4x to 63.5x over state-of-the-art, resulting in a mean speedup of 1.55x. In addition, our scheme yields a mean speedup of 2.19x over hand-optimized UPC codes.
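The core idea the abstract describes, shipping only the data that actually flows across a partition boundary rather than whole arrays, can be illustrated with a toy sketch. This is not the paper's actual polyhedral tool: `comm_set` is a hypothetical helper that, for a 1-D block-partitioned 3-point stencil, enumerates the values one device produces that iterations on the other device read (a brute-force stand-in for the dependence-partitioning analysis the paper performs symbolically).

```python
def comm_set(src, dst, deps):
    """Indices of values produced by iterations in `src` that are read by
    iterations in `dst`, given dependence distances `deps`.
    This is the minimal set of elements `src`'s device must send."""
    src, dst = set(src), set(dst)
    return {i for i in src for d in deps if i + d in dst}

N = 1000
deps = (-1, 0, 1)                    # 3-point stencil: a[i] reads b[i-1..i+1]
dev0 = range(0, N // 2)              # iterations mapped to device 0
dev1 = range(N // 2, N)              # iterations mapped to device 1

minimal = len(comm_set(dev0, dev1, deps))  # boundary elements only
naive = len(dev0)                          # a full-block copy every time step

print(minimal, naive)                # prints "1 500"
```

For this stencil the minimal scheme sends a single boundary element per time step while a naive full-block copy sends 500, the same kind of volume gap (orders of magnitude) that the abstract reports against prior schemes. The paper computes such sets exactly for arbitrary affine loop nests using polyhedral machinery instead of enumeration.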



    Published In

    PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
    October 2013
    422 pages
    ISBN:9781479910212


    Publisher

    IEEE Press


    Author Tags

    1. communication optimization
    2. data movement
    3. distributed memory
    4. heterogeneous architectures
    5. polyhedral model


    Acceptance Rates

PACT '13 Paper Acceptance Rate: 36 of 208 submissions, 17%
Overall Acceptance Rate: 121 of 471 submissions, 26%


    Cited By

    • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3581784.3607096. Online publication date: 12-Nov-2023.
    • (2021) AutoSA. The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 93-104. DOI: 10.1145/3431920.3439292. Online publication date: 17-Feb-2021.
    • (2018) Controllers. International Journal of High Performance Computing Applications, 32:6, pp. 838-853. DOI: 10.1177/1094342017702962. Online publication date: 1-Nov-2018.
    • (2018) Attributed consistent hashing for heterogeneous storage systems. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-12. DOI: 10.1145/3243176.3243202. Online publication date: 1-Nov-2018.
    • (2017) Optimizing geometric multigrid method computation using a DSL approach. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3126908.3126968. Online publication date: 12-Nov-2017.
    • (2017) Maximizing Communication-Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations. International Journal of Parallel Programming, 45:6, pp. 1390-1416. DOI: 10.1007/s10766-016-0477-7. Online publication date: 1-Dec-2017.
    • (2017) Panda. International Journal of Parallel Programming, 45:3, pp. 711-729. DOI: 10.1007/s10766-016-0454-1. Online publication date: 1-Jun-2017.
    • (2016) Distributed Halide. ACM SIGPLAN Notices, 51:8, pp. 1-12. DOI: 10.1145/3016078.2851157. Online publication date: 27-Feb-2016.
    • (2016) Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory. ACM Transactions on Parallel Computing, 3:2, pp. 1-28. DOI: 10.1145/2948975. Online publication date: 20-Jul-2016.
    • (2016) Distributed Halide. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12. DOI: 10.1145/2851141.2851157. Online publication date: 27-Feb-2016.
