DOI: 10.5555/2523721.2523771
research-article

Generating efficient data movement code for heterogeneous architectures with distributed-memory

Published: 07 October 2013

Abstract

Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. A heterogeneous system with CPUs and multiple GPUs, or a distributed-memory cluster, are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework and lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11x to 83x over state-of-the-art, translating into a mean execution time speedup of 1.53x. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4x to 63.5x over state-of-the-art, resulting in a mean speedup of 1.55x. In addition, our scheme yields a mean speedup of 2.19x over hand-optimized UPC codes.
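The core idea the abstract describes, shipping only the data that actually flows across a partition boundary rather than whole arrays, can be illustrated with a toy sketch. This is not the paper's actual polyhedral tool: `comm_set` is a hypothetical helper that, for a 1-D block-partitioned 3-point stencil, enumerates the values one device produces that iterations on the other device read (a brute-force stand-in for the dependence-partitioning analysis the paper performs symbolically).

```python
def comm_set(src, dst, deps):
    """Indices of values produced by iterations in `src` that are read by
    iterations in `dst`, given dependence distances `deps`.
    This is the minimal set of elements `src`'s device must send."""
    src, dst = set(src), set(dst)
    return {i for i in src for d in deps if i + d in dst}

N = 1000
deps = (-1, 0, 1)                    # 3-point stencil: a[i] reads b[i-1..i+1]
dev0 = range(0, N // 2)              # iterations mapped to device 0
dev1 = range(N // 2, N)              # iterations mapped to device 1

minimal = len(comm_set(dev0, dev1, deps))  # boundary elements only
naive = len(dev0)                          # a full-block copy every time step

print(minimal, naive)                # prints "1 500"
```

For this stencil the minimal scheme sends a single boundary element per time step while a naive full-block copy sends 500, the same kind of volume gap (orders of magnitude) that the abstract reports against prior schemes. The paper computes such sets exactly for arbitrary affine loop nests using polyhedral machinery instead of enumeration.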



    Published In

    PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
    October 2013
    422 pages
    ISBN:9781479910212


    Publisher

    IEEE Press


    Author Tags

    1. communication optimization
    2. data movement
    3. distributed memory
    4. heterogeneous architectures
    5. polyhedral model


    Acceptance Rates

PACT '13 Paper Acceptance Rate: 36 of 208 submissions, 17%
Overall Acceptance Rate: 121 of 471 submissions, 26%


    Cited By

    • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3581784.3607096. Online publication date: 12-Nov-2023.
    • (2021) AutoSA. The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 93-104. DOI: 10.1145/3431920.3439292. Online publication date: 17-Feb-2021.
    • (2018) Controllers. International Journal of High Performance Computing Applications, 32:6, pp. 838-853. DOI: 10.1177/1094342017702962. Online publication date: 1-Nov-2018.
    • (2018) Attributed consistent hashing for heterogeneous storage systems. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-12. DOI: 10.1145/3243176.3243202. Online publication date: 1-Nov-2018.
    • (2017) Optimizing geometric multigrid method computation using a DSL approach. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3126908.3126968. Online publication date: 12-Nov-2017.
    • (2017) Maximizing Communication-Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations. International Journal of Parallel Programming, 45:6, pp. 1390-1416. DOI: 10.1007/s10766-016-0477-7. Online publication date: 1-Dec-2017.
    • (2017) Panda. International Journal of Parallel Programming, 45:3, pp. 711-729. DOI: 10.1007/s10766-016-0454-1. Online publication date: 1-Jun-2017.
    • (2016) Distributed Halide. ACM SIGPLAN Notices, 51:8, pp. 1-12. DOI: 10.1145/3016078.2851157. Online publication date: 27-Feb-2016.
    • (2016) Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory. ACM Transactions on Parallel Computing, 3:2, pp. 1-28. DOI: 10.1145/2948975. Online publication date: 20-Jul-2016.
    • (2016) Distributed Halide. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12. DOI: 10.1145/2851141.2851157. Online publication date: 27-Feb-2016.
