research-article

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Authors:

Naoya Maruyama,

Satoshi MatsuokaAuthors Info & Claims

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 11, Pages 1 - 12

https://doi.org/10.1145/2063384.2063398

Published: 12 November 2011 Publication History

Abstract

This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translations, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into actual implementation code in CUDA for GPU acceleration and MPI for node-level parallelization with automatic optimizations such as computation and communication overlapping. We demonstrate the feasibility of such automatic translations by implementing several structured grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show that the performance is comparable as hand-written code and good strong and weak scalability up to 256 GPUs.

References

[1]

E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. S. Jr., and S. Tobin-Hochstadt. The fortress language specification, March 2008.

[2]

K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. D. Kubiatowicz, E. A. Lee, N. Morgan, G. Necula, D. A. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. A. Yelick. The Parallel Computing Laboratory at U. C. Berkeley: A Research Agenda Based on the Berkeley View. Technical Report UCB/EECS-2008-23, EECS Department, University of California, Berkeley, Mar. 2008.

[3]

S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163--202. Birkhäuser Press, 1997.

Digital Library

[4]

H. P. Z. Bradford L. Chamberlain, David Callahan. Parallel programmability and the chapel language. International Journal of High Performance Computing Applications, 21(3):291--312, 2007.

Digital Library

[5]

W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to upc and language specification. Technical report, IDA Center for Computing Sciences, 1999.

[6]

H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun. Language virtualization for heterogeneous parallel computing. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications, OOPSLA '10, pages 835--847, 2010.

Digital Library

[7]

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. SIGPLAN Not., 40:519--538, October 2005.

Digital Library

[8]

J. M. Cohen and J. Molemake. A Fast Double Precision CFD Code Using CUDA. In 21st International Conference on Parallel Computational Fluid Dynamics (ParCFD2009), 2009.

[9]

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and autotuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 1--12, 2008.

Digital Library

[10]

M. Fowler. Domain-Specific Languages. Addison-Wesley Professional, 2010.

Digital Library

[11]

M. Frigo and S. G. Johnson. The Design and Implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, Feb. 2005.

[12]

L. Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka. Low-overhead diskless checkpoint for hybrid computing systems. In High Performance Computing (HiPC), 2010 International Conference on, pages 1--10, dec. 2010.

[13]

T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji. 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1--12, New York, NY, USA, 2009. ACM.

Digital Library

[14]

R. Himeno. Himeno benchmark. http://accc.riken.jp/HPC_e/himenobmt_e.html, July 2011.

[15]

K. Kennedy, C. Koelbel, and H. Zima. The rise and fall of High Performance Fortran: an historical object lesson. In HOPL III: Proceedings of the third ACM SIGPLAN conference on History of programming languages, New York, NY, USA, 2007. ACM.

Digital Library

[16]

T. Kim. Hardware-aware analysis and optimization of stable fluids. In Proceedings of the 2008 symposium on Interactive 3D graphics and games, I3D '08, pages 99--106, 2008.

Digital Library

[17]

C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for fortran usage. ACM Trans. Math. Softw., 5:308--323, September 1979.

Digital Library

[18]

N. Maruyama, A. Nukada, and S. Matsuoka. A high-performance fault-tolerant software framework for memory on commodity gpus. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1--12, april 2010.

[19]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pages 1--13, 2010.

Digital Library

[20]

R. W. Numrich and J. Reid. Co-array fortran for parallel programming. SIGPLAN Fortran Forum, 17:1--31, August 1998.

Digital Library

[21]

OpenCFD. OpenFoam user guide. http://www.openfoam.com/docs/, 2011.

[22]

D. A. Orchard, M. Bolingbroke, and A. Mycroft. Ypnos: declarative, parallel structured grid programming. In DAMP '10: Proceedings of the 5th ACM SIGPLAN workshop on Declarative aspects of multicore programming, pages 15--24, 2010.

Digital Library

[23]

E. Phillips and M. Fatica. Implementing the Himeno Benchmark with CUDA on GPU Clusters. In IEEE International Parallel & Distributed Processing Symposium, pages 1--10, Apr. 2010.

[24]

S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP '08, pages 73--82, New York, NY, USA, 2008. ACM.

Digital Library

[25]

H. Sakagami, H. Murai, Y. Seo, and M. Yokokawa. 14.9 TFLOPS Three-Dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator. SC Conference, 0:51+, 2002.

Digital Library

[26]

M. Schordan and D. Quinlan. A source-to-source architecture for user-defined optimizations. In L. BÃűszÃűrmÃl'nyi and P. Schojer, editors, Modular Programming Languages, volume 2789 of Lecture Notes in Computer Science, pages 214--223. Springer Berlin/Heidelberg, 2003.

[27]

T. Shimokawabe, T. Aoki, C. Muroi, J. Ishida, K. Kawano, T. Endo, A. Nukada, N. Maruyama, and S. Matsuoka. An 80-fold speedup, 15.0 tflops full GPU acceleration of non-hydrostatic weather model ASUCA production code. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pages 1--11, 2010.

Digital Library

[28]

D. Unat, X. Cai, and S. B. Baden. Mint: realizing CUDA performance in 3D stencil methods with annotated c. In Proceedings of the International Conference on Supercomputing (ICS'11), ICS '11, pages 214--224, 2011.

Digital Library

[29]

S. Venkatasubramanian, R. W. Vuduc, and N. None. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd international conference on Supercomputing, ICS '09, pages 244--255, New York, NY, USA, 2009. ACM.

Digital Library

[30]

M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE conference on Supercomputing, Supercomputing '89, pages 655--664, New York, NY, USA, 1989. ACM.

Digital Library

Cited By

Zhu FQi XZhang PFang JTang TChe YYu KXie JHuang CRen J(2024)Optimizing Stencil Computation on Multi-core DSPsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673062(679-690)Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673062
Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Bisbas GLydike ABauer EBrown NFehr MMitchell LRodriguez-Canal GJamieson MKelly PSteuwer MGrosser TTsafrir DMusuvathi MGupta RAbu-Ghazaleh N(2024)A shared compilation stack for distributed-memory parallelism in stencil DSLsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651344(38-56)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651344
Show More Cited By

Index Terms

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
      2. Language types
        Parallel programming languages

Recommendations

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of ...
Beyond parallel programming with domain specific languages
PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Today, almost all computer architectures are parallel and heterogeneous; a combination of multiple CPUs, GPUs and specialized processors. This creates a challenging problem for application developers who want to develop high performance programs without ...
Fusing convolution kernels through tiling
ARRAY 2015: Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

Image processing pipelines are continuously being developed to deduce more information about objects captured in images. To facilitate the development of such pipelines several Domain Specific Languages (DSLs) have been proposed that provide constructs ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair:
Scott Lathrop
University of Chicago
,
Program Chairs:
Jim Costa
Sandia National Laboratories
,
William Kramer
National Center for Supercomputing Applications

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 November 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Ministry of Education, Culture, Sports, Science, and Technology

Conference

SC '11

Sponsor:

SIGARCH
IEEE-CS

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Washington, Seattle

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

115
Total Citations
View Citations
809
Total Downloads

Downloads (Last 12 months)25
Downloads (Last 6 weeks)5

Reflects downloads up to 18 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhu FQi XZhang PFang JTang TChe YYu KXie JHuang CRen J(2024)Optimizing Stencil Computation on Multi-core DSPsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673062(679-690)Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673062
Chen YLi KWang YBai DWang LMa LYuan LZhang YCao TYang MLee IChabbi MSteuwer M(2024)ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor CoresProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638476(333-347)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638476
Bisbas GLydike ABauer EBrown NFehr MMitchell LRodriguez-Canal GJamieson MKelly PSteuwer MGrosser TTsafrir DMusuvathi MGupta RAbu-Ghazaleh N(2024)A shared compilation stack for distributed-memory parallelism in stencil DSLsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651344(38-56)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651344
Sun QLiu YYang HJiang ZLuan ZQian D(2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
https://doi.org/10.1109/TPDS.2023.3325630
Mohamed KMohamed K(2024)An Introduction to Heterogeneous SoC Design and Verification “A Conceptual-Level”Heterogeneous SoC Design and Verification10.1007/978-3-031-56152-8_1(1-26)Online publication date: 23-Mar-2024
https://doi.org/10.1007/978-3-031-56152-8_1
Ibrahim AAlsultan AElrabaa MEl-Maleh ATonellot T(2023)Efficient Implementation of Reverse Time Migration Seismic Imaging on FPGAsDay 2 Mon, February 20, 202310.2118/213299-MSOnline publication date: 7-Mar-2023
https://doi.org/10.2118/213299-MS
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Revisiting Temporal Blocking Stencil OptimizationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593716(251-263)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593716
Wang XLi YGuo FXu YLui J(2023)Dynamic GPU Scheduling with Multi-resource Awareness and Live Migration SupportIEEE Transactions on Cloud Computing10.1109/TCC.2023.3264242(1-16)Online publication date: 2023
https://doi.org/10.1109/TCC.2023.3264242
Mambu KCharles HKooli M(2023)Dedicated Instruction Set for Pattern-Based Data Transfers: An Experimental Validation on Systems Containing In-Memory Computing UnitsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.325834642:11(3757-3767)Online publication date: Nov-2023
https://doi.org/10.1109/TCAD.2023.3258346
Li MLiu YChen BYang HLuan ZQian D(2023)Building a domain-specific compiler for emerging processors with a reusable approachScience China Information Sciences10.1007/s11432-022-3727-667:1Online publication date: 27-Dec-2023
https://doi.org/10.1007/s11432-022-3727-6
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents