DOI: 10.1145/2063384.2063398
research-article

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Published: 12 November 2011

Abstract

This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translations, we design a small set of declarative constructs that allow the user to express stencil computations in a portable and implicitly parallel manner. Our framework translates the user-written code into actual implementation code in CUDA for GPU acceleration and MPI for node-level parallelization, with automatic optimizations such as overlapping of computation and communication. We demonstrate the feasibility of such automatic translations by implementing several structured grid applications in our framework. Experimental results on the TSUBAME2.0 GPU-based supercomputer show performance comparable to hand-written code, with good strong and weak scalability up to 256 GPUs.
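
As a rough illustration of the pattern the abstract describes, the sketch below applies a 3-point stencil to a 1D grid twice: once over the whole grid, and once with the grid split into subdomains that read "halo" cells from their neighbours. It is plain Python, not Physis syntax and not the CUDA/MPI code the framework generates. The function names `stencil_step` and `decomposed_step` are hypothetical, and the halo copy only stands in for the boundary exchange that a distributed implementation would perform with MPI.

```python
# Illustrative sketch only: hypothetical names, plain Python,
# not the Physis constructs or the generated CUDA/MPI code.

def stencil_step(grid):
    """One 3-point averaging step on a 1D grid; the two end cells stay fixed."""
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def decomposed_step(grid, parts):
    """Same step, computed subdomain by subdomain with halo cells.

    Assumes every subdomain holds at least two cells.
    """
    n = len(grid)
    bounds = [n * p // parts for p in range(parts + 1)]
    new = list(grid)
    for p in range(parts):
        lo, hi = bounds[p], bounds[p + 1]
        local = grid[lo:hi]
        assert len(local) >= 2, "subdomain too small for this sketch"

        # Stand-in for the halo exchange: fetch one cell from each neighbour.
        left_halo = grid[lo - 1] if lo > 0 else None
        right_halo = grid[hi] if hi < n else None

        # Interior cells depend only on local data, so a real implementation
        # could update them while the halo exchange is still in flight.
        for j in range(1, len(local) - 1):
            new[lo + j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0

        # Edge cells of the subdomain need the halo values.
        if left_halo is not None:
            new[lo] = (left_halo + local[0] + local[1]) / 3.0
        if right_halo is not None:
            new[hi - 1] = (local[-2] + local[-1] + right_halo) / 3.0
    return new


if __name__ == "__main__":
    grid = [float(i * i % 7) for i in range(16)]
    print(decomposed_step(grid, 4) == stencil_step(grid))
```

Splitting the update into interior cells and subdomain edge cells is what makes overlapping possible: only the edge cells have to wait for data from a neighbour.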



Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN:9781450307710
DOI:10.1145/2063384

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. application framework
  2. domain specific languages
  3. high performance computing

Qualifiers

  • Research-article

Conference

SC '11
Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 25
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 18 Nov 2024

Cited By

  • (2024) Optimizing Stencil Computation on Multi-core DSPs. Proceedings of the 53rd International Conference on Parallel Processing, pp. 679-690. DOI: 10.1145/3673038.3673062. Online publication date: 12-Aug-2024
  • (2024) ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 333-347. DOI: 10.1145/3627535.3638476. Online publication date: 2-Mar-2024
  • (2024) A shared compilation stack for distributed-memory parallelism in stencil DSLs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 38-56. DOI: 10.1145/3620666.3651344. Online publication date: 27-Apr-2024
  • (2024) Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUs. IEEE Transactions on Parallel and Distributed Systems 35:1, pp. 20-33. DOI: 10.1109/TPDS.2023.3325630. Online publication date: Jan-2024
  • (2024) An Introduction to Heterogeneous SoC Design and Verification “A Conceptual-Level”. Heterogeneous SoC Design and Verification, pp. 1-26. DOI: 10.1007/978-3-031-56152-8_1. Online publication date: 23-Mar-2024
  • (2023) Efficient Implementation of Reverse Time Migration Seismic Imaging on FPGAs. Day 2 Mon, February 20, 2023. DOI: 10.2118/213299-MS. Online publication date: 7-Mar-2023
  • (2023) Revisiting Temporal Blocking Stencil Optimizations. Proceedings of the 37th International Conference on Supercomputing, pp. 251-263. DOI: 10.1145/3577193.3593716. Online publication date: 21-Jun-2023
  • (2023) Dynamic GPU Scheduling with Multi-resource Awareness and Live Migration Support. IEEE Transactions on Cloud Computing, pp. 1-16. DOI: 10.1109/TCC.2023.3264242. Online publication date: 2023
  • (2023) Dedicated Instruction Set for Pattern-Based Data Transfers: An Experimental Validation on Systems Containing In-Memory Computing Units. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:11, pp. 3757-3767. DOI: 10.1109/TCAD.2023.3258346. Online publication date: Nov-2023
  • (2023) Building a domain-specific compiler for emerging processors with a reusable approach. Science China Information Sciences 67:1. DOI: 10.1007/s11432-022-3727-6. Online publication date: 27-Dec-2023