DOI: 10.5555/1413370.1413379
SC Conference Proceedings · Research article

Adapting a message-driven parallel application to GPU-accelerated clusters

Published: 15 November 2008

Abstract

Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster.
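The overlap strategy described in the abstract, launching GPU kernels asynchronously, doing CPU-side work while they run, and collecting results on completion, can be sketched in miniature. This is an illustrative sketch only, not NAMD's implementation: NAMD uses CUDA streams and Charm++ messages, whereas here a worker thread stands in for the GPU, and all names (`gpu_kernel`, `cpu_work`, `timestep`, the patch data) are hypothetical.

```python
# Sketch of overlapping asynchronous "GPU" work with CPU computation.
# A ThreadPoolExecutor worker plays the role of the GPU; submit() is the
# asynchronous kernel launch, and Future.result() is the completion wait.
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_kernel(patch):
    # Stand-in for an expensive non-bonded force kernel on one work unit.
    time.sleep(0.01)  # simulate kernel latency
    return sum(patch)

def cpu_work(patch):
    # Stand-in for work kept on the CPU (e.g. bonded terms, integration).
    return max(patch)

def timestep(patches):
    with ThreadPoolExecutor(max_workers=1) as gpu:
        # 1. Launch all "kernels" asynchronously; calls return immediately.
        futures = [gpu.submit(gpu_kernel, p) for p in patches]
        # 2. Overlap: the CPU computes while the "GPU" is busy.
        cpu_results = [cpu_work(p) for p in patches]
        # 3. Collect GPU results as they complete.
        gpu_results = [f.result() for f in futures]
    return gpu_results, cpu_results

print(timestep([[1, 2, 3], [4, 5, 6]]))  # -> ([6, 15], [3, 6])
```

In the real message-driven setting, step 3 would not block: completion would instead trigger a message to the object that needs the forces, which is what lets communication and CPU work hide the kernel's latency.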



Published In

SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing
November 2008, 739 pages
ISBN: 9781424428359
Publisher: IEEE Press

Acceptance Rates

SC '08 paper acceptance rate: 59 of 277 submissions (21%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Cited By

  • (2021) AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics. International Journal of High Performance Computing Applications 35(5):432-451, Sep 2021. doi:10.1177/10943420211006452
  • (2019) Efficient GPU tree walks for effective distributed n-body simulations. Proceedings of the ACM International Conference on Supercomputing, pp. 24-34, Jun 2019. doi:10.1145/3330345.3330348
  • (2018) Scalable molecular dynamics with NAMD on the Summit system. IBM Journal of Research and Development 62(6):4:1-4:9, Nov 2018. doi:10.1147/JRD.2018.2888986
  • (2018) What You Should Know About NAMD and Charm++ But Were Hoping to Ignore. Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity, pp. 1-6, Jul 2018. doi:10.1145/3219104.3219134
  • (2016) dCUDA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12, Nov 2016. doi:10.5555/3014904.3014974
  • (2015) Scheduling independent tasks on multi-cores with GPU accelerators. Concurrency and Computation: Practice & Experience 27(6):1625-1638, Apr 2015. doi:10.1002/cpe.3359
  • (2014) A CPU. Proceedings of Workshop on General Purpose Processing Using GPUs, pp. 64-71, Mar 2014. doi:10.1145/2588768.2576787
  • (2014) Mapping to irregular torus topologies and other techniques for petascale biomolecular simulation. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 81-91, Nov 2014. doi:10.1109/SC.2014.12
  • (2014) Petascale Tcl with NAMD, VMD, and Swift/T. Proceedings of the First Workshop for High Performance Technical Computing in Dynamic Languages, pp. 6-17, Nov 2014. doi:10.1109/HPTCDL.2014.7
  • (2013) G-Charm. Proceedings of the 27th ACM International Conference on Supercomputing, pp. 349-358, Jun 2013. doi:10.1145/2464996.2465444
