
DOI: 10.5555/3014904.3014974

dCUDA: hardware supported overlap of computation and communication

Published: 13 November 2016

Abstract

Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.
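The model the abstract describes can be pictured as an MPI-style one-sided (remote memory access) API that is callable from inside a running kernel, with a notification telling the target that data has arrived. The sketch below is illustrative only: the `dcuda_*` names, the window type, and the signatures are assumptions that mirror the "device-side remote memory access with target notification" idea, not the actual dCUDA library interface.

```cuda
// Hypothetical dCUDA-style halo exchange inside a stencil kernel.
// All dcuda_* identifiers, dcuda_win, and TAG_HALO are illustrative
// placeholders, not the paper's real API.
__global__ void stencil_step(double *grid, int n, int left, int right,
                             dcuda_win win) {
    // Over-decomposition: far more blocks are launched than there are
    // SMs, so the hardware scheduler always has runnable work.
    compute_interior(grid, n);

    // One-sided puts: push boundary cells into the neighbors' windows
    // and notify them that the halo data has landed.
    dcuda_put_notify(win, left,  /*offset=*/n, grid + 1,     1, TAG_HALO);
    dcuda_put_notify(win, right, /*offset=*/0, grid + n - 2, 1, TAG_HALO);

    // Only this block stalls waiting for its two incoming halo
    // notifications; other resident blocks keep computing meanwhile.
    dcuda_wait_notifications(win, TAG_HALO, /*count=*/2);

    compute_boundary(grid, n);
}
```

The key point is the final wait: because the device is over-subscribed, the scheduler swaps in other blocks while one block waits on remote data, so communication overlaps with computation at cluster scale just as latency hiding already overlaps memory accesses within a single GPU.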


Cited By

  • ComP-net. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18), pages 1--13. 10.1145/3243176.3243179, November 2018.
  • GPU triggered networking for intra-kernel communications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17), pages 1--12. 10.1145/3126908.3126950, November 2017.


Published In

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN:9781467388153
  • Conference Chair: John West


Publisher

IEEE Press


Author Tags

  1. distributed memory
  2. gpu
  3. latency hiding
  4. programming model
  5. remote memory access

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

SC '16 Paper Acceptance Rate 81 of 442 submissions, 18%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%



