
DOI: 10.5555/3014904.3014974

dCUDA: hardware supported overlap of computation and communication

Published: 13 November 2016

Abstract

Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.
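The model the abstract describes can be pictured as an MPI-style one-sided (remote memory access) API that is callable from inside a running kernel, with a notification telling the target that data has arrived. The sketch below is illustrative only: the `dcuda_*` names, the window type, and the signatures are assumptions that mirror the "device-side remote memory access with target notification" idea, not the actual dCUDA library interface.

```cuda
// Hypothetical dCUDA-style halo exchange inside a stencil kernel.
// All dcuda_* identifiers, dcuda_win, and TAG_HALO are illustrative
// placeholders, not the paper's real API.
__global__ void stencil_step(double *grid, int n, int left, int right,
                             dcuda_win win) {
    // Over-decomposition: far more blocks are launched than there are
    // SMs, so the hardware scheduler always has runnable work.
    compute_interior(grid, n);

    // One-sided puts: push boundary cells into the neighbors' windows
    // and notify them that the halo data has landed.
    dcuda_put_notify(win, left,  /*offset=*/n, grid + 1,     1, TAG_HALO);
    dcuda_put_notify(win, right, /*offset=*/0, grid + n - 2, 1, TAG_HALO);

    // Only this block stalls waiting for its two incoming halo
    // notifications; other resident blocks keep computing meanwhile.
    dcuda_wait_notifications(win, TAG_HALO, /*count=*/2);

    compute_boundary(grid, n);
}
```

The key point is the final wait: because the device is over-subscribed, the scheduler swaps in other blocks while one block waits on remote data, so communication overlaps with computation at cluster scale just as latency hiding already overlaps memory accesses within a single GPU.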


Cited By

  • ComP-net. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT '18), pages 1--13. 10.1145/3243176.3243179, November 2018.
  • GPU triggered networking for intra-kernel communications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17), pages 1--12. 10.1145/3126908.3126950, November 2017.


Published In

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN:9781467388153
  • Conference Chair: John West


Publisher

IEEE Press


Author Tags

  1. distributed memory
  2. gpu
  3. latency hiding
  4. programming model
  5. remote memory access

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

SC '16 Paper Acceptance Rate 81 of 442 submissions, 18%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%



