
On the efficacy of GPU-integrated MPI for scientific applications

Published: 17 June 2013. DOI: 10.1145/2493123.2462915

Abstract

Scientific computing applications are quickly adapting to leverage the massive parallelism of GPUs in large-scale clusters. However, current hybrid programming models require application developers to explicitly manage the disjoint host and GPU memories, which reduces both efficiency and productivity. Consequently, GPU-integrated MPI solutions, such as MPI-ACC and MVAPICH2-GPU, have been developed to provide unified programming interfaces and optimized implementations for end-to-end data communication among CPUs and GPUs. To date, however, there has been no in-depth characterization of the new optimization spaces or of the productivity impact of such GPU-integrated communication systems on scientific applications.
In this paper, we study the efficacy of GPU-integrated MPI on scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We use MPI-ACC as an example implementation and demonstrate how the programmer can seamlessly choose either the CPU or the GPU as the logical communication endpoint, depending on the application's computational requirements. MPI-ACC also encourages programmers to explore novel application-specific optimizations, such as internode CPU-GPU communication with concurrent CPU-GPU computation, which can improve overall cluster utilization. Furthermore, MPI-ACC internally implements scalable memory management techniques, decoupling the low-level memory optimizations from the applications and making them scalable and portable across several architectures. Experimental results from a state-of-the-art cluster with hundreds of GPUs show that the new MPI-ACC-driven application-specific optimizations can improve the performance of an epidemiology simulation by up to 61.6% and that of a seismology modeling application by up to 44%, compared with traditional hybrid MPI+GPU implementations. We conclude that GPU-integrated MPI significantly enhances programmer productivity and has the potential to improve the performance and portability of scientific applications, a significant step toward making GPUs 'first-class citizens' of hybrid CPU-GPU clusters.
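
As a concrete illustration of the interface difference the abstract describes, the following sketch contrasts the traditional hybrid MPI+GPU pattern, in which the application stages GPU data through a host buffer around every MPI call, with a GPU-integrated call that treats the device buffer as the logical communication endpoint. This is a minimal sketch under stated assumptions, not code from the paper: the buffer names and message size are invented, it requires an MPI library that accepts device pointers (in the style of MVAPICH2-GPU), and MPI-ACC's actual mechanism for identifying GPU buffers differs in its details.

    /* Hedged sketch (not from the paper): manual host staging vs. a
     * GPU-integrated MPI call that takes the device pointer directly.
     * Requires a CUDA-aware MPI library; names and sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)   /* hypothetical message size: 1M doubles */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf;                                   /* data resides on the GPU */
        cudaMalloc((void **)&d_buf, N * sizeof(double));

        if (rank == 0) {
            /* Traditional hybrid MPI+GPU: the application stages GPU data
             * through a host buffer before every send. */
            double *h_buf = (double *)malloc(N * sizeof(double));
            cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            free(h_buf);
        } else if (rank == 1) {
            /* GPU-integrated MPI: the device buffer is the logical endpoint;
             * staging, pinning, and pipelining happen inside the library. */
            MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

The productivity argument is visible even in this toy example: the staging buffer and the explicit cudaMemcpy disappear from application code, while the library remains free to apply the same or better optimizations internally.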


Reviews

Chris Lupo

In high-performance computing (HPC) environments, heterogeneous systems comprising distributed clusters of multicore nodes with accelerators such as graphics processing units (GPUs) are becoming the norm. Robust and mature programming models exist for distributed computing, as well as for GPU computing; only recently, however, have hybrid programming models been developed that integrate GPU computing capabilities with explicit message passing. Aji et al. characterize the performance and productivity of a specific GPU-integrated message passing interface (MPI) framework, MPI-ACC, in two scientific computing applications: an epidemiology simulation and a seismology modeling application. The primary contribution of their work is a detailed case study of these two applications in which a basic, non-integrated MPI+GPU model is compared with the GPU-integrated MPI-ACC framework.

The authors describe the performance effects of using each cluster node's central processing unit (CPU) concurrently with the node's GPU, rather than using the GPU exclusively. They also evaluate various optimizations, such as data communication patterns that reduce communication overhead and data partitioning that increases concurrency and maximizes GPU memory bandwidth. They evaluate these applications using HPCToolkit and find that the GPU-integrated MPI framework generally outperforms the base MPI+GPU implementations for both applications. The results also show that profiling tools such as HPCToolkit can expose problem areas that, when solved, lead to significant performance improvements.

This paper is thorough and well written, and should be of interest to application developers looking for a detailed analysis of how to use a GPU-integrated MPI framework to build and optimize scientific applications. The authors do not go into detail on GPU kernel implementations; rather, they focus on the interaction between message passing with MPI and the CPU interface to the GPU. MPI-ACC is the only model discussed, so it would be interesting to learn how the techniques applied in this paper translate to other GPU-integrated MPI frameworks.

Online Computing Reviews Service
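
The concurrency optimization described above, using the CPU and the GPU of each node together and hiding internode communication behind GPU work, is essentially a pipelining pattern: while one chunk of results is in flight over MPI, the GPU is already computing the next chunk. The sketch below shows that pattern with a nonblocking send and an asynchronous CUDA stream. It is a hedged illustration only: the compute_chunk kernel, the chunk sizes, and the two-rank producer/consumer setup are hypothetical stand-ins, not the paper's epidemiology or seismology code.

    /* Hedged sketch of pipelined GPU compute + internode MPI communication.
     * Run with at least two ranks; all names and sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK  (1 << 18)   /* hypothetical chunk size (doubles) */
    #define CHUNKS 4

    /* Hypothetical stand-in for the application's per-chunk GPU computation. */
    __global__ void compute_chunk(double *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0 + 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_data, *h_stage;
        cudaMalloc((void **)&d_data, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaMemset(d_data, 0, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaMallocHost((void **)&h_stage, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        if (rank == 0) {                        /* producer: compute, then ship */
            MPI_Request reqs[CHUNKS];
            for (int c = 0; c < CHUNKS; ++c) {
                double *d_chunk = d_data + (size_t)c * CHUNK;
                double *h_chunk = h_stage + (size_t)c * CHUNK;

                /* Compute chunk c and drain it to pinned host memory. */
                compute_chunk<<<(CHUNK + 255) / 256, 256, 0, stream>>>(d_chunk, CHUNK);
                cudaMemcpyAsync(h_chunk, d_chunk, CHUNK * sizeof(double),
                                cudaMemcpyDeviceToHost, stream);
                cudaStreamSynchronize(stream);

                /* Nonblocking send: chunk c travels while the GPU starts c+1. */
                MPI_Isend(h_chunk, CHUNK, MPI_DOUBLE, 1, c, MPI_COMM_WORLD, &reqs[c]);
            }
            MPI_Waitall(CHUNKS, reqs, MPI_STATUSES_IGNORE);
        } else if (rank == 1) {                 /* consumer: receive chunk by chunk */
            for (int c = 0; c < CHUNKS; ++c)
                MPI_Recv(h_stage + (size_t)c * CHUNK, CHUNK, MPI_DOUBLE, 0, c,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaStreamDestroy(stream);
        cudaFreeHost(h_stage);
        cudaFree(d_data);
        MPI_Finalize();
        return 0;
    }

With a GPU-integrated MPI such as MPI-ACC, the explicit cudaMemcpyAsync staging step could be dropped and d_chunk handed to MPI_Isend directly, with the library performing equivalent pipelining internally.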


Information & Contributors

Information

Published In

HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN:9781450319102
DOI:10.1145/2493123
General Chairs: Manish Parashar, Jon Weissman
Program Chairs: Dick Epema, Renato Figueiredo

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 June 2013


Author Tags

  1. GPGPU
  2. MPI
  3. MPI-ACC
  4. computational epidemiology
  5. seismology

Qualifiers

  • Research-article

Conference

HPDC'13

Acceptance Rates

HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%
Overall Acceptance Rate: 166 of 966 submissions, 17%

Cited By

  • (2023) Gemini: Enabling Multi-Tenant GPU Sharing Based on Kernel Burst Estimation. IEEE Transactions on Cloud Computing 11:1, 854-867. DOI: 10.1109/TCC.2021.3119205. Online publication date: 1-Jan-2023
  • (2023) Scalable Multi-node Fast Fourier Transform on GPUs. SN Computer Science 4:5. DOI: 10.1007/s42979-023-02109-0. Online publication date: 19-Aug-2023
  • (2020) Locality-Centric Data and Threadblock Management for Massive GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1022-1036. DOI: 10.1109/MICRO50266.2020.00086. Online publication date: Oct-2020
  • (2019) CuLDA. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 195-205. DOI: 10.1145/3307681.3325407. Online publication date: 17-Jun-2019
  • (2018) Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy. Proceedings of the 47th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3225058.3225101. Online publication date: 13-Aug-2018
  • (2017) Supporting Energy-Efficient Computing on Heterogeneous CPU-GPU Architectures. 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud), 134-141. DOI: 10.1109/FiCloud.2017.46. Online publication date: Aug-2017
  • (2017) SunwayMR: A distributed parallel computing framework with convenient data-intensive applications programming. Future Generation Computer Systems 71, 43-56. DOI: 10.1016/j.future.2017.01.018. Online publication date: Jun-2017
  • (2016) MPI-ACC. IEEE Transactions on Parallel and Distributed Systems 27:5, 1401-1414. DOI: 10.1109/TPDS.2015.2446479. Online publication date: 1-May-2016
  • (2016) Landrush. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 32-41. DOI: 10.1109/CCGrid.2016.58. Online publication date: 16-May-2016
  • (2015) High performance computing of fiber scattering simulation. Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 90-98. DOI: 10.1145/2716282.2716285. Online publication date: 7-Feb-2015
