
On the efficacy of GPU-integrated MPI for scientific applications

Published: 17 June 2013. DOI: 10.1145/2493123.2462915

Abstract

Scientific computing applications are quickly adapting to leverage the massive parallelism of GPUs in large-scale clusters. However, current hybrid programming models require application developers to explicitly manage the disjoint host and GPU memories, which reduces both efficiency and productivity. Consequently, GPU-integrated MPI solutions, such as MPI-ACC and MVAPICH2-GPU, have been developed to provide unified programming interfaces and optimized implementations for end-to-end data communication among CPUs and GPUs. To date, however, there has been no in-depth characterization of the new optimization spaces or of the productivity impact of such GPU-integrated communication systems on scientific applications.
In this paper, we study the efficacy of GPU-integrated MPI on scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We use MPI-ACC as an example implementation and demonstrate how the programmer can seamlessly choose either the CPU or the GPU as the logical communication endpoint, depending on the application's computational requirements. MPI-ACC also encourages programmers to explore novel application-specific optimizations, such as internode CPU-GPU communication with concurrent CPU-GPU computation, which can improve overall cluster utilization. Furthermore, MPI-ACC internally implements scalable memory management techniques, decoupling the low-level memory optimizations from the applications and making them scalable and portable across several architectures. Experimental results from a state-of-the-art cluster with hundreds of GPUs show that the new MPI-ACC-driven application-specific optimizations can improve the performance of an epidemiology simulation by up to 61.6% and that of a seismology modeling application by up to 44%, compared with traditional hybrid MPI+GPU implementations. We conclude that GPU-integrated MPI significantly enhances programmer productivity and has the potential to improve the performance and portability of scientific applications, a significant step toward making GPUs 'first-class citizens' of hybrid CPU-GPU clusters.
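
As a concrete illustration of the interface difference the abstract describes, the following sketch contrasts the traditional hybrid MPI+GPU pattern, in which the application stages GPU data through a host buffer around every MPI call, with a GPU-integrated call that treats the device buffer as the logical communication endpoint. This is a minimal sketch under stated assumptions, not code from the paper: the buffer names and message size are invented, it requires an MPI library that accepts device pointers (in the style of MVAPICH2-GPU), and MPI-ACC's actual mechanism for identifying GPU buffers differs in its details.

    /* Hedged sketch (not from the paper): manual host staging vs. a
     * GPU-integrated MPI call that takes the device pointer directly.
     * Requires a CUDA-aware MPI library; names and sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)   /* hypothetical message size: 1M doubles */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf;                                   /* data resides on the GPU */
        cudaMalloc((void **)&d_buf, N * sizeof(double));

        if (rank == 0) {
            /* Traditional hybrid MPI+GPU: the application stages GPU data
             * through a host buffer before every send. */
            double *h_buf = (double *)malloc(N * sizeof(double));
            cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            free(h_buf);
        } else if (rank == 1) {
            /* GPU-integrated MPI: the device buffer is the logical endpoint;
             * staging, pinning, and pipelining happen inside the library. */
            MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

The productivity argument is visible even in this toy example: the staging buffer and the explicit cudaMemcpy disappear from application code, while the library remains free to apply the same or better optimizations internally.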


Reviews

Chris Lupo

In high-performance computing (HPC) environments, heterogeneous systems comprising distributed clusters of multicore nodes with accelerators such as graphics processing units (GPUs) are becoming the norm. Robust and mature programming models exist for distributed computing, as well as for GPU computing; only recently, however, have hybrid programming models been developed that integrate GPU computing capabilities with explicit message passing. Aji et al. characterize the performance and productivity of a specific GPU-integrated message passing interface (MPI) framework, MPI-ACC, in two scientific computing applications: an epidemiology simulation and a seismology modeling application. The primary contribution of their work is a detailed case study of these two applications in which a basic, non-integrated MPI+GPU model is compared with the GPU-integrated MPI-ACC framework.

The authors describe the performance effects of using each cluster node's central processing unit (CPU) concurrently with the node's GPU, rather than using the GPU exclusively. They also evaluate various optimizations, such as data communication patterns that reduce communication overhead and data partitioning that increases concurrency and maximizes GPU memory bandwidth. They evaluate these applications using HPCToolkit and find that the GPU-integrated MPI framework generally outperforms the base MPI+GPU implementations for both applications. The results also show that profiling tools such as HPCToolkit can expose problem areas that, when solved, lead to significant performance improvements.

This paper is thorough and well written, and should be of interest to application developers looking for a detailed analysis of how to use a GPU-integrated MPI framework to build and optimize scientific applications. The authors do not go into detail on GPU kernel implementations; rather, they focus on the interaction between message passing with MPI and the CPU interface to the GPU. MPI-ACC is the only model discussed, so it would be interesting to learn how the techniques applied in this paper translate to other GPU-integrated MPI frameworks.

Online Computing Reviews Service
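
The concurrency optimization described above, using the CPU and the GPU of each node together and hiding internode communication behind GPU work, is essentially a pipelining pattern: while one chunk of results is in flight over MPI, the GPU is already computing the next chunk. The sketch below shows that pattern with a nonblocking send and an asynchronous CUDA stream. It is a hedged illustration only: the compute_chunk kernel, the chunk sizes, and the two-rank producer/consumer setup are hypothetical stand-ins, not the paper's epidemiology or seismology code.

    /* Hedged sketch of pipelined GPU compute + internode MPI communication.
     * Run with at least two ranks; all names and sizes are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK  (1 << 18)   /* hypothetical chunk size (doubles) */
    #define CHUNKS 4

    /* Hypothetical stand-in for the application's per-chunk GPU computation. */
    __global__ void compute_chunk(double *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0 + 1.0;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_data, *h_stage;
        cudaMalloc((void **)&d_data, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaMemset(d_data, 0, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaMallocHost((void **)&h_stage, (size_t)CHUNKS * CHUNK * sizeof(double));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        if (rank == 0) {                        /* producer: compute, then ship */
            MPI_Request reqs[CHUNKS];
            for (int c = 0; c < CHUNKS; ++c) {
                double *d_chunk = d_data + (size_t)c * CHUNK;
                double *h_chunk = h_stage + (size_t)c * CHUNK;

                /* Compute chunk c and drain it to pinned host memory. */
                compute_chunk<<<(CHUNK + 255) / 256, 256, 0, stream>>>(d_chunk, CHUNK);
                cudaMemcpyAsync(h_chunk, d_chunk, CHUNK * sizeof(double),
                                cudaMemcpyDeviceToHost, stream);
                cudaStreamSynchronize(stream);

                /* Nonblocking send: chunk c travels while the GPU starts c+1. */
                MPI_Isend(h_chunk, CHUNK, MPI_DOUBLE, 1, c, MPI_COMM_WORLD, &reqs[c]);
            }
            MPI_Waitall(CHUNKS, reqs, MPI_STATUSES_IGNORE);
        } else if (rank == 1) {                 /* consumer: receive chunk by chunk */
            for (int c = 0; c < CHUNKS; ++c)
                MPI_Recv(h_stage + (size_t)c * CHUNK, CHUNK, MPI_DOUBLE, 0, c,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaStreamDestroy(stream);
        cudaFreeHost(h_stage);
        cudaFree(d_data);
        MPI_Finalize();
        return 0;
    }

With a GPU-integrated MPI such as MPI-ACC, the explicit cudaMemcpyAsync staging step could be dropped and d_chunk handed to MPI_Isend directly, with the library performing equivalent pipelining internally.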


Information & Contributors

Information

Published In

HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN:9781450319102
DOI:10.1145/2493123
General Chairs: Manish Parashar, Jon Weissman
Program Chairs: Dick Epema, Renato Figueiredo

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 June 2013


Author Tags

  1. GPGPU
  2. MPI
  3. MPI-ACC
  4. computational epidemiology
  5. seismology

Qualifiers

  • Research-article

Conference

HPDC'13

Acceptance Rates

HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%
Overall Acceptance Rate: 166 of 966 submissions, 17%

Cited By

  • (2023) Gemini: Enabling Multi-Tenant GPU Sharing Based on Kernel Burst Estimation. IEEE Transactions on Cloud Computing 11:1, 854-867. DOI: 10.1109/TCC.2021.3119205. Online publication date: 1-Jan-2023
  • (2023) Scalable Multi-node Fast Fourier Transform on GPUs. SN Computer Science 4:5. DOI: 10.1007/s42979-023-02109-0. Online publication date: 19-Aug-2023
  • (2020) Locality-Centric Data and Threadblock Management for Massive GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1022-1036. DOI: 10.1109/MICRO50266.2020.00086. Online publication date: Oct-2020
  • (2019) CuLDA. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 195-205. DOI: 10.1145/3307681.3325407. Online publication date: 17-Jun-2019
  • (2018) Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy. Proceedings of the 47th International Conference on Parallel Processing, 1-11. DOI: 10.1145/3225058.3225101. Online publication date: 13-Aug-2018
  • (2017) Supporting Energy-Efficient Computing on Heterogeneous CPU-GPU Architectures. 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud), 134-141. DOI: 10.1109/FiCloud.2017.46. Online publication date: Aug-2017
  • (2017) SunwayMR: A distributed parallel computing framework with convenient data-intensive applications programming. Future Generation Computer Systems 71, 43-56. DOI: 10.1016/j.future.2017.01.018. Online publication date: Jun-2017
  • (2016) MPI-ACC. IEEE Transactions on Parallel and Distributed Systems 27:5, 1401-1414. DOI: 10.1109/TPDS.2015.2446479. Online publication date: 1-May-2016
  • (2016) Landrush. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 32-41. DOI: 10.1109/CCGrid.2016.58. Online publication date: 16-May-2016
  • (2015) High performance computing of fiber scattering simulation. Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 90-98. DOI: 10.1145/2716282.2716285. Online publication date: 7-Feb-2015
