Abstract
NVSHMEM is an implementation of OpenSHMEM for NVIDIA GPUs that allows communication to be issued from inside CUDA kernels. In this work, we present a Breadth First Search (BFS) implementation for multi-GPU systems using NVSHMEM and analyze the benefits and bottlenecks of moving fine-grained communication into CUDA kernels. In the best case, our BFS implementation achieves a 75% performance improvement over a CUDA-aware MPI-based implementation.
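To illustrate the GPU-centric communication model the abstract refers to, the following is a minimal sketch (not the authors' code) of device-initiated communication using the publicly documented NVSHMEM API. Function names such as nvshmem_int_p(), nvshmem_my_pe(), and nvshmem_malloc() follow the current library and may differ from the prototype evaluated in the paper; the ring-exchange pattern is chosen only for brevity.

// Minimal NVSHMEM sketch: each PE's kernel writes directly into the
// symmetric buffer of the next PE, without returning to the host to
// issue communication. Assumes the current NVSHMEM API.
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

__global__ void ring_put(int *sym_buf, int n)
{
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fine-grained, one-sided put issued from inside the CUDA kernel.
        nvshmem_int_p(&sym_buf[i], mype, peer);
    }
}

int main()
{
    nvshmem_init();
    const int n = 1024;

    // Symmetric allocation: the same buffer exists on every PE.
    int *sym_buf = (int *) nvshmem_malloc(n * sizeof(int));

    ring_put<<<(n + 255) / 256, 256>>>(sym_buf, n);
    cudaDeviceSynchronize();

    // Ensure all remote puts are complete and visible before reuse.
    nvshmem_barrier_all();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}

In an MPI-based BFS, the frontier exchange in this style would instead require staging data back to the host-driven communication path between kernel launches; moving the puts into the kernel is the source of the fine-grained communication the paper analyzes.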
References
Graph 500: Graph 500 benchmark specification 1.2 (2017). http://www.graph500.org/
Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU graph traversal. SIGPLAN Not. 47, 117–128 (2012)
Bisson, M., Bernaschi, M., Mastrostefano, E.: Parallel distributed breadth first search on the Kepler architecture. CoRR abs/1408.1605 (2014)
Potluri, S., Rossetti, D., Becker, D., Poole, D., Gorentla Venkata, M., Hernandez, O., Shamis, P., Lopez, M.G., Baker, M., Poole, W.: Exploring openSHMEM model to program GPU-based extreme-scale systems. In: Gorentla Venkata, M., Shamis, P., Imam, N., Lopez, M.G. (eds.) OpenSHMEM 2014. LNCS, vol. 9397, pp. 18–35. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26428-8_2
NVIDIA: GPUDirect (2015). https://developer.nvidia.com/gpudirect
NVIDIA: GPUDirect RDMA (2015). http://docs.nvidia.com/cuda/gpudirect-rdma
Rossetti, D.: GPUDirect: integrating the GPU with a network interface. In: GPU Technology Conference (2015)
Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., Panda, D.K.: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput. Sci. 26, 257–266 (2011)
Potluri, S., Hamidouche, K., Venkatesh, A., Bureddy, D., Panda, D.K.: Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In: Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP 2013, Washington, DC, USA, pp. 80–89. IEEE Computer Society (2013)
MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE (2015). http://mvapich.cse.ohio-state.edu
Aji, A.M., Dinan, J., Buntinas, D., Balaji, P., Feng, W.C., Bisset, K.R., Thakur, R.: MPI-ACC: an integrated and extensible approach to data movement in accelerator-based systems. In: 14th IEEE International Conference on High Performance Computing and Communications, Liverpool, UK (2012)
Potluri, S., Bureddy, D., Wang, H., Subramoni, H., Panda, D.K.: Extending openSHMEM for GPU computing. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS 2013, Washington, DC, USA, pp. 1001–1012. IEEE Computer Society (2013)
Cunningham, D., Bordawekar, R., Saraswat, V.: GPU programming in a high level language: compiling X10 to CUDA. In: Proceedings of the 2011 ACM SIGPLAN X10 Workshop, X10 2011, pp. 8:1–8:10. ACM, New York (2011)
Miyoshi, T., Irie, H., Shima, K., Honda, H., Kondo, M., Yoshinaga, T.: Flat: a GPU programming framework to provide embedded MPI. In: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pp. 20–29. ACM, New York (2012)
Ueno, K., Suzumura, T.: Parallel distributed breadth first search on GPU. In: 20th Annual International Conference on High Performance Computing, HiPC 2013, Bengaluru (Bangalore), Karnataka, India, 18–21 December 2013, pp. 314–323 (2013)
Matsuoka, S.: Making TSUBAME2.0, the world’s greenest production supercomputer, even greener: challenges to the architects. In: Proceedings of the 2011 International Symposium on Low Power Electronics and Design, Fukuoka, Japan, 1–3 August 2011, pp. 367–368 (2011)
Bisson, M., Bernaschi, M., Mastrostefano, E.: Parallel distributed breadth first search on the Kepler architecture. IEEE Trans. Parallel Distrib. Syst. 27, 2091–2102 (2016)
Pan, Y., Wang, Y., Wu, Y., Yang, C., Owens, J.D.: Multi-GPU graph analytics. CoRR abs/1504.04804 (2015)
Acknowledgments
This research is supported in part by Oak Ridge National Laboratory, subcontract #4000145249. We would like to thank M. Bisson et al., authors of the multi-GPU BFS implementation used as the baseline in this paper [17], for sharing their code and supporting this work.