Abstract
NVSHMEM is an implementation of OpenSHMEM for NVIDIA GPUs that allows communication to be issued from inside CUDA kernels. In this work, we present a Breadth First Search (BFS) implementation for multi-GPU systems using NVSHMEM and analyze the benefits and bottlenecks of moving fine-grained communication into CUDA kernels. In the best case, our BFS implementation achieves a 75% performance improvement over a CUDA-aware MPI-based implementation.
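To illustrate the GPU-centric communication model the abstract refers to, the following is a minimal sketch (not the authors' code) of device-initiated communication using the publicly documented NVSHMEM API. Function names such as nvshmem_int_p(), nvshmem_my_pe(), and nvshmem_malloc() follow the current library and may differ from the prototype evaluated in the paper; the ring-exchange pattern is chosen only for brevity.

// Minimal NVSHMEM sketch: each PE's kernel writes directly into the
// symmetric buffer of the next PE, without returning to the host to
// issue communication. Assumes the current NVSHMEM API.
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

__global__ void ring_put(int *sym_buf, int n)
{
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fine-grained, one-sided put issued from inside the CUDA kernel.
        nvshmem_int_p(&sym_buf[i], mype, peer);
    }
}

int main()
{
    nvshmem_init();
    const int n = 1024;

    // Symmetric allocation: the same buffer exists on every PE.
    int *sym_buf = (int *) nvshmem_malloc(n * sizeof(int));

    ring_put<<<(n + 255) / 256, 256>>>(sym_buf, n);
    cudaDeviceSynchronize();

    // Ensure all remote puts are complete and visible before reuse.
    nvshmem_barrier_all();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}

In an MPI-based BFS, the frontier exchange in this style would instead require staging data back to the host-driven communication path between kernel launches; moving the puts into the kernel is the source of the fine-grained communication the paper analyzes.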
References
Graph 500: Graph 500 benchmark specification 1.2 (2017). http://www.graph500.org/
Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU graph traversal. SIGPLAN Not. 47, 117–128 (2012)
Bisson, M., Bernaschi, M., Mastrostefano, E.: Parallel distributed breadth first search on the Kepler architecture. CoRR abs/1408.1605 (2014)
Potluri, S., Rossetti, D., Becker, D., Poole, D., Gorentla Venkata, M., Hernandez, O., Shamis, P., Lopez, M.G., Baker, M., Poole, W.: Exploring openSHMEM model to program GPU-based extreme-scale systems. In: Gorentla Venkata, M., Shamis, P., Imam, N., Lopez, M.G. (eds.) OpenSHMEM 2014. LNCS, vol. 9397, pp. 18–35. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26428-8_2
NVIDIA: GPUDirect (2015). https://developer.nvidia.com/gpudirect
NVIDIA: GPUDirect RDMA (2015). http://docs.nvidia.com/cuda/gpudirect-rdma
Rossetti, D.: GPUDirect: integrating the GPU with a network interface. In: GPU Technology Conference (2015)
Wang, H., Potluri, S., Luo, M., Singh, A.K., Sur, S., Panda, D.K.: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Comput. Sci. 26, 257–266 (2011)
Potluri, S., Hamidouche, K., Venkatesh, A., Bureddy, D., Panda, D.K.: Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In: Proceedings of the 2013 42nd International Conference on Parallel Processing, ICPP 2013, Washington, DC, USA, pp. 80–89. IEEE Computer Society (2013)
MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE (2015). http://mvapich.cse.ohio-state.edu
Aji, A.M., Dinan, J., Buntinas, D., Balaji, P., Feng, W.C., Bisset, K.R., Thakur, R.: MPI-ACC: an integrated and extensible approach to data movement in accelerator-based systems. In: 14th IEEE International Conference on High Performance Computing and Communications, Liverpool, UK (2012)
Potluri, S., Bureddy, D., Wang, H., Subramoni, H., Panda, D.K.: Extending openSHMEM for GPU computing. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS 2013, Washington, DC, USA, pp. 1001–1012. IEEE Computer Society (2013)
Cunningham, D., Bordawekar, R., Saraswat, V.: GPU programming in a high level language: compiling X10 to CUDA. In: Proceedings of the 2011 ACM SIGPLAN X10 Workshop, X10 2011, pp. 8:1–8:10. ACM, New York (2011)
Miyoshi, T., Irie, H., Shima, K., Honda, H., Kondo, M., Yoshinaga, T.: Flat: a GPU programming framework to provide embedded MPI. In: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU-5, pp. 20–29. ACM, New York (2012)
Ueno, K., Suzumura, T.: Parallel distributed breadth first search on GPU. In: 20th Annual International Conference on High Performance Computing, HiPC 2013, Bengaluru (Bangalore), Karnataka, India, 18–21 December 2013, pp. 314–323 (2013)
Matsuoka, S.: Making TSUBAME2.0, the world’s greenest production supercomputer, even greener: challenges to the architects. In: Proceedings of the 2011 International Symposium on Low Power Electronics and Design, Fukuoka, Japan, 1–3 August 2011, pp. 367–368 (2011)
Bisson, M., Bernaschi, M., Mastrostefano, E.: Parallel distributed breadth first search on the Kepler architecture. IEEE Trans. Parallel Distrib. Syst. 27, 2091–2102 (2016)
Pan, Y., Wang, Y., Wu, Y., Yang, C., Owens, J.D.: Multi-GPU graph analytics. CoRR abs/1504.04804 (2015)
Acknowledgments
This research is supported in part by Oak Ridge National Laboratory, subcontract #4000145249. We would like to thank M. Bisson et al., authors of the multi-GPU BFS implementation used as the baseline in this paper [17], for sharing their code and supporting this work.