DOI: 10.1145/3650200.3656636
Research article · Open access
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Published: 03 June 2024

Abstract

GPU-aware collective communication has become a major bottleneck on modern computing platforms as GPU computing power rapidly rises. A traditional approach directly integrates lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, an accuracy evaluation with an image-stacking application confirms the high reconstructed-data quality of our accuracy-aware framework.
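The core idea the abstract describes, applying an error-bounded lossy compressor to message payloads inside a collective, can be sketched in a few lines. The following is an illustrative simulation only, not the paper's GPU implementation: the `compress`/`decompress` uniform quantizer and the naive reduce loop are hypothetical stand-ins for a real GPU compressor (e.g., cuSZp) and a real ring Allreduce, and serve only to show why per-hop compression error must be controlled.

```python
import numpy as np

def compress(data, err_bound):
    # Error-bounded uniform quantization: with step 2*err_bound, each
    # reconstructed value differs from the original by at most err_bound.
    return np.round(data / (2 * err_bound)).astype(np.int64)

def decompress(codes, err_bound):
    # Map integer codes back to floating-point approximations.
    return codes.astype(np.float64) * (2 * err_bound)

def compressed_allreduce(buffers, err_bound):
    # Toy reduce-then-broadcast: each rank's contribution passes through
    # one compress/decompress round trip before being summed, modeling
    # the distortion a compression-enabled Allreduce adds per transfer.
    total = np.zeros_like(buffers[0])
    for buf in buffers:
        total += decompress(compress(buf, err_bound), err_bound)
    # Every rank receives the same reduced result.
    return [total.copy() for _ in buffers]
```

Because each of the `n` contributions carries at most `err_bound` of error, the reduced result deviates from the exact sum by at most `n * err_bound` per element, which is the kind of error-propagation bound an accuracy-aware design must track.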


Cited By

  • hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1–15. https://doi.org/10.1109/SC41406.2024.00110. Online publication date: 17 November 2024.


Published In
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing
May 2024
582 pages
ISBN:9798400706103
DOI:10.1145/3650200
© 2024 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Collective Communication
  2. Compression
  3. GPU

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '24

Acceptance Rates

ICS '24 paper acceptance rate: 45 of 125 submissions (36%).
Overall acceptance rate: 629 of 2,180 submissions (29%).

Article Metrics

  • Downloads (last 12 months): 405
  • Downloads (last 6 weeks): 103

Reflects downloads up to 19 November 2024.

