DOI: 10.1145/3650200.3656636
Research article · Open access
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Published: 03 June 2024

Abstract

GPU-aware collective communication has become a major bottleneck on modern computing platforms as GPU computing power rapidly rises. A traditional approach directly integrates lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, covering both collective computation (Allreduce) and collective data movement (Scatter), outperform NCCL and Cray MPI by up to 4.5× and 28.7×, respectively. Furthermore, an accuracy evaluation with an image-stacking application confirms the high reconstructed-data quality of our accuracy-aware framework.
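The core idea the abstract describes, applying an error-bounded lossy compressor to message payloads inside a collective, can be sketched in a few lines. The following is an illustrative simulation only, not the paper's GPU implementation: the `compress`/`decompress` uniform quantizer and the naive reduce loop are hypothetical stand-ins for a real GPU compressor (e.g., cuSZp) and a real ring Allreduce, and serve only to show why per-hop compression error must be controlled.

```python
import numpy as np

def compress(data, err_bound):
    # Error-bounded uniform quantization: with step 2*err_bound, each
    # reconstructed value differs from the original by at most err_bound.
    return np.round(data / (2 * err_bound)).astype(np.int64)

def decompress(codes, err_bound):
    # Map integer codes back to floating-point approximations.
    return codes.astype(np.float64) * (2 * err_bound)

def compressed_allreduce(buffers, err_bound):
    # Toy reduce-then-broadcast: each rank's contribution passes through
    # one compress/decompress round trip before being summed, modeling
    # the distortion a compression-enabled Allreduce adds per transfer.
    total = np.zeros_like(buffers[0])
    for buf in buffers:
        total += decompress(compress(buf, err_bound), err_bound)
    # Every rank receives the same reduced result.
    return [total.copy() for _ in buffers]
```

Because each of the `n` contributions carries at most `err_bound` of error, the reduced result deviates from the exact sum by at most `n * err_bound` per element, which is the kind of error-propagation bound an accuracy-aware design must track.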


Cited By

  • hZCCL: Accelerating Collective Communication with Co-Designed Homomorphic Compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1–15. https://doi.org/10.1109/SC41406.2024.00110. Online publication date: 17 November 2024.


Published In
ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing
May 2024
582 pages
ISBN:9798400706103
DOI:10.1145/3650200
© 2024 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Collective Communication
  2. Compression
  3. GPU

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '24

Acceptance Rates

ICS '24 paper acceptance rate: 45 of 125 submissions (36%).
Overall acceptance rate: 629 of 2,180 submissions (29%).

Article Metrics

  • Downloads (last 12 months): 405
  • Downloads (last 6 weeks): 103

Reflects downloads up to 19 November 2024.

