FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks

Published: 23 June 2020

Abstract

The performance and efficiency of distributed training of Deep Neural Networks (DNNs) depend heavily on gradient averaging among participating processes, a step bounded by communication costs. There are two major approaches to reducing communication overhead: overlapping communication with computation (lossless), or reducing the amount of communicated data (lossy). The lossless approach works well for linear neural architectures such as VGG and AlexNet, but more recent networks such as ResNet and Inception limit the opportunity for such overlapping. Therefore, approaches that reduce the amount of transferred data (lossy) become more suitable. In this paper, we present a novel, explainable lossy method that sparsifies gradients in the frequency domain, together with a new range-based floating-point representation to quantize and further compress gradients. These dynamic techniques strike a balance between compression ratio, accuracy, and computational overhead, and are optimized to maximize performance in heterogeneous environments.
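
As a concrete illustration of the frequency-domain idea, the sketch below (a minimal NumPy example, not the authors' implementation; the function name fft_sparsify and the relative-threshold rule are assumptions) transforms a gradient with an FFT, drops coefficients whose magnitude falls below a threshold θ relative to the largest coefficient, and reconstructs the approximate gradient that receivers would see.

    import numpy as np

    def fft_sparsify(grad, theta):
        """Drop small-magnitude frequency components of a 1-D gradient.

        grad  : 1-D float array of gradient values
        theta : relative threshold in (0, 1]; coefficients with magnitude
                below theta * max(|coeff|) are zeroed before communication
        """
        coeffs = np.fft.rfft(grad)                    # real-input FFT
        magnitude = np.abs(coeffs)
        mask = magnitude >= theta * magnitude.max()   # keep large components
        kept = np.where(mask, coeffs, 0.0)            # sparsify the spectrum
        # Only the surviving (index, value) pairs need to be sent; the
        # receiver rebuilds an approximate gradient with the inverse FFT.
        approx = np.fft.irfft(kept, n=grad.size)
        return approx, int(mask.sum())

    # Example: sparsify a synthetic gradient and check the approximation error.
    g = np.random.randn(1024).astype(np.float32)
    g_hat, kept = fft_sparsify(g, theta=0.3)
    print(f"kept {kept} of {g.size // 2 + 1} frequency components")
    print("relative L2 error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))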
Unlike existing works that strive for a higher compression ratio, we stress the robustness of our methods and provide guidance on recovering accuracy from failures. To achieve this, we prove how the FFT sparsification affects convergence and accuracy, and show that our method is guaranteed to converge when using a diminishing threshold θ during training. Reducing θ can also be used to recover accuracy after a failure. Compared to state-of-the-art lossy methods such as QSGD, TernGrad, and Top-k sparsification, our approach incurs less approximation error and therefore performs better in both wall-clock time and accuracy. On an 8-GPU, InfiniBand-interconnected cluster, our techniques accelerate AlexNet training by up to 2.26x over the no-compression baseline, 1.31x over QSGD, 1.25x over TernGrad, and 1.47x over Top-k sparsification.
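
The convergence argument above requires θ to diminish over the course of training, and lowering θ is also the suggested way to trade compression for accuracy after a failure. The schedule below is only a hypothetical placeholder (the 1/(1 + decay·t) form and its constants are assumptions, not the paper's schedule) showing how a monotonically decreasing θ could be supplied to the sparsifier at each step.

    def theta_schedule(step, theta0=0.5, decay=1e-3):
        # Hypothetical diminishing threshold: theta_t = theta0 / (1 + decay * t).
        # Any schedule that decreases monotonically toward 0 matches the
        # convergence requirement stated above; these constants are illustrative.
        return theta0 / (1.0 + decay * step)

    # As theta shrinks, fewer coefficients are dropped, so the approximation
    # error of the sparsified gradient decreases over the course of training.
    for t in (0, 1_000, 10_000, 100_000):
        print(t, theta_schedule(t))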

Supplementary Material

MP4 File (3369583.3392681.mp4)
Video presentation of the paper "FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks".

Published In

cover image ACM Conferences
HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
June 2020
246 pages
ISBN:9781450370523
DOI:10.1145/3369583
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2020

Author Tags

  1. FFT
  2. gradient compression
  3. lossy gradients
  4. machine learning
  5. neural networks

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

HPDC '20

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 62
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 01 Oct 2024

Cited By

  • (2024) ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 135-147. https://doi.org/10.1145/3625549.3658678. Online publication date: 3-Jun-2024.
  • (2023) Low-Rank Gradient Descent for Memory-Efficient Training of Deep In-Memory Arrays. ACM Journal on Emerging Technologies in Computing Systems, 19(2), 1-24. https://doi.org/10.1145/3577214. Online publication date: 18-May-2023.
  • (2023) Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey. IEEE Transactions on Parallel and Distributed Systems, 34(12), 3294-3308. https://doi.org/10.1109/TPDS.2023.3323282. Online publication date: Dec-2023.
  • (2023) DenseStream: A Novel Data Representation for Gradient Sparsification in Distributed Synchronous SGD Algorithms. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. https://doi.org/10.1109/IJCNN54540.2023.10191729. Online publication date: 18-Jun-2023.
  • (2023) Dynamic layer-wise sparsification for distributed deep learning. Future Generation Computer Systems, 147, 1-15. https://doi.org/10.1016/j.future.2023.04.022. Online publication date: Oct-2023.
  • (2023) NetShield: An in-network architecture against byzantine failures in distributed deep learning. Computer Networks, 237, 110081. https://doi.org/10.1016/j.comnet.2023.110081. Online publication date: Dec-2023.
  • (2022) SSD-SGD: Communication Sparsification for Distributed Deep Learning Training. ACM Transactions on Architecture and Code Optimization, 20(1), 1-25. https://doi.org/10.1145/3563038. Online publication date: 16-Dec-2022.
  • (2022) Near-optimal sparse allreduce for distributed deep learning. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 135-149. https://doi.org/10.1145/3503221.3508399. Online publication date: 2-Apr-2022.
  • (2022) MIPD: An Adaptive Gradient Sparsification Framework for Distributed DNNs Training. IEEE Transactions on Parallel and Distributed Systems, 1-1. https://doi.org/10.1109/TPDS.2022.3154387. Online publication date: 2022.
  • (2022) SaPus: Self-Adaptive Parameter Update Strategy for DNN Training on Multi-GPU Clusters. IEEE Transactions on Parallel and Distributed Systems, 33(7), 1569-1580. https://doi.org/10.1109/TPDS.2021.3118609. Online publication date: 1-Jul-2022.