DOI: 10.5555/3327345.3327357

Sparsified SGD with memory

Published: 03 December 2018

Abstract

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). While such schemes have shown very promising performance in practice, they have so far eluded theoretical analysis.
In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) while still converging at the same rate. We present numerical experiments that illustrate the theoretical findings and the good scalability of the scheme in distributed applications.
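
The memory mechanism described in the abstract is simple to state: whatever part of the update the compressor drops is stored locally and added back to the next update before compressing again. Below is a minimal single-worker NumPy sketch of top-k sparsified SGD with such a memory, included purely as an illustration of the idea; the helper names (top_k, grad_fn, mem_sgd) and the hyperparameters are illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def mem_sgd(grad_fn, x0, k, lr, num_steps):
    """Sparsified SGD with memory (error compensation), single-worker sketch.

    grad_fn(x, t) returns a stochastic gradient at x. Only the top-k entries
    of the error-corrected step are applied (these are what would be
    communicated in a distributed run); everything that is dropped is kept
    in the memory vector m and added back at the next iteration.
    """
    x = x0.astype(float).copy()
    m = np.zeros_like(x)               # accumulated, not-yet-applied error
    for t in range(num_steps):
        g = grad_fn(x, t)
        p = m + lr * g                 # error-corrected update
        delta = top_k(p, k)            # sparse part: applied / communicated
        m = p - delta                  # remember what was dropped
        x = x - delta
    return x
```

In a distributed setting each worker would hold its own memory vector and exchange only the k nonzero entries of delta, which is where the communication reduction of roughly a factor d/k (with d the problem dimension) comes from.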

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems
December 2018
11021 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Cited By

  • (2022) LR-SGD: Layer-based Random SGD for Distributed Deep Learning. Proceedings of the 8th International Conference on Computing and Artificial Intelligence, pages 6-11. DOI: 10.1145/3532213.3532215. Online publication date: 18-Mar-2022.
  • (2022) Communication-Efficient Federated Learning with Adaptive Quantization. ACM Transactions on Intelligent Systems and Technology, 13(4):1-26. DOI: 10.1145/3510587. Online publication date: 4-Aug-2022.
  • (2022) Increasing Ising Machine Capacity with Multi-Chip Architectures. Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 508-521. DOI: 10.1145/3470496.3527414. Online publication date: 18-Jun-2022.
  • (2021) DANCE: Distributed Generative Adversarial Networks with Communication Compression. ACM Transactions on Internet Technology, 22(2):1-32. DOI: 10.1145/3458929. Online publication date: 22-Oct-2021.
  • (2021) Towards Communication-Efficient and Attack-Resistant Federated Edge Learning for Industrial Internet of Things. ACM Transactions on Internet Technology, 22(3):1-22. DOI: 10.1145/3453169. Online publication date: 6-Dec-2021.
  • (2020) WOR and p's. Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 21092-21104. DOI: 10.5555/3495724.3497495. Online publication date: 6-Dec-2020.
  • (2019) Qsparse-local-SGD. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 14695-14706. DOI: 10.5555/3454287.3455603. Online publication date: 8-Dec-2019.
  • (2019) Communication-Efficient Distributed SGD with Sketching. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13142-13152. DOI: 10.5555/3454287.3455465. Online publication date: 8-Dec-2019.
  • (2019) Double Quantization for Communication-Efficient Distributed Optimization. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 4438-4449. DOI: 10.5555/3454287.3454686. Online publication date: 8-Dec-2019.
  • (2019) Faster Distributed Deep Net Training. Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4582-4589. DOI: 10.5555/3367471.3367680. Online publication date: 10-Aug-2019.
