DOI: 10.5555/3327345.3327357

Sparsified SGD with memory

Published: 03 December 2018

Abstract

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). While such schemes have shown very promising performance in practice, they have so far eluded theoretical analysis.
In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) while still converging at the same rate. We present numerical experiments that illustrate the theoretical findings and the good scalability of the scheme in distributed applications.
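
The memory mechanism described in the abstract is simple to state: whatever part of the update the compressor drops is stored locally and added back to the next update before compressing again. Below is a minimal single-worker NumPy sketch of top-k sparsified SGD with such a memory, included purely as an illustration of the idea; the helper names (top_k, grad_fn, mem_sgd) and the hyperparameters are illustrative choices and are not taken from the authors' implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def mem_sgd(grad_fn, x0, k, lr, num_steps):
    """Sparsified SGD with memory (error compensation), single-worker sketch.

    grad_fn(x, t) returns a stochastic gradient at x. Only the top-k entries
    of the error-corrected step are applied (these are what would be
    communicated in a distributed run); everything that is dropped is kept
    in the memory vector m and added back at the next iteration.
    """
    x = x0.astype(float).copy()
    m = np.zeros_like(x)               # accumulated, not-yet-applied error
    for t in range(num_steps):
        g = grad_fn(x, t)
        p = m + lr * g                 # error-corrected update
        delta = top_k(p, k)            # sparse part: applied / communicated
        m = p - delta                  # remember what was dropped
        x = x - delta
    return x
```

In a distributed setting each worker would hold its own memory vector and exchange only the k nonzero entries of delta, which is where the communication reduction of roughly a factor d/k (with d the problem dimension) comes from.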

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems
December 2018
11021 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Cited By

  • (2022) LR-SGD: Layer-based Random SGD for Distributed Deep Learning. Proceedings of the 8th International Conference on Computing and Artificial Intelligence, pages 6-11. DOI: 10.1145/3532213.3532215. Online publication date: 18-Mar-2022.
  • (2022) Communication-Efficient Federated Learning with Adaptive Quantization. ACM Transactions on Intelligent Systems and Technology, 13(4):1-26. DOI: 10.1145/3510587. Online publication date: 4-Aug-2022.
  • (2022) Increasing Ising Machine Capacity with Multi-Chip Architectures. Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 508-521. DOI: 10.1145/3470496.3527414. Online publication date: 18-Jun-2022.
  • (2021) DANCE: Distributed Generative Adversarial Networks with Communication Compression. ACM Transactions on Internet Technology, 22(2):1-32. DOI: 10.1145/3458929. Online publication date: 22-Oct-2021.
  • (2021) Towards Communication-Efficient and Attack-Resistant Federated Edge Learning for Industrial Internet of Things. ACM Transactions on Internet Technology, 22(3):1-22. DOI: 10.1145/3453169. Online publication date: 6-Dec-2021.
  • (2020) WOR and p's. Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 21092-21104. DOI: 10.5555/3495724.3497495. Online publication date: 6-Dec-2020.
  • (2019) Qsparse-local-SGD. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 14695-14706. DOI: 10.5555/3454287.3455603. Online publication date: 8-Dec-2019.
  • (2019) Communication-Efficient Distributed SGD with Sketching. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13142-13152. DOI: 10.5555/3454287.3455465. Online publication date: 8-Dec-2019.
  • (2019) Double Quantization for Communication-Efficient Distributed Optimization. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 4438-4449. DOI: 10.5555/3454287.3454686. Online publication date: 8-Dec-2019.
  • (2019) Faster Distributed Deep Net Training. Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4582-4589. DOI: 10.5555/3367471.3367680. Online publication date: 10-Aug-2019.
