Scaling up stochastic gradient descent for non-convex optimisation

Published: 01 November 2022

Abstract

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions and large datasets. We address the bottleneck problem arising when using both shared and distributed memory. Typically, the former is bounded by limited computation resources and bandwidth, whereas the latter suffers from communication overheads. We propose a unified distributed and parallel implementation of SGD (named DPSGD) that relies on both asynchronous distribution and lock-free parallelism. By combining the two strategies into a unified framework, DPSGD is able to strike a better trade-off between local computation and communication. The convergence properties of DPSGD are studied for non-convex problems such as those arising in statistical modelling and machine learning. Our theoretical analysis shows that DPSGD leads to speed-up with respect to the number of cores and the number of workers while guaranteeing an asymptotic convergence rate of O(1/T), given that the number of cores is bounded by T^{1/4} and the number of workers is bounded by T^{1/2}, where T is the number of iterations. The potential gains achievable by DPSGD are demonstrated empirically on a stochastic variational inference problem (Latent Dirichlet Allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C), resulting in two algorithms: DPSVI and HSA2C. Empirical results validate our theoretical findings. Comparative studies are conducted to show the performance of the proposed DPSGD against state-of-the-art DRL algorithms.
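The abstract describes DPSGD's two ingredients only at a high level. As a rough illustration of how they fit together, here is a minimal Python sketch combining lock-free (Hogwild!-style) in-place updates inside one worker with asynchronous, server-side folding of worker results. This is not the authors' DPSGD implementation: the toy objective, the names (stochastic_grad, run_worker), and the averaging rule are all hypothetical, and the asynchrony is simulated serially.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex objective f(w) = E_i[sin(x_i . w)]; its stochastic
# gradient at a sampled point x_i is cos(x_i . w) * x_i.
X = rng.normal(size=(10_000, 16))

def stochastic_grad(w):
    x = X[rng.integers(len(X))]
    return np.cos(x @ w) * x

def lock_free_thread(w, steps, lr):
    # Hogwild!-style: every thread updates the SAME array in place,
    # with no locks; occasional races are tolerated.
    for _ in range(steps):
        w -= lr * stochastic_grad(w)   # in-place, unsynchronised

def run_worker(w_server, n_threads=4, steps=1000, lr=1e-3):
    # One distributed worker: pull the current server parameters, run
    # several lock-free threads on a shared local copy, return the
    # result. (CPython's GIL serialises the numpy updates here, so this
    # only illustrates the access pattern, not true core-level speed-up.)
    w_local = w_server.copy()
    threads = [threading.Thread(target=lock_free_thread,
                                args=(w_local, steps, lr))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w_local

# Asynchronous distribution, simulated serially: the server folds in each
# worker's result as soon as it arrives instead of waiting for stragglers,
# so some updates are computed from stale parameters.
w = rng.normal(size=16)
for _ in range(3):
    w = 0.5 * (w + run_worker(w))   # hypothetical stale-tolerant averaging
```

The regime this sketch mimics, staleness from asynchronous workers plus races from lock-free threads, is exactly what the paper's analysis covers, with the T^{1/4} core bound and T^{1/2} worker bound controlling how much of each the stated rate can absorb.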



Published In

Machine Learning, Volume 111, Issue 11
November 2022, 396 pages

Publisher

Kluwer Academic Publishers, United States

Publication History

Published: 01 November 2022
Accepted: 07 July 2022
Revision received: 12 April 2022
Received: 05 July 2020

        Author Tags

        1. Stochastic gradient descent
2. Large-scale non-convex optimisation
        3. Distributed and parallel computation
        4. Variational inference
        5. Deep reinforcement learning

        Qualifiers

        • Research-article
