Scaling up stochastic gradient descent for non-convex optimisation

Published: 01 November 2022

Abstract

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions and large datasets. We address the bottleneck problem arising when using both shared and distributed memory. Typically, the former is bounded by limited computation resources and bandwidth, whereas the latter suffers from communication overheads. We propose a unified distributed and parallel implementation of SGD (named DPSGD) that relies on both asynchronous distribution and lock-free parallelism. By combining the two strategies into a unified framework, DPSGD is able to strike a better trade-off between local computation and communication. The convergence properties of DPSGD are studied for non-convex problems such as those arising in statistical modelling and machine learning. Our theoretical analysis shows that DPSGD leads to speed-up with respect to the number of cores and the number of workers while guaranteeing an asymptotic convergence rate of O(1/T), given that the number of cores is bounded by T^{1/4} and the number of workers is bounded by T^{1/2}, where T is the number of iterations. The potential gains achievable by DPSGD are demonstrated empirically on a stochastic variational inference problem (Latent Dirichlet Allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C), resulting in two algorithms: DPSVI and HSA2C. Empirical results validate our theoretical findings. Comparative studies are conducted to show the performance of the proposed DPSGD against state-of-the-art DRL algorithms.
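The abstract describes DPSGD's two ingredients only at a high level. As a rough illustration of how they fit together, here is a minimal Python sketch combining lock-free (Hogwild!-style) in-place updates inside one worker with asynchronous, server-side folding of worker results. This is not the authors' DPSGD implementation: the toy objective, the names (stochastic_grad, run_worker), and the averaging rule are all hypothetical, and the asynchrony is simulated serially.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex objective f(w) = E_i[sin(x_i . w)]; its stochastic
# gradient at a sampled point x_i is cos(x_i . w) * x_i.
X = rng.normal(size=(10_000, 16))

def stochastic_grad(w):
    x = X[rng.integers(len(X))]
    return np.cos(x @ w) * x

def lock_free_thread(w, steps, lr):
    # Hogwild!-style: every thread updates the SAME array in place,
    # with no locks; occasional races are tolerated.
    for _ in range(steps):
        w -= lr * stochastic_grad(w)   # in-place, unsynchronised

def run_worker(w_server, n_threads=4, steps=1000, lr=1e-3):
    # One distributed worker: pull the current server parameters, run
    # several lock-free threads on a shared local copy, return the
    # result. (CPython's GIL serialises the numpy updates here, so this
    # only illustrates the access pattern, not true core-level speed-up.)
    w_local = w_server.copy()
    threads = [threading.Thread(target=lock_free_thread,
                                args=(w_local, steps, lr))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w_local

# Asynchronous distribution, simulated serially: the server folds in each
# worker's result as soon as it arrives instead of waiting for stragglers,
# so some updates are computed from stale parameters.
w = rng.normal(size=16)
for _ in range(3):
    w = 0.5 * (w + run_worker(w))   # hypothetical stale-tolerant averaging
```

The regime this sketch mimics, staleness from asynchronous workers plus races from lock-free threads, is exactly what the paper's analysis covers, with the T^{1/4} core bound and T^{1/2} worker bound controlling how much of each the stated rate can absorb.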



Published In

Machine Learning, Volume 111, Issue 11
November 2022, 396 pages

Publisher

Kluwer Academic Publishers, United States

Publication History

Published: 01 November 2022
Accepted: 07 July 2022
Revision received: 12 April 2022
Received: 05 July 2020

        Author Tags

        1. Stochastic gradient descent
2. Large-scale non-convex optimisation
        3. Distributed and parallel computation
        4. Variational inference
        5. Deep reinforcement learning

        Qualifiers

        • Research-article
