Small Batch or Large Batch?: Gaussian Walk with Rebound Can Teach

Published: 13 August 2017
DOI: 10.1145/3097983.3098147

Abstract

The efficiency of large-scale learning is a hot topic in both academia and industry. The stochastic gradient descent (SGD) algorithm and its extension, mini-batch SGD, allow the model to be updated without scanning the whole data set. However, the use of approximate gradients introduces uncertainty, which slows the decrease of the objective function. Moreover, this uncertainty can lead to frequent but uninformative model updates, creating a communication bottleneck in parallel learning environments. In this work, we develop a batch-adaptive stochastic gradient descent (BA-SGD) algorithm that dynamically chooses a proper batch size as learning proceeds. In particular, using a Taylor expansion and the central limit theorem, it models the decrease of the objective value as a Gaussian random walk game with rebound. In this game, a heuristic strategy for determining the batch size is adopted to maximize the utility of each round of incremental sampling. Through evaluation on multiple real data sets, we demonstrate that, by choosing the batch size wisely, BA-SGD not only preserves the fast convergence of the SGD algorithm but also avoids overly frequent model updates.
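
The abstract's central idea, viewing the mini-batch gradient through the central limit theorem as an approximately Gaussian estimate of the true gradient whose noise shrinks as the batch grows, and enlarging the batch when that noise would dominate the update, can be sketched in a few lines. The code below is a minimal illustrative sketch, not the paper's BA-SGD: the synthetic logistic-regression data, the batch-doubling rule, and the noise_ratio threshold are assumptions standing in for the utility-maximizing strategy derived in the paper.

```python
# Illustrative sketch only: a batch-size-adaptive SGD loop in the spirit of the
# abstract (per-sample gradient noise estimated via the central limit theorem,
# batch grown when the noise dominates the signal). The data, thresholds, and
# the doubling heuristic are assumptions, not the paper's BA-SGD rule.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic-regression data (hypothetical stand-in for the real data sets).
n, d = 5000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def per_sample_grads(w, idx):
    """Per-sample logistic-loss gradients for the rows in idx (shape: len(idx) x d)."""
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return (p - y[idx])[:, None] * X[idx]

w = np.zeros(d)
lr, batch = 0.1, 32            # initial step size and batch size (assumed values)
max_batch, noise_ratio = 1024, 1.0

for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    G = per_sample_grads(w, idx)
    g = G.mean(axis=0)
    # CLT view: the mini-batch gradient is roughly Gaussian around the true
    # gradient, with total variance ~ per-sample variance / batch size.
    grad_sq = float(g @ g)
    noise_sq = float(G.var(axis=0).sum()) / batch
    # Heuristic (an assumption, not the paper's utility criterion): if the
    # estimated noise is comparable to the signal, sampling more before
    # updating is worth more than another noisy step, so grow the batch.
    if noise_sq > noise_ratio * grad_sq and batch < max_batch:
        batch = min(2 * batch, max_batch)
    else:
        w -= lr * g

loss = np.mean(np.log1p(np.exp(-(2 * y - 1) * (X @ w))))
print(f"final batch size: {batch}, logistic loss: {loss:.4f}")
```

Under this CLT view the ratio of estimated gradient noise to squared gradient norm is a natural, if crude, signal for when a larger batch buys more progress per update than another noisy step would; the paper formalizes this trade-off through the Gaussian-walk-with-rebound model rather than the fixed threshold used here.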

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN: 9781450348874
DOI: 10.1145/3097983

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. batch-adaptive SGD
  2. Gaussian random walk
  3. rebound

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate: 64 of 748 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores. In Proceedings of the 2024 ACM Symposium on Cloud Computing, 995-1011. https://doi.org/10.1145/3698038.3698549
  • (2023) FedUR: Federated Learning Optimization Through Adaptive Centralized Learning Optimizers. IEEE Transactions on Signal Processing, 71, 2622-2637. https://doi.org/10.1109/TSP.2023.3292497
  • (2022) A Comprehensive Survey on Training Acceleration for Large Machine Learning Models in IoT. IEEE Internet of Things Journal, 9(2), 939-963. https://doi.org/10.1109/JIOT.2021.3111624
  • (2021) Learning from imbalanced fetal outcomes of systemic lupus erythematosus in artificial neural networks. BMC Medical Informatics and Decision Making, 21(1). https://doi.org/10.1186/s12911-021-01486-x
  • (2020) ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), 1513-1524. https://doi.org/10.1109/ICDE48307.2020.00134
  • (2020) Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching. In 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS), 188-194. https://doi.org/10.1109/ACSOS49614.2020.00041
  • (2020) An Accelerated Edge Cloud System for Energy Data Stream Processing Based on Adaptive Incremental Deep Learning Scheme. IEEE Access, 8, 195341-195358. https://doi.org/10.1109/ACCESS.2020.3033771
