Small Batch or Large Batch?: Gaussian Walk with Rebound Can Teach

Published: 13 August 2017
DOI: 10.1145/3097983.3098147

Abstract

The efficiency of large-scale learning is a hot topic in both academia and industry. The stochastic gradient descent (SGD) algorithm and its extension, mini-batch SGD, allow the model to be updated without scanning the whole data set. However, the use of approximate gradients introduces uncertainty, which slows the decrease of the objective function. Moreover, this uncertainty can lead to frequent but uninformative model updates, creating a communication bottleneck in parallel learning environments. In this work, we develop a batch-adaptive stochastic gradient descent (BA-SGD) algorithm that dynamically chooses a proper batch size as learning proceeds. In particular, using a Taylor expansion and the central limit theorem, it models the decrease of the objective value as a Gaussian random walk game with rebound. In this game, a heuristic strategy for determining the batch size is adopted to maximize the utility of each round of incremental sampling. Through evaluation on multiple real data sets, we demonstrate that, by choosing the batch size wisely, BA-SGD not only preserves the fast convergence of the SGD algorithm but also avoids overly frequent model updates.
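
The abstract's central idea, viewing the mini-batch gradient through the central limit theorem as an approximately Gaussian estimate of the true gradient whose noise shrinks as the batch grows, and enlarging the batch when that noise would dominate the update, can be sketched in a few lines. The code below is a minimal illustrative sketch, not the paper's BA-SGD: the synthetic logistic-regression data, the batch-doubling rule, and the noise_ratio threshold are assumptions standing in for the utility-maximizing strategy derived in the paper.

```python
# Illustrative sketch only: a batch-size-adaptive SGD loop in the spirit of the
# abstract (per-sample gradient noise estimated via the central limit theorem,
# batch grown when the noise dominates the signal). The data, thresholds, and
# the doubling heuristic are assumptions, not the paper's BA-SGD rule.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic-regression data (hypothetical stand-in for the real data sets).
n, d = 5000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def per_sample_grads(w, idx):
    """Per-sample logistic-loss gradients for the rows in idx (shape: len(idx) x d)."""
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return (p - y[idx])[:, None] * X[idx]

w = np.zeros(d)
lr, batch = 0.1, 32            # initial step size and batch size (assumed values)
max_batch, noise_ratio = 1024, 1.0

for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    G = per_sample_grads(w, idx)
    g = G.mean(axis=0)
    # CLT view: the mini-batch gradient is roughly Gaussian around the true
    # gradient, with total variance ~ per-sample variance / batch size.
    grad_sq = float(g @ g)
    noise_sq = float(G.var(axis=0).sum()) / batch
    # Heuristic (an assumption, not the paper's utility criterion): if the
    # estimated noise is comparable to the signal, sampling more before
    # updating is worth more than another noisy step, so grow the batch.
    if noise_sq > noise_ratio * grad_sq and batch < max_batch:
        batch = min(2 * batch, max_batch)
    else:
        w -= lr * g

loss = np.mean(np.log1p(np.exp(-(2 * y - 1) * (X @ w))))
print(f"final batch size: {batch}, logistic loss: {loss:.4f}")
```

Under this CLT view the ratio of estimated gradient noise to squared gradient norm is a natural, if crude, signal for when a larger batch buys more progress per update than another noisy step would; the paper formalizes this trade-off through the Gaussian-walk-with-rebound model rather than the fixed threshold used here.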

Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN: 9781450348874
DOI: 10.1145/3097983

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. batch-adaptive SGD
  2. Gaussian random walk
  3. rebound

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 Paper Acceptance Rate: 64 of 748 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores. In Proceedings of the 2024 ACM Symposium on Cloud Computing, 995-1011. https://doi.org/10.1145/3698038.3698549
  • (2023) FedUR: Federated Learning Optimization Through Adaptive Centralized Learning Optimizers. IEEE Transactions on Signal Processing, 71, 2622-2637. https://doi.org/10.1109/TSP.2023.3292497
  • (2022) A Comprehensive Survey on Training Acceleration for Large Machine Learning Models in IoT. IEEE Internet of Things Journal, 9(2), 939-963. https://doi.org/10.1109/JIOT.2021.3111624
  • (2021) Learning from imbalanced fetal outcomes of systemic lupus erythematosus in artificial neural networks. BMC Medical Informatics and Decision Making, 21(1). https://doi.org/10.1186/s12911-021-01486-x
  • (2020) ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), 1513-1524. https://doi.org/10.1109/ICDE48307.2020.00134
  • (2020) Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching. In 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS), 188-194. https://doi.org/10.1109/ACSOS49614.2020.00041
  • (2020) An Accelerated Edge Cloud System for Energy Data Stream Processing Based on Adaptive Incremental Deep Learning Scheme. IEEE Access, 8, 195341-195358. https://doi.org/10.1109/ACCESS.2020.3033771
