DOI: 10.5555/3327345.3327412
Article

LAG: lazily aggregated gradient for communication-efficient distributed learning

Published: 03 December 2018

Abstract

This paper presents a new class of gradient methods for distributed machine learning that adaptively skip the gradient calculations to learn with reduced communication and computation. Simple rules are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients. The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and, ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a significant communication reduction compared to alternatives.
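
To make the skipping rule concrete, the following is a minimal Python sketch of the lazy-aggregation idea described in the abstract. It assumes a synchronous server with one full gradient per worker, and the fixed drift threshold used as the trigger is an illustrative stand-in for the paper's adaptive condition; the names (lag_sketch, drift_tol, and so on) are hypothetical, not from the paper.

import numpy as np

def lag_sketch(grad_fns, x0, lr=0.01, drift_tol=1e-3, num_iters=200):
    """LAG-style loop: each worker uploads its gradient only when it has changed enough."""
    x = np.asarray(x0, dtype=float).copy()
    # Server-side copies of the most recently communicated gradient of each worker.
    last_sent = [g(x) for g in grad_fns]   # round 0: every worker communicates
    uploads = len(grad_fns)

    for _ in range(num_iters):
        for m, g in enumerate(grad_fns):
            fresh = g(x)                   # computed locally at worker m
            # Communicate only if the local gradient drifted noticeably since the
            # last upload; otherwise the server silently reuses the outdated copy.
            if np.linalg.norm(fresh - last_sent[m]) > drift_tol:
                last_sent[m] = fresh
                uploads += 1
        # Server step with the lazily aggregated (possibly lagged) gradients.
        x = x - lr * np.sum(last_sent, axis=0)
    return x, uploads

# Toy usage: three workers with quadratic losses f_m(x) = 0.5 * ||x - c_m||^2,
# whose gradients are x - c_m; the minimizer of the sum is the mean of the c_m.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, c=c: x - c for c in centers]
x_star, num_uploads = lag_sketch(grads, x0=np.zeros(2))
print(x_star, num_uploads)   # x_star approaches the mean of the centers; uploads stall as gradients stop varying

Note that the paper's actual trigger adapts to how quickly the local gradients vary (the measurable heterogeneity constants mentioned in the abstract) rather than using a fixed tolerance, which is what ties the communication savings to dataset heterogeneity.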

Cited By

  • Matrix Exponential Learning Schemes With Low Informational Exchange, IEEE Transactions on Signal Processing, vol. 67, no. 12, pp. 3140-3153, Jun. 2019. DOI: 10.1109/TSP.2019.2912875

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems
December 2018
11021 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States
