DOI: 10.5555/3327345.3327412
Article

LAG: lazily aggregated gradient for communication-efficient distributed learning

Published: 03 December 2018

Abstract

This paper presents a new class of gradient methods for distributed machine learning that adaptively skip the gradient calculations to learn with reduced communication and computation. Simple rules are designed to detect slowly-varying gradients and, therefore, trigger the reuse of outdated gradients. The resultant gradient-based algorithms are termed Lazily Aggregated Gradient — justifying our acronym LAG used henceforth. Theoretically, the merits of this contribution are: i) the convergence rate is the same as batch gradient descent in strongly-convex, convex, and nonconvex cases; and, ii) if the distributed datasets are heterogeneous (quantified by certain measurable constants), the communication rounds needed to achieve a targeted accuracy are reduced thanks to the adaptive reuse of lagged gradients. Numerical experiments on both synthetic and real data corroborate a significant communication reduction compared to alternatives.
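
To make the skipping rule concrete, the following is a minimal Python sketch of the lazy-aggregation idea described in the abstract. It assumes a synchronous server with one full gradient per worker, and the fixed drift threshold used as the trigger is an illustrative stand-in for the paper's adaptive condition; the names (lag_sketch, drift_tol, and so on) are hypothetical, not from the paper.

import numpy as np

def lag_sketch(grad_fns, x0, lr=0.01, drift_tol=1e-3, num_iters=200):
    """LAG-style loop: each worker uploads its gradient only when it has changed enough."""
    x = np.asarray(x0, dtype=float).copy()
    # Server-side copies of the most recently communicated gradient of each worker.
    last_sent = [g(x) for g in grad_fns]   # round 0: every worker communicates
    uploads = len(grad_fns)

    for _ in range(num_iters):
        for m, g in enumerate(grad_fns):
            fresh = g(x)                   # computed locally at worker m
            # Communicate only if the local gradient drifted noticeably since the
            # last upload; otherwise the server silently reuses the outdated copy.
            if np.linalg.norm(fresh - last_sent[m]) > drift_tol:
                last_sent[m] = fresh
                uploads += 1
        # Server step with the lazily aggregated (possibly lagged) gradients.
        x = x - lr * np.sum(last_sent, axis=0)
    return x, uploads

# Toy usage: three workers with quadratic losses f_m(x) = 0.5 * ||x - c_m||^2,
# whose gradients are x - c_m; the minimizer of the sum is the mean of the c_m.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, c=c: x - c for c in centers]
x_star, num_uploads = lag_sketch(grads, x0=np.zeros(2))
print(x_star, num_uploads)   # x_star approaches the mean of the centers; uploads stall as gradients stop varying

Note that the paper's actual trigger adapts to how quickly the local gradients vary (the measurable heterogeneity constants mentioned in the abstract) rather than using a fixed tolerance, which is what ties the communication savings to dataset heterogeneity.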

Cited By

  • Matrix Exponential Learning Schemes With Low Informational Exchange, IEEE Transactions on Signal Processing, vol. 67, no. 12, pp. 3140-3153, Jun. 2019. DOI: 10.1109/TSP.2019.2912875

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems
December 2018
11021 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States
