Abstract
In federated learning, the communication link connecting the edge parties to the central aggregator can be bandwidth-limited and subject to high network latency. There is therefore a critical need to design and deploy communication-efficient distributed training algorithms. In this chapter, we review two orthogonal communication-efficient distributed stochastic gradient descent (SGD) methods: (1) local-update SGD, where clients perform multiple local model updates that are only periodically aggregated, and (2) gradient compression and sparsification, which reduce the number of bits transmitted per update. Both approaches trade off error convergence with respect to the number of iterations against communication efficiency.
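To make the two approaches concrete, the minimal sketch below (not taken from the chapter) contrasts local-update SGD with periodic model averaging against per-iteration top-k gradient sparsification on a toy least-squares problem. All quantities in it, such as the number of clients, the number of local steps tau, the sparsification level k, and the learning rate, are illustrative assumptions rather than settings used by the authors; refinements such as error-feedback memory are omitted for brevity.

```python
import numpy as np

# Illustrative sketch: local-update SGD vs. top-k sparsified SGD on a toy
# least-squares problem. All hyperparameters below are hypothetical choices.
rng = np.random.default_rng(0)
d, n_per_client, num_clients = 20, 100, 4
w_true = rng.normal(size=d)

# Each client holds its own data shard (X_i, y_i).
data = []
for _ in range(num_clients):
    X = rng.normal(size=(n_per_client, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_client)
    data.append((X, y))

def stochastic_grad(w, X, y, batch=10):
    """Mini-batch gradient of the least-squares loss."""
    idx = rng.integers(0, len(y), size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

def top_k(g, k):
    """Keep only the k largest-magnitude entries of g (sparsification)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def local_update_sgd(rounds=50, tau=5, lr=0.01):
    """Each client runs tau local SGD steps; the server then averages models."""
    w_global = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for X, y in data:
            w = w_global.copy()
            for _ in range(tau):  # tau local steps between communication rounds
                w -= lr * stochastic_grad(w, X, y)
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)  # one communication round
    return w_global

def sparsified_sgd(iters=250, lr=0.01, k=4):
    """Every iteration, each client transmits only a top-k sparsified gradient."""
    w = np.zeros(d)
    for _ in range(iters):
        g = np.mean([top_k(stochastic_grad(w, X, y), k) for X, y in data], axis=0)
        w -= lr * g
    return w

print("local-update SGD error:", np.linalg.norm(local_update_sgd() - w_true))
print("sparsified SGD error:  ", np.linalg.norm(sparsified_sgd() - w_true))
```

In this sketch, local-update SGD communicates full models but only once every tau steps, whereas sparsified SGD communicates every iteration but sends only k of the d coordinates, illustrating the two complementary ways of reducing communication cost discussed in the chapter.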
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Joshi, G., Wang, S. (2022). Communication-Efficient Distributed Optimization Algorithms. In: Ludwig, H., Baracaldo, N. (eds) Federated Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-96896-0_6
Print ISBN: 978-3-030-96895-3
Online ISBN: 978-3-030-96896-0