Communication-Efficient Distributed Optimization Algorithms


Abstract

In federated learning, the communication link connecting the edge parties with the central aggregator is sometimes bandwidth-limited and can have high network latency. There is therefore a critical need to design and deploy communication-efficient distributed training algorithms. In this chapter, we review two orthogonal families of communication-efficient distributed stochastic gradient descent (SGD) methods: (1) local-update SGD, where clients make multiple local model updates that are only periodically aggregated, and (2) gradient compression and sparsification methods, which reduce the number of bits transmitted per update. Both families trade off error convergence with respect to the number of iterations against communication efficiency.
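To make the two ideas in the abstract concrete, the following minimal sketch (not taken from the chapter) simulates local-update SGD with periodic averaging and a top-k sparsified model update on a toy least-squares problem. The helper names (local step count tau, sparsity level k, top_k_sparsify, the synthetic data) are illustrative assumptions, not the chapter's implementation.

```python
# Minimal sketch, assuming a toy quadratic objective per client:
# (1) each client runs tau local SGD steps before communicating,
# (2) the transmitted model update is top-k sparsified to cut bits per round.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, tau, lr, k = 20, 4, 5, 0.05, 5   # tau = local steps, k = kept coordinates

# Synthetic per-client data for f_i(w) = ||A_i w - b_i||^2 / (2m)
data = []
for _ in range(n_clients):
    A = rng.normal(size=(100, d))
    b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=100)
    data.append((A, b))

def stochastic_grad(w, A, b, batch=10):
    """Mini-batch gradient of the least-squares loss."""
    idx = rng.choice(len(b), batch, replace=False)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ w - bb) / batch

def top_k_sparsify(g, k):
    """Keep only the k largest-magnitude coordinates; zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

w_global = np.zeros(d)
for rnd in range(50):                          # communication rounds
    updates = []
    for A, b in data:                          # each client, conceptually in parallel
        w = w_global.copy()
        for _ in range(tau):                   # tau local SGD steps, no communication
            w -= lr * stochastic_grad(w, A, b)
        delta = w - w_global                   # model update to upload
        updates.append(top_k_sparsify(delta, k))  # compress before transmission
    w_global += np.mean(updates, axis=0)       # server-side aggregation (averaging)
```

Increasing tau or decreasing k reduces communication per round but can slow error convergence per iteration, which is the trade-off the chapter analyzes.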



Author information

Corresponding author: Shiqiang Wang


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Joshi, G., Wang, S. (2022). Communication-Efficient Distributed Optimization Algorithms. In: Ludwig, H., Baracaldo, N. (eds) Federated Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-96896-0_6


  • DOI: https://doi.org/10.1007/978-3-030-96896-0_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96895-3

  • Online ISBN: 978-3-030-96896-0

  • eBook Packages: Computer Science, Computer Science (R0)
