Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems

  • Conference paper
  • First Online:
  • Part of: Algorithms and Architectures for Parallel Processing (ICA3PP 2023)

Abstract

Parallel and distributed Deep Neural Network (DNN) training has become integral to data centers, significantly reducing DNN training time. The type of interconnect among nodes and the choice of all-reduce algorithm critically affect this speed-up. This paper examines how the efficiency of distributed DNN training differs between optical and electrical interconnect systems under various all-reduce algorithms. We first analyze the Ring and Recursive Doubling (RD) all-reduce algorithms in both systems, then formulate a communication cost model for each algorithm, and finally compare their performance through extensive experiments. Our results reveal that, in 1024-node systems, the Ring algorithm outperforms the RD algorithm on optical and electrical interconnects when the data transferred exceeds 64 MB and 1024 MB, respectively. We also find that, across four different DNNs, both the Ring and RD algorithms reduce average communication time by around 75% on optical interconnect systems compared to electrical ones. Interestingly, the communication time of the RD algorithm, but not the Ring algorithm, decreases as the number of wavelengths in the optical interconnect increases. These findings provide valuable insights into optimizing DNN training across interconnect systems and lay the groundwork for future research.
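To make the reported crossover behavior concrete, the sketch below estimates all-reduce communication time for both algorithms under the classical latency-bandwidth (alpha-beta) cost model from the collective-communication literature. This is a minimal illustration, not the cost model derived in the paper: the function names, the alpha and beta values, and the omission of optical-specific effects such as wavelength count and circuit reconfiguration are all assumptions made here for exposition.

```python
import math

# Illustrative alpha-beta cost model (NOT the paper's model):
# alpha = per-message latency (s), beta = per-byte transfer time (s).

def ring_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Ring all-reduce: 2(p-1) steps, each node sending n/p bytes per step."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) * (n_bytes / p) * beta

def rd_allreduce_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Recursive doubling all-reduce: log2(p) steps, each exchanging all n bytes."""
    return math.log2(p) * (alpha + n_bytes * beta)

if __name__ == "__main__":
    p = 1024                    # node count used in the paper's experiments
    alpha, beta = 5e-6, 1e-10   # hypothetical latency and per-byte cost
    for mb in (1, 16, 64, 256, 1024):
        n = mb * 2**20
        ring = ring_allreduce_time(p, n, alpha, beta)
        rd = rd_allreduce_time(p, n, alpha, beta)
        print(f"{mb:5d} MB: ring={ring*1e3:9.2f} ms  rd={rd*1e3:9.2f} ms  "
              f"-> {'Ring' if ring < rd else 'RD'} faster")
```

The qualitative behavior matches the abstract: the Ring algorithm's bandwidth term grows roughly as 2·n·beta regardless of node count, while RD's grows as n·beta·log2(p), so Ring wins once messages are large enough to amortize its much larger 2(p-1)·alpha latency term. The specific crossover points reported in the paper (64 MB optical, 1024 MB electrical) reflect measured interconnect parameters rather than the hypothetical values used here.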



Acknowledgements

We thank the reviewers for the time and effort they devoted to reviewing this manuscript. We also acknowledge the use of New Zealand eScience Infrastructure (NeSI) high-performance computing facilities as part of this research (Project code: uoo03633).

Author information


Corresponding author

Correspondence to Fei Dai.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Dai, F., Chen, Y., Huang, Z., Zhang, H., Tian, H. (2024). Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14487. Springer, Singapore. https://doi.org/10.1007/978-981-97-0834-5_23


  • DOI: https://doi.org/10.1007/978-981-97-0834-5_23

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0833-8

  • Online ISBN: 978-981-97-0834-5

  • eBook Packages: Computer Science, Computer Science (R0)
