Abstract
Parallel and distributed Deep Neural Network (DNN) training has become integral to data centers, significantly reducing DNN training time. The interconnection type among nodes and the chosen all-reduce algorithm critically impact this speed-up. This paper examines the efficiency differences in distributed DNN training across optical and electrical interconnect systems using various all-reduce algorithms. We first explore the Ring and Recursive Doubling (RD) all-reduce algorithms in both systems and then formulate a communication cost model for these algorithms. Performance is then compared via extensive experiments. Our results reveal that, in 1024-node systems, the Ring algorithm outperforms the RD algorithm in optical and electrical interconnects when the data transfer exceeds 64 MB and 1024 MB, respectively. We also find that both the Ring and RD algorithms in optical interconnect systems reduce average communication time by around 75% compared to electrical interconnect systems across four different DNNs. Interestingly, the communication time of the RD algorithm, but not the Ring algorithm, decreases as the number of wavelengths increases in optical interconnects. These findings provide valuable insights into DNN training optimization across various interconnect systems and lay the groundwork for future related research.
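To make the Ring-versus-RD comparison concrete, the sketch below estimates communication cost under the standard alpha-beta (latency-bandwidth) model commonly used to analyze all-reduce algorithms. This is an illustrative approximation, not the cost model formulated in the paper: the function names (ring_allreduce_cost, rd_allreduce_cost) and the alpha/beta values are hypothetical, and the paper's model additionally accounts for optical versus electrical interconnect characteristics such as wavelength count.

```python
# Illustrative sketch (not the paper's model): communication cost of Ring vs.
# Recursive Doubling (RD) all-reduce under the standard alpha-beta model,
# where alpha is the per-message latency (s) and beta is the per-byte transfer
# time (s/B). The formulas follow the widely used textbook cost analysis.

import math


def ring_allreduce_cost(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Reduce-scatter + all-gather ring: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * alpha + 2 * ((p - 1) / p) * n_bytes * beta


def rd_allreduce_cost(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Recursive doubling: log2(p) steps, each exchanging the full n bytes."""
    steps = math.log2(p)
    return steps * alpha + steps * n_bytes * beta


if __name__ == "__main__":
    p = 1024                          # node count, as in the paper's experiments
    alpha = 5e-6                      # hypothetical 5 us per-message latency
    beta = 1 / 12.5e9                 # hypothetical 100 Gb/s link bandwidth
    for mb in (1, 64, 1024):
        n = mb * 2**20
        print(f"{mb:>5} MB  ring = {ring_allreduce_cost(p, n, alpha, beta):.4f} s  "
              f"rd = {rd_allreduce_cost(p, n, alpha, beta):.4f} s")
```

Under these assumed parameters, RD's log2(p) latency steps win for small messages, while Ring's near-optimal per-node bandwidth usage wins once the message size grows, which is qualitatively consistent with the crossover behavior reported in the abstract.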
Acknowledgements
We thank the reviewers for the time and effort they devoted to reviewing this manuscript. We also acknowledge the use of New Zealand eScience Infrastructure (NeSI) high-performance computing facilities as part of this research (Project code: uoo03633).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Dai, F., Chen, Y., Huang, Z., Zhang, H., Tian, H. (2024). Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14487. Springer, Singapore. https://doi.org/10.1007/978-981-97-0834-5_23
DOI: https://doi.org/10.1007/978-981-97-0834-5_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0833-8
Online ISBN: 978-981-97-0834-5
eBook Packages: Computer Science, Computer Science (R0)