
Lotus: A New Topology for Large-scale Distributed Machine Learning

Published: 17 September 2020

Abstract

Machine learning is at the heart of many services provided by data centers. To improve its performance, several parameter (gradient) synchronization methods have been proposed in the literature. These synchronization algorithms have different communication characteristics and accordingly place different demands on the network architecture, demands that traditional data-center networks cannot easily meet. We therefore analyze the communication profiles of several common synchronization algorithms and propose a machine-learning-oriented network architecture that matches their characteristics. The proposed design, named Lotus for its resemblance to a lotus flower, is a hybrid optical/electrical architecture based on arrayed waveguide grating routers (AWGRs). In Lotus, a complete bipartite graph is used within each group to improve bisection bandwidth and scalability. Each pair of groups is connected by an optical link, and AWGRs between adjacent groups enhance path diversity and network reliability. We also present an efficient routing algorithm that fully exploits the path diversity of Lotus, further increasing network performance. Simulation results show that Lotus outperforms Dragonfly and 3D-Torus under realistic traffic patterns for different synchronization algorithms.
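The topology sketched in the abstract (a complete bipartite graph of switches within each group, plus one optical link between every pair of groups) can be illustrated as a small graph-construction exercise. This is a hedged sketch based only on the abstract's wording: the group count, layer sizes, and the choice of which switch carries the inter-group link are illustrative assumptions, not details from the paper, and the AWGR wavelength routing itself is not modeled.

```python
# Illustrative sketch of a Lotus-like topology, assuming:
# - each group has two switch layers forming a complete bipartite graph K_{a,b}
# - every pair of groups is joined by exactly one (optical) link
# Nodes are (group, layer, index) tuples; edges are unordered pairs stored once.
from itertools import combinations

def build_lotus_like(num_groups=4, layer_a=3, layer_b=3):
    """Return the edge set of the sketched topology."""
    edges = set()
    # Intra-group: complete bipartite graph between the two switch layers.
    for g in range(num_groups):
        for i in range(layer_a):
            for j in range(layer_b):
                edges.add(((g, "A", i), (g, "B", j)))
    # Inter-group: one link per pair of groups (carried by an AWGR in the
    # real design); attaching it to the first B-layer switch is an assumption.
    for g1, g2 in combinations(range(num_groups), 2):
        edges.add(((g1, "B", 0), (g2, "B", 0)))
    return edges

edges = build_lotus_like()
intra = sum(1 for u, v in edges if u[0] == v[0])
inter = len(edges) - intra
# 4 groups x (3*3) bipartite links = 36 intra-group edges;
# C(4,2) = 6 inter-group optical links.
print(intra, inter)  # -> 36 6
```

The edge counts make the abstract's scaling argument concrete: intra-group links grow with group size (improving bisection bandwidth inside a group), while inter-group optical links grow quadratically in the number of groups, which is what the AWGR-based interconnect is meant to supply.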




Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 17, Issue 1
January 2021, 232 pages
ISSN: 1550-4832
EISSN: 1550-4840
DOI: 10.1145/3425108
Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2020
Accepted: 01 August 2020
Revised: 01 June 2020
Received: 01 March 2020
Published in JETC Volume 17, Issue 1


Author Tags

  1. Optical interconnects
  2. machine learning
  3. routing algorithm
  4. topology

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Shaanxi Province for Distinguished Young Scholars
  • National Key R&D Program of China
  • The Youth Innovation Team of Shaanxi Universities
  • Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing
  • Fundamental Research Funds for the Central Universities


Cited By

  • Efficient neural network accelerators with optical computing and communication. Computer Science and Information Systems 20:1 (513-535), 2023. DOI: 10.2298/CSIS220131066X
  • Modoru: Clos nanosecond optical switching for distributed deep training [Invited]. Journal of Optical Communications and Networking 16:1 (A40), 13 Dec. 2023. DOI: 10.1364/JOCN.499303
  • OSDL. Computer Networks: The International Journal of Computer and Telecommunications Networking 214:C, 4 Sep. 2022. DOI: 10.1016/j.comnet.2022.109191
  • CEFS. Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies (136-148), 23 Nov. 2020. DOI: 10.1145/3386367.3431307
