
Lotus: A New Topology for Large-scale Distributed Machine Learning

Published: 17 September 2020

Abstract

Machine learning is at the heart of many services provided by data centers. To improve its performance, several parameter (gradient) synchronization methods have been proposed in the literature. These synchronization algorithms have different communication characteristics and accordingly place different demands on the network architecture, demands that traditional data-center networks cannot easily meet. We therefore analyze the communication profiles of several common synchronization algorithms and propose a machine-learning-oriented network architecture that matches their characteristics. The proposed design, named Lotus for its resemblance to a lotus flower, is a hybrid optical/electrical architecture based on arrayed waveguide grating routers (AWGRs). In Lotus, a complete bipartite graph is used within each group to improve bisection bandwidth and scalability. Each pair of groups is connected by an optical link, and AWGRs between adjacent groups enhance path diversity and network reliability. We also present an efficient routing algorithm that fully exploits the path diversity of Lotus, further increasing network performance. Simulation results show that Lotus outperforms Dragonfly and 3D-Torus under realistic traffic patterns for different synchronization algorithms.
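The topology sketched in the abstract (a complete bipartite graph of switches within each group, plus one optical link between every pair of groups) can be illustrated as a small graph-construction exercise. This is a hedged sketch based only on the abstract's wording: the group count, layer sizes, and the choice of which switch carries the inter-group link are illustrative assumptions, not details from the paper, and the AWGR wavelength routing itself is not modeled.

```python
# Illustrative sketch of a Lotus-like topology, assuming:
# - each group has two switch layers forming a complete bipartite graph K_{a,b}
# - every pair of groups is joined by exactly one (optical) link
# Nodes are (group, layer, index) tuples; edges are unordered pairs stored once.
from itertools import combinations

def build_lotus_like(num_groups=4, layer_a=3, layer_b=3):
    """Return the edge set of the sketched topology."""
    edges = set()
    # Intra-group: complete bipartite graph between the two switch layers.
    for g in range(num_groups):
        for i in range(layer_a):
            for j in range(layer_b):
                edges.add(((g, "A", i), (g, "B", j)))
    # Inter-group: one link per pair of groups (carried by an AWGR in the
    # real design); attaching it to the first B-layer switch is an assumption.
    for g1, g2 in combinations(range(num_groups), 2):
        edges.add(((g1, "B", 0), (g2, "B", 0)))
    return edges

edges = build_lotus_like()
intra = sum(1 for u, v in edges if u[0] == v[0])
inter = len(edges) - intra
# 4 groups x (3*3) bipartite links = 36 intra-group edges;
# C(4,2) = 6 inter-group optical links.
print(intra, inter)  # -> 36 6
```

The edge counts make the abstract's scaling argument concrete: intra-group links grow with group size (improving bisection bandwidth inside a group), while inter-group optical links grow quadratically in the number of groups, which is what the AWGR-based interconnect is meant to supply.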




Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 17, Issue 1
January 2021, 232 pages
ISSN: 1550-4832
EISSN: 1550-4840
DOI: 10.1145/3425108
Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2020
Accepted: 01 August 2020
Revised: 01 June 2020
Received: 01 March 2020
Published in JETC Volume 17, Issue 1


Author Tags

  1. Optical interconnects
  2. machine learning
  3. routing algorithm
  4. topology

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Shaanxi Province for Distinguished Young Scholars
  • National Key R&D Program of China
  • The Youth Innovation Team of Shaanxi Universities
  • Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing
  • Fundamental Research Funds for the Central Universities


Cited By

  • Efficient neural network accelerators with optical computing and communication. Computer Science and Information Systems 20:1 (513-535), 2023. DOI: 10.2298/CSIS220131066X
  • Modoru: Clos nanosecond optical switching for distributed deep training [Invited]. Journal of Optical Communications and Networking 16:1 (A40), 13 Dec. 2023. DOI: 10.1364/JOCN.499303
  • OSDL. Computer Networks: The International Journal of Computer and Telecommunications Networking 214:C, 4 Sep. 2022. DOI: 10.1016/j.comnet.2022.109191
  • CEFS. Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies (136-148), 23 Nov. 2020. DOI: 10.1145/3386367.3431307
