Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models

Published: 11 June 2022

Abstract

Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, it adds communication overhead between the NPUs to synchronize the gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: keeping all network dimensions busy and maximizing network BW in such a hybrid environment when using today's collective communication scheduling techniques. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72× (2.70× max), and improves end-to-end training iteration performance for real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.
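
To make the scheduling idea concrete, the sketch below shows a greedy, bandwidth-aware chunk scheduler in the spirit of the abstract: each chunk of a collective is assigned to whichever network dimension would finish it earliest, so faster dimensions absorb proportionally more traffic and no dimension sits idle. The names (NetworkDim, schedule_chunks) and the specific greedy earliest-finish rule are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal, illustrative sketch of bandwidth-aware chunk scheduling.
# NetworkDim, schedule_chunks, and the greedy earliest-finish rule are
# assumptions for illustration, not the algorithm from the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class NetworkDim:
    name: str
    bandwidth_gbps: float         # peak bandwidth of this network dimension
    scheduled_bytes: float = 0.0  # bytes already assigned to this dimension

    def seconds_for(self, nbytes: float) -> float:
        """Time to move nbytes over this dimension at peak bandwidth."""
        return nbytes / (self.bandwidth_gbps * 1e9 / 8)

    def busy_until(self) -> float:
        """Projected time this dimension stays busy with its current load."""
        return self.seconds_for(self.scheduled_bytes)


def schedule_chunks(chunk_bytes: List[float], dims: List[NetworkDim]) -> List[str]:
    """Greedily assign each collective chunk to the dimension that would
    finish it earliest, balancing load so all dimensions stay busy."""
    assignment = []
    for size in chunk_bytes:
        best = min(dims, key=lambda d: d.busy_until() + d.seconds_for(size))
        best.scheduled_bytes += size
        assignment.append(best.name)
    return assignment


if __name__ == "__main__":
    # Hypothetical 2D platform: a fast intra-package dimension and a slower scale-out dimension.
    dims = [NetworkDim("intra-package", bandwidth_gbps=400.0),
            NetworkDim("scale-out", bandwidth_gbps=100.0)]
    chunks = [64e6] * 10  # ten 64 MB chunks of an All-Reduce
    print(schedule_chunks(chunks, dims))
```

With these assumed bandwidths, roughly four out of every five chunks land on the faster dimension, which is the load balance that lets both dimensions finish at about the same time.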

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022, 1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. bandwidth-aware communication scheduling
2. collective communication
3. distributed training

Acceptance Rates

ISCA '22: 67 of 400 submissions accepted, 17%
Overall: 543 of 3,203 submissions accepted, 17%
