Themis: a network bandwidth-aware collective scheduling policy for distributed training of DL models

Published: 11 June 2022

Abstract

Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, it adds communication overhead between the NPUs to synchronize the gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: keeping all network dimensions busy and maximizing network BW in such a hybrid environment when using today's collective communication scheduling techniques. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72× (2.70× max), and improves end-to-end training iteration performance for real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.
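
To make the scheduling idea concrete, the sketch below shows a greedy, bandwidth-aware chunk scheduler in the spirit of the abstract: each chunk of a collective is assigned to whichever network dimension would finish it earliest, so faster dimensions absorb proportionally more traffic and no dimension sits idle. The names (NetworkDim, schedule_chunks) and the specific greedy earliest-finish rule are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal, illustrative sketch of bandwidth-aware chunk scheduling.
# NetworkDim, schedule_chunks, and the greedy earliest-finish rule are
# assumptions for illustration, not the algorithm from the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class NetworkDim:
    name: str
    bandwidth_gbps: float         # peak bandwidth of this network dimension
    scheduled_bytes: float = 0.0  # bytes already assigned to this dimension

    def seconds_for(self, nbytes: float) -> float:
        """Time to move nbytes over this dimension at peak bandwidth."""
        return nbytes / (self.bandwidth_gbps * 1e9 / 8)

    def busy_until(self) -> float:
        """Projected time this dimension stays busy with its current load."""
        return self.seconds_for(self.scheduled_bytes)


def schedule_chunks(chunk_bytes: List[float], dims: List[NetworkDim]) -> List[str]:
    """Greedily assign each collective chunk to the dimension that would
    finish it earliest, balancing load so all dimensions stay busy."""
    assignment = []
    for size in chunk_bytes:
        best = min(dims, key=lambda d: d.busy_until() + d.seconds_for(size))
        best.scheduled_bytes += size
        assignment.append(best.name)
    return assignment


if __name__ == "__main__":
    # Hypothetical 2D platform: a fast intra-package dimension and a slower scale-out dimension.
    dims = [NetworkDim("intra-package", bandwidth_gbps=400.0),
            NetworkDim("scale-out", bandwidth_gbps=100.0)]
    chunks = [64e6] * 10  # ten 64 MB chunks of an All-Reduce
    print(schedule_chunks(chunks, dims))
```

With these assumed bandwidths, roughly four out of every five chunks land on the faster dimension, which is the load balance that lets both dimensions finish at about the same time.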

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022, 1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. bandwidth-aware communication scheduling
2. collective communication
3. distributed training

Acceptance Rates

ISCA '22: 67 of 400 submissions accepted, 17%
Overall: 543 of 3,203 submissions accepted, 17%
