- research-article, May 2024
A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 8, Pages 1415–1428. https://doi.org/10.1109/TPDS.2024.3406420
Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. With the increasing parameter ...
- research-article, July 2024
Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview
Journal of Computer Science and Technology (JCST), Volume 39, Issue 3, Pages 567–584. https://doi.org/10.1007/s11390-024-3872-3
Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in production and daily life. However, as the complexity of problem-solving increases, deep learning models become increasingly ...
- research-article, May 2023
Merak: An Efficient Distributed DNN Training Framework With Automated 3D Parallelism for Giant Foundation Models
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 34, Issue 5, Pages 1466–1478. https://doi.org/10.1109/TPDS.2023.3247001
Foundation models are in the process of becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the ...
- research-article, April 2023
Compressed Collective Sparse-Sketch for Distributed Data-Parallel Training of Deep Learning Models
IEEE Journal on Selected Areas in Communications (JSAC), Volume 41, Issue 4, Pages 941–963. https://doi.org/10.1109/JSAC.2023.3242733
Distributed data-parallel training (DDP) is prevalent in large-scale deep learning. To increase the training throughput and scalability, high-performance collective communication methods such as AllReduce have recently proliferated for DDP use. However, ...