
Research article

Merak: An Efficient Distributed DNN Training Framework With Automated 3D Parallelism for Giant Foundation Models

Published: 01 May 2023

Abstract

Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the pretraining process is extremely memory- and communication-intensive. These challenges make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. However, current 3D parallelism frameworks still encounter two issues: i) they are not transparent to model developers, requiring manual model modification to parallelize training, and ii) their utilization of computation resources, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys 3D parallelism automatically with a model partitioner that combines a graph-sharding algorithm with a proxy-node-based model graph. Merak also offers a non-intrusive API that scales out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine that employs several techniques to exploit the available training resources: a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak speeds up the training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42×, 1.39×, 1.43×, and 1.61×, respectively, over state-of-the-art 3D parallelism frameworks.
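To make the "non-intrusive API" and "minimal code modification" claims concrete, the sketch below shows what 3D-parallel training of a GPT-2-style model could look like under such a framework. The `merak_like` module, `init_parallelism`, and `AutoParallelTrainer` names are hypothetical placeholders used only for illustration and are not Merak's documented interface; only the Hugging Face `transformers` model construction is real, and it is unchanged from single-GPU code.

```python
# Hedged sketch: "merak_like" and its functions are hypothetical placeholders
# illustrating a non-intrusive 3D-parallelism API, NOT Merak's actual interface.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

import merak_like  # hypothetical automated 3D-parallelism framework

# Parallelism degrees: data x tensor x pipeline = 4 x 2 x 8 = 64 GPUs.
merak_like.init_parallelism(dp=4, tp=2, pp=8)

# The model is written exactly as for single-GPU training; no manual layer
# splitting, device placement, or communication calls are required.
config = GPT2Config(n_layer=48, n_embd=1600, n_head=25)  # ~1.5B-parameter GPT-2 XL
model = GPT2LMHeadModel(config)

# Toy dataset of random token ids, just to keep the sketch self-contained.
train_data = [{"input_ids": torch.randint(0, config.vocab_size, (1024,)),
               "labels": torch.randint(0, config.vocab_size, (1024,))}
              for _ in range(64)]

# A Trainer-style wrapper would trace the model graph, shard it across pipeline
# stages and tensor-parallel ranks, and run the 3D-parallel training schedule
# (pipeline scheduling, recomputation, and overlapped communication) internally.
trainer = merak_like.AutoParallelTrainer(
    model=model,
    train_data=train_data,
    per_device_micro_batch_size=4,
)
trainer.train()
```

The point of the sketch is the shape of the workflow: the parallelism degrees are declared once, the model is defined as ordinary single-device code, and the framework's partitioner and runtime engine are assumed to handle sharding and scheduling transparently.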

References

[1]
R. Bommasaniet al., “On the opportunities and risks of foundation models,” 2021,.
[2]
A. Vaswaniet al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., Curran Associates, Inc., vol. 30, 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[3]
T. Brownet al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[4]
S. Smithet al., “Using deepspeed and megatron to train megatron-turing NLG 530B, a large-scale generative language model,” 2022,.
[5]
W. Zenget al., “Pangu-: Large-scale autoregressive pretrained chinese language models with auto-parallel computation,” 2021,.
[6]
Microsoft, “DeepSpeed: Extreme-scale model training for everyone,” 2020. [Online]. Available: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
[7]
D. Narayananet al., “Efficient large-scale language model training on GPU clusters using megatron-LM,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–15.
[8]
Z. Bianet al., “Colossal-AI: A unified deep learning system for large-scale parallel training,” 2021,.
[9]
S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: Scalable, low-cost training of massive deep learning models,” in Proc. 17th Eur. Conf. Comput. Syst., 2022, pp. 472–487.
[10]
C. Karakuset al., “Amazon sagemaker model parallelism: A general and flexible framework for large model training,” 2021,.
[11]
J. Yuanet al., “Oneflow: Redesign the distributed deep learning framework from scratch,” 2022,.
[12]
Y. Aoet al., “End-to-end adaptive distributed training on paddlepaddle,” 2021,.
[13]
A. Paszkeet al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[14]
T. Wolfet al., “Transformers: State-of-the-art natural language processing,” Association for Computational Linguistics, 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[15]
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. New York, NY, USA: Association for Computing Machinery, 2020, pp. 3505–3506.
[16]
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” 2019,.
[17]
A. Radfordet al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, 2019, Art. no.
[18]
C. Raffelet al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” 2020,.
[19]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018,.
[20]
X. Jiaet al., “Whale: Efficient giant model training over heterogeneous GPUs,” in Proc. USENIX Annu. Tech. Conf., Carlsbad, CA: USENIX Association, 2022, pp. 673–688. [Online]. Available: https://www.usenix.org/conference/atc22/presentation/jia-xianyan
[21]
A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep learning in tensorflow,” 2018,.
[22]
S. Liet al., “Pytorch distributed: Experiences on accelerating data parallel training,” Proc. VLDB Endow, vol. 13, no. 12, pp. 3005–3018, Aug. 2020.
[23]
S. Ganet al., “BAGUA: Scaling up distributed learning with system relaxations,” Proc. VLDB Endowment, vol. 15, no. 4, pp. 804–813, 2021.
[24]
P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, no. 2, pp. 117–124, 2009.
[25]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2020, pp. 1–16.
[26]
J. Renet al., “ZeRO-Offload: Democratizing billion-scale model training,” in Proc. USENIX Annu. Tech. Conf., 2021, pp. 551–564. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/ren-jie
[27]
S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the GPU memory wall for extreme scale deep learning,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–14.
[28]
J. Fang, Y. Yu, Z. Zhu, S. Li, Y. You, and J. Zhou, “PatrickStar: Parallel training of pre-trained models via chunk-based memory management,” 2021,.
[29]
D. Narayananet al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Operating Syst. Princ., 2019, pp. 1–15.
[30]
D. Narayanan, A. Phanishayee, K. Shi, X. Chen, and M. Zaharia, “Memory-efficient pipeline-parallel DNN training,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 7937–7947.
[31]
J. H. Parket al., “HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism,” in Proc. USENIX Annu. Techn. Conf., 2020, pp. 307–321. [Online]. Available: https://www.usenix.org/conference/atc20/presentation/park
[32]
S. Eliad, I. Hakimi, A. De Jagger, M. Silberstein, and A. Schuster, “Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism,” in Proc. USENIX Annu. Tech. Conf., 2021, pp. 381–396.
[33]
A. Kosson, V. Chiley, A. Venigalla, J. Hestness, and U. Koster, “Pipelined backpropagation at scale: Training large models without batches,” in Proc. Mach. Learn. Syst. Conf., 2021, pp. 479–501.
[34]
B. Yang, J. Zhang, J. Li, C. Ré, C. Aberger, and C. De Sa, “PipeMare: Asynchronous pipeline parallel DNN training,” in Proc. Mach. Learn. Syst. Conf., 2021, pp. 269–296.
[35]
Y. Huanget al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., vol. 32, 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
[36]
J. Zhan and J. Zhang, “Pipe-torch: Pipeline-based distributed deep learning in a GPU cluster with heterogeneous networking,” in Proc. 7th Int. Conf. Adv. Cloud Big Data, 2019, pp. 55–60.
[37]
S. Fanet al., “DAPPLE: A pipelined data parallel approach for training large models,” in Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2021, pp. 431–445.
[38]
X. Yeet al., “Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large DNN training,” in Proc. 50th Int. Conf. Parallel Process., 2021, pp. 1–10.
[39]
S. Li and T. Hoefler, “Chimera: Efficiently training large-scale neural networks with bidirectional pipelines,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–14.
[40]
Q. Xu, S. Li, C. Gong, and Y. You, “An efficient 2D method for training super-large deep learning models,” 2021,.
[41]
B. Wang, Q. Xu, Z. Bian, and Y. You, “2.5-dimensional distributed model training,” 2021,.
[42]
Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks, 2021,.
[43]
T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” 2016,.
[44]
M. Kirisameet al., “Dynamic tensor rematerialization,” 2020,.
[45]
P. Jainet al., “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” in Proc. Conf. Mach. Learn. Syst., 2020, pp. 497–511.
[46]
P. Lianget al., “A survey on auto-parallelism of neural networks training,” Apr. 2022.
[47]
J. Reed, Z. DeVito, H. He, A. Ussery, and J. Ansel, “Torch.fx: Practical program capture and transformation for deep learning in python,” in Proc. Conf. Mach. Learn. Syst., 2022, pp. 638–651.
[48]
A. Dosovitskiyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020,.
[49]
Z. Liuet al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9992–10002.
[50]
W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li, “AutoPipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing,” in Proc. IEEE Int. Conf. Cluster Comput., 2022, pp. 301–312.
[51]
“NVIDIA collective communications library (NCCL),”2019. [Online]. Available: https://developer.nvidia.com/nccl
[52]
A. Gokaslan and V. Cohen, “Openwebtext corpus,” 2019. [Online]. Available: http://Skylion007.github.io/OpenWebTextCorpus

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 5
May 2023, 321 pages

Publisher

IEEE Press

Publication History

Published: 01 May 2023

Qualifiers

  • Research-article

Cited By

  • "Mbapp: Efficient Memory-Balanced Pipeline Parallelism for Large Model Fine-Tuning on Commodity GPU Servers," Proceedings of the 5th International Conference on Computer Information and Big Data Applications, pp. 7–11, Apr. 2024. DOI: 10.1145/3671151.3671153
  • "Proactive Caching With Distributed Deep Reinforcement Learning in 6G Cloud-Edge Collaboration Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 8, pp. 1387–1399, May 2024. DOI: 10.1109/TPDS.2024.3406027
  • "A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training," IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 4, pp. 577–591, Apr. 2024. DOI: 10.1109/TPDS.2023.3343570
  • "Distributed Foundation Models for Multi-Modal Learning in 6G Wireless Networks," IEEE Wireless Communications, vol. 31, no. 3, pp. 20–30, Jun. 2024. DOI: 10.1109/MWC.009.2300501
  • "Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview," Journal of Computer Science and Technology, vol. 39, no. 3, pp. 567–584, May 2024. DOI: 10.1007/s11390-024-3872-3
  • "Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates," Proceedings of the 29th Symposium on Operating Systems Principles, pp. 382–395, Oct. 2023. DOI: 10.1145/3600006.3613152
