
Research article

Merak: An Efficient Distributed DNN Training Framework With Automated 3D Parallelism for Giant Foundation Models

Published: 01 May 2023

Abstract

Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the pretraining process is extremely memory- and communication-intensive. These challenges make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. However, current 3D parallelism frameworks still encounter two issues: i) they are not transparent to model developers, requiring manual model modification to parallelize training, and ii) their utilization of computation resources, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys 3D parallelism automatically with a model partitioner that combines a graph-sharding algorithm with a proxy-node-based model graph. Merak also offers a non-intrusive API that scales out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine that employs several techniques to exploit the available training resources: a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak speeds up the training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42×, 1.39×, 1.43×, and 1.61×, respectively, over state-of-the-art 3D parallelism frameworks.
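To make the "non-intrusive API" and "minimal code modification" claims concrete, the sketch below shows what 3D-parallel training of a GPT-2-style model could look like under such a framework. The `merak_like` module, `init_parallelism`, and `AutoParallelTrainer` names are hypothetical placeholders used only for illustration and are not Merak's documented interface; only the Hugging Face `transformers` model construction is real, and it is unchanged from single-GPU code.

```python
# Hedged sketch: "merak_like" and its functions are hypothetical placeholders
# illustrating a non-intrusive 3D-parallelism API, NOT Merak's actual interface.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

import merak_like  # hypothetical automated 3D-parallelism framework

# Parallelism degrees: data x tensor x pipeline = 4 x 2 x 8 = 64 GPUs.
merak_like.init_parallelism(dp=4, tp=2, pp=8)

# The model is written exactly as for single-GPU training; no manual layer
# splitting, device placement, or communication calls are required.
config = GPT2Config(n_layer=48, n_embd=1600, n_head=25)  # ~1.5B-parameter GPT-2 XL
model = GPT2LMHeadModel(config)

# Toy dataset of random token ids, just to keep the sketch self-contained.
train_data = [{"input_ids": torch.randint(0, config.vocab_size, (1024,)),
               "labels": torch.randint(0, config.vocab_size, (1024,))}
              for _ in range(64)]

# A Trainer-style wrapper would trace the model graph, shard it across pipeline
# stages and tensor-parallel ranks, and run the 3D-parallel training schedule
# (pipeline scheduling, recomputation, and overlapped communication) internally.
trainer = merak_like.AutoParallelTrainer(
    model=model,
    train_data=train_data,
    per_device_micro_batch_size=4,
)
trainer.train()
```

The point of the sketch is the shape of the workflow: the parallelism degrees are declared once, the model is defined as ordinary single-device code, and the framework's partitioner and runtime engine are assumed to handle sharding and scheduling transparently.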

References

[1]
R. Bommasaniet al., “On the opportunities and risks of foundation models,” 2021,.
[2]
A. Vaswaniet al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., Curran Associates, Inc., vol. 30, 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[3]
T. Brownet al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901.
[4]
S. Smithet al., “Using deepspeed and megatron to train megatron-turing NLG 530B, a large-scale generative language model,” 2022,.
[5]
W. Zenget al., “Pangu-: Large-scale autoregressive pretrained chinese language models with auto-parallel computation,” 2021,.
[6]
Microsoft, “DeepSpeed: Extreme-scale model training for everyone,” 2020. [Online]. Available: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
[7]
D. Narayananet al., “Efficient large-scale language model training on GPU clusters using megatron-LM,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–15.
[8]
Z. Bianet al., “Colossal-AI: A unified deep learning system for large-scale parallel training,” 2021,.
[9]
S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: Scalable, low-cost training of massive deep learning models,” in Proc. 17th Eur. Conf. Comput. Syst., 2022, pp. 472–487.
[10]
C. Karakuset al., “Amazon sagemaker model parallelism: A general and flexible framework for large model training,” 2021,.
[11]
J. Yuanet al., “Oneflow: Redesign the distributed deep learning framework from scratch,” 2022,.
[12]
Y. Aoet al., “End-to-end adaptive distributed training on paddlepaddle,” 2021,.
[13]
A. Paszkeet al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[14]
T. Wolfet al., “Transformers: State-of-the-art natural language processing,” Association for Computational Linguistics, 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[15]
J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. New York, NY, USA: Association for Computing Machinery, 2020, pp. 3505–3506.
[16]
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” 2019,.
[17]
A. Radfordet al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, 2019, Art. no.
[18]
C. Raffelet al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” 2020,.
[19]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018,.
[20]
X. Jiaet al., “Whale: Efficient giant model training over heterogeneous GPUs,” in Proc. USENIX Annu. Tech. Conf., Carlsbad, CA: USENIX Association, 2022, pp. 673–688. [Online]. Available: https://www.usenix.org/conference/atc22/presentation/jia-xianyan
[21]
A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep learning in tensorflow,” 2018,.
[22]
S. Liet al., “Pytorch distributed: Experiences on accelerating data parallel training,” Proc. VLDB Endow, vol. 13, no. 12, pp. 3005–3018, Aug. 2020.
[23]
S. Ganet al., “BAGUA: Scaling up distributed learning with system relaxations,” Proc. VLDB Endowment, vol. 15, no. 4, pp. 804–813, 2021.
[24]
P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, no. 2, pp. 117–124, 2009.
[25]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2020, pp. 1–16.
[26]
J. Renet al., “ZeRO-Offload: Democratizing billion-scale model training,” in Proc. USENIX Annu. Tech. Conf., 2021, pp. 551–564. [Online]. Available: https://www.usenix.org/conference/atc21/presentation/ren-jie
[27]
S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the GPU memory wall for extreme scale deep learning,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–14.
[28]
J. Fang, Y. Yu, Z. Zhu, S. Li, Y. You, and J. Zhou, “PatrickStar: Parallel training of pre-trained models via chunk-based memory management,” 2021,.
[29]
D. Narayananet al., “PipeDream: Generalized pipeline parallelism for DNN training,” in Proc. 27th ACM Symp. Operating Syst. Princ., 2019, pp. 1–15.
[30]
D. Narayanan, A. Phanishayee, K. Shi, X. Chen, and M. Zaharia, “Memory-efficient pipeline-parallel DNN training,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 7937–7947.
[31]
J. H. Parket al., “HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism,” in Proc. USENIX Annu. Techn. Conf., 2020, pp. 307–321. [Online]. Available: https://www.usenix.org/conference/atc20/presentation/park
[32]
S. Eliad, I. Hakimi, A. De Jagger, M. Silberstein, and A. Schuster, “Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism,” in Proc. USENIX Annu. Tech. Conf., 2021, pp. 381–396.
[33]
A. Kosson, V. Chiley, A. Venigalla, J. Hestness, and U. Koster, “Pipelined backpropagation at scale: Training large models without batches,” in Proc. Mach. Learn. Syst. Conf., 2021, pp. 479–501.
[34]
B. Yang, J. Zhang, J. Li, C. Ré, C. Aberger, and C. De Sa, “PipeMare: Asynchronous pipeline parallel DNN training,” in Proc. Mach. Learn. Syst. Conf., 2021, pp. 269–296.
[35]
Y. Huanget al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., vol. 32, 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
[36]
J. Zhan and J. Zhang, “Pipe-torch: Pipeline-based distributed deep learning in a GPU cluster with heterogeneous networking,” in Proc. 7th Int. Conf. Adv. Cloud Big Data, 2019, pp. 55–60.
[37]
S. Fanet al., “DAPPLE: A pipelined data parallel approach for training large models,” in Proc. 26th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2021, pp. 431–445.
[38]
X. Yeet al., “Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large DNN training,” in Proc. 50th Int. Conf. Parallel Process., 2021, pp. 1–10.
[39]
S. Li and T. Hoefler, “Chimera: Efficiently training large-scale neural networks with bidirectional pipelines,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2021, pp. 1–14.
[40]
Q. Xu, S. Li, C. Gong, and Y. You, “An efficient 2D method for training super-large deep learning models,” 2021,.
[41]
B. Wang, Q. Xu, Z. Bian, and Y. You, “2.5-dimensional distributed model training,” 2021,.
[42]
Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks, 2021,.
[43]
T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” 2016,.
[44]
M. Kirisameet al., “Dynamic tensor rematerialization,” 2020,.
[45]
P. Jainet al., “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” in Proc. Conf. Mach. Learn. Syst., 2020, pp. 497–511.
[46]
P. Lianget al., “A survey on auto-parallelism of neural networks training,” Apr. 2022.
[47]
J. Reed, Z. DeVito, H. He, A. Ussery, and J. Ansel, “Torch.fx: Practical program capture and transformation for deep learning in python,” in Proc. Conf. Mach. Learn. Syst., 2022, pp. 638–651.
[48]
A. Dosovitskiyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020,.
[49]
Z. Liuet al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 9992–10002.
[50]
W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li, “AutoPipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing,” in Proc. IEEE Int. Conf. Cluster Comput., 2022, pp. 301–312.
[51]
“NVIDIA collective communications library (NCCL),”2019. [Online]. Available: https://developer.nvidia.com/nccl
[52]
A. Gokaslan and V. Cohen, “Openwebtext corpus,” 2019. [Online]. Available: http://Skylion007.github.io/OpenWebTextCorpus

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 5
May 2023, 321 pages

Publisher

IEEE Press

Publication History

Published: 01 May 2023

Qualifiers

  • Research-article

Cited By

  • "Mbapp: Efficient Memory-Balanced Pipeline Parallelism for Large Model Fine-Tuning on Commodity GPU Servers," Proceedings of the 5th International Conference on Computer Information and Big Data Applications, pp. 7–11, Apr. 2024. DOI: 10.1145/3671151.3671153
  • "Proactive Caching With Distributed Deep Reinforcement Learning in 6G Cloud-Edge Collaboration Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 8, pp. 1387–1399, May 2024. DOI: 10.1109/TPDS.2024.3406027
  • "A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training," IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 4, pp. 577–591, Apr. 2024. DOI: 10.1109/TPDS.2023.3343570
  • "Distributed Foundation Models for Multi-Modal Learning in 6G Wireless Networks," IEEE Wireless Communications, vol. 31, no. 3, pp. 20–30, Jun. 2024. DOI: 10.1109/MWC.009.2300501
  • "Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview," Journal of Computer Science and Technology, vol. 39, no. 3, pp. 567–584, May 2024. DOI: 10.1007/s11390-024-3872-3
  • "Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates," Proceedings of the 29th Symposium on Operating Systems Principles, pp. 382–395, Oct. 2023. DOI: 10.1145/3600006.3613152
