Research article · DOI: 10.1145/3472456.3472497 · ICPP Conference Proceedings

Hippie: A Data-Paralleled Pipeline Approach to Improve Memory-Efficiency and Scalability for Large DNN Training

Published: 05 October 2021

Abstract

With the growth of both data and parameter volumes, efficiently training large-scale DNN models on distributed platforms has become a major challenge. Ordinary parallelism modes, i.e., data parallelism, model parallelism, and pipeline parallelism, can no longer scale large DNN model training efficiently across multiple nodes. Meanwhile, excessive memory consumption severely restricts GPU computing efficiency and training throughput. In this paper, we propose Hippie, a hybrid parallel training framework that integrates pipeline parallelism and data parallelism to improve the memory efficiency and scalability of large DNN training. Hippie adopts a hybrid parallel method based on hiding gradient communication, which improves training throughput and scalability. Hippie also introduces last-stage pipeline scheduling and recomputation for specific layers, which effectively reduce memory overhead and ease the training of large DNN models on memory-constrained devices. To evaluate the optimization effect more reasonably, we propose a memory efficiency (ME) metric that captures the tradeoff between throughput and memory overhead. We implement Hippie on top of PyTorch and NCCL. Experiments on various models show that Hippie achieves above 90% scaling efficiency on a 16-GPU platform. Moreover, Hippie increases throughput by up to 80% while saving 57% of memory overhead, achieving 4.18× memory efficiency.
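
The abstract names three mechanisms: hiding gradient communication behind backward computation, recomputation (activation checkpointing) for selected layers, and an ME metric that trades throughput off against memory. The sketch below shows how such mechanisms are commonly expressed in plain PyTorch; it is not Hippie's implementation, and the class and function names, the hook-based communication overlap, and the ME formula (throughput per GiB of peak memory) are illustrative assumptions only.

# Illustrative sketch only -- not Hippie's actual code. It shows, in plain
# PyTorch, the three ideas named in the abstract: activation recomputation for
# selected layers, gradient all-reduce overlapped with backward computation,
# and a simple memory-efficiency (ME) measure. All names and the ME formula
# are assumptions made for illustration.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class StageWithRecompute(nn.Module):
    """One pipeline stage; layers listed in `recompute_idx` drop their
    activations during the forward pass and recompute them during backward."""

    def __init__(self, layers, recompute_idx=()):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.recompute_idx = set(recompute_idx)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i in self.recompute_idx:
                x = checkpoint(layer, x)  # trade recompute time for memory
            else:
                x = layer(x)
        return x


def overlap_gradient_allreduce(module, group=None):
    """Register per-parameter hooks that launch an asynchronous all-reduce as
    soon as each gradient is produced, hiding communication behind the rest of
    the backward pass. Returns the pending communication handles; wait on them
    before the optimizer step."""
    handles = []

    def make_hook():
        def hook(grad):
            handles.append(dist.all_reduce(grad, group=group, async_op=True))
            return grad
        return hook

    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())
    return handles


def memory_efficiency(samples_per_second, peak_memory_bytes=None):
    """Assumed ME definition: throughput normalized by peak GPU memory in GiB."""
    if peak_memory_bytes is None:
        peak_memory_bytes = torch.cuda.max_memory_allocated()
    return samples_per_second / (peak_memory_bytes / 2**30)

In a full hybrid scheme along these lines, each pipeline stage would be replicated across its data-parallel group, and the asynchronous all-reduces issued by the hooks would overlap with the backward passes of the remaining micro-batches before the optimizer step waits on them.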

Information & Contributors

Information

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021

Author Tags

  1. deep learning
  2. distributed training
  3. hybrid parallelism
  4. memory efficiency

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Cited By

  • A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training. IEEE Transactions on Parallel and Distributed Systems 35(8):1415-1428, 2024. https://doi.org/10.1109/TPDS.2024.3406420
  • A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training. IEEE Transactions on Parallel and Distributed Systems 35(4):577-591, 2024. https://doi.org/10.1109/TPDS.2023.3343570
  • CSIMD: Cross-Search Algorithm with Improved Multi-dimensional Dichotomy for Micro-Batch-Based Pipeline Parallel Training in DNN. Euro-Par 2024: Parallel Processing, 288-301, 2024. https://doi.org/10.1007/978-3-031-69766-1_20
  • Comparative Study on Distributed Lightweight Deep Learning Models for Road Pothole Detection. Sensors 23(9):4347, 2023. https://doi.org/10.3390/s23094347
  • Parallel intelligent computing: development and challenges. SCIENTIA SINICA Informationis 53(8):1441, 2023. https://doi.org/10.1360/SSI-2023-0051
  • Merak: An Efficient Distributed DNN Training Framework With Automated 3D Parallelism for Giant Foundation Models. IEEE Transactions on Parallel and Distributed Systems 34(5):1466-1478, 2023. https://doi.org/10.1109/TPDS.2023.3247001
  • Hybrid Parallel Inference for Large Model on Heterogeneous Clusters for High Throughput. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 549-554, 2023. https://doi.org/10.1109/ICPADS60453.2023.00087
  • Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 82-94, 2023. https://doi.org/10.1109/CLUSTER52292.2023.00015
  • HPopt: A Hybrid Parallel Optimization Scheduling Approach for Distributed DNN Training. 2023 Eleventh International Conference on Advanced Cloud and Big Data (CBD), 229-234, 2023. https://doi.org/10.1109/CBD63341.2023.00048
  • HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training. 2022 IEEE International Conference on Cluster Computing (CLUSTER), 313-323, 2022. https://doi.org/10.1109/CLUSTER51413.2022.00043
