
DOI: 10.1145/3524059.3532394
research-article
Open access

MegTaiChi: dynamic tensor-based memory management optimization for DNN training

Published: 28 June 2022

Abstract

In real applications, it is common to train deep neural networks (DNNs) on modest clusters. With the continuous increase of model size and batch size, training DNNs becomes challenging under a restricted memory budget. Tensor partition and tensor rematerialization are two major memory optimization techniques that enable larger model and batch sizes within a limited memory constraint. However, existing algorithms fail to fully exploit the memory reduction opportunity, because they ignore the invariable characteristics of dynamic computational graphs and the variation among same-sized tensors at different memory locations. In this work, we propose MegTaiChi, a dynamic tensor-based memory management optimization module for DNN training, which is the first to achieve an efficient coordination of tensor partition and tensor rematerialization. The key feature of MegTaiChi is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular across training iterations. Based on the identified patterns, MegTaiChi exploits the full memory optimization space and achieves heuristic, adaptive, and fine-grained memory management. The experimental results show that MegTaiChi reduces the memory footprint by up to 11% for ResNet-50 and 10.5% for GL-base compared with DTR. For the training of 6 representative DNNs, MegTaiChi achieves 5X and 2.4X larger maximum batch sizes than MegEngine and Sublinear, respectively. Compared with FlexFlow, GShard and ZeRO-3, MegTaiChi achieves 1.2X, 1.8X and 1.5X performance speedups on average, respectively. For the million-scale face recognition application, MegTaiChi achieves a 1.8X speedup compared with the optimal empirical parallelism strategy on 256 GPUs.
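
To make the mechanism described above more concrete, the following is a minimal, self-contained Python sketch of one ingredient the abstract refers to: an eviction heuristic for tensor rematerialization driven by access information tracked at runtime, in the spirit of DTR [18]. All names (TrackedTensor, MemoryPool) and the exact scoring function are assumptions made for exposition; this is not MegTaiChi's actual implementation.

# Illustrative sketch only: a DTR-style eviction policy based on runtime
# access tracking. Names and the scoring function are assumed for exposition.
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TrackedTensor:
    name: str
    size_bytes: int
    compute_fn: Callable[[], bytes]      # how to (re)materialize the payload
    payload: Optional[bytes] = None      # None means the tensor is evicted
    compute_cost_s: float = 0.0          # measured (re)compute time
    last_access: float = field(default_factory=time.monotonic)

    def materialize(self) -> bytes:
        if self.payload is None:         # rematerialize on demand
            start = time.monotonic()
            self.payload = self.compute_fn()
            self.compute_cost_s = time.monotonic() - start
        self.last_access = time.monotonic()
        return self.payload


class MemoryPool:
    """Keeps resident tensors under a byte budget using a cost-density score."""

    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.tensors: List[TrackedTensor] = []

    def resident_bytes(self) -> int:
        return sum(t.size_bytes for t in self.tensors if t.payload is not None)

    def access(self, t: TrackedTensor) -> bytes:
        if t not in self.tensors:
            self.tensors.append(t)
        data = t.materialize()
        self._evict_until_within_budget(exclude=t)
        return data

    def _score(self, t: TrackedTensor) -> float:
        # Cheap-to-recompute, large, and stale tensors are evicted first.
        staleness = time.monotonic() - t.last_access + 1e-9
        return t.compute_cost_s / (t.size_bytes * staleness)

    def _evict_until_within_budget(self, exclude: TrackedTensor) -> None:
        while self.resident_bytes() > self.budget:
            candidates = [t for t in self.tensors
                          if t.payload is not None and t is not exclude]
            if not candidates:
                break
            victim = min(candidates, key=self._score)
            victim.payload = None        # evicted; rematerialized on next access

A production system would also record the producing operator of each tensor so that evicted inputs can be rematerialized recursively, and would make these decisions jointly with how tensors are partitioned across devices; the coordination of those two aspects is the contribution the abstract claims for MegTaiChi.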

References

[1]
2022. MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation. https://github.com/MegEngine/MegEngine.
[2]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah, GA, USA) (OSDI'16). USENIX Association, USA, 265--283.
[3]
Shriram B, Anshuj Garg, and Purushottam Kulkarni. 2019. Dynamic Memory Management for GPU-Based Training of Deep Neural Networks. 200--209.
[4]
Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. 2019. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367 (2019).
[5]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[6]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. CoRR abs/1604.06174 (2016). arXiv:1604.06174 http://arxiv.org/abs/1604.06174
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186.
[8]
Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Marc Snir, and Brian Van Essen. 2019. Channel and Filter Parallelism for Large-Scale CNN Training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '19). Association for Computing Machinery, New York, NY, USA, Article 10, 20 pages.
[9]
Amir Gholami, Ariful Azad, Kurt Keutzer, and Aydin Buluç. 2017. Integrated Model and Data Parallelism in Training Neural Networks. CoRR abs/1712.04432 (2017). arXiv:1712.04432 http://arxiv.org/abs/1712.04432
[10]
Albert Greenberg, James R Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A Maltz, Parveen Patel, and Sudipta Sengupta. 2009. VL2: A scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 conference on Data communication. 51--62.
[11]
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. 2019. Single Path One-Shot Neural Architecture Search with Uniform Sampling. CoRR abs/1904.00420 (2019). arXiv:1904.00420 http://arxiv.org/abs/1904.00420
[12]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
[13]
Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 1341--1355.
[14]
Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. In Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2. 497--511. https://proceedings.mlsys.org/paper/2020/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper.pdf
[15]
Xianyan Jia, Le Jiang, Ang Wang, Jie Zhang, Xinyuan Li, Wencong Xiao, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin. 2021. Whale: Scaling Deep Learning Model Training to the Trillions. arXiv:2011.09208 [cs.DC]
[16]
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. 2018. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205 (2018).
[17]
Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1. 1--13. https://proceedings.mlsys.org/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf
[18]
Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2021. Dynamic Tensor Rematerialization. arXiv:2006.09616 [cs.LG]
[19]
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668 [cs.CL]
[20]
Edmund B Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. 2012. Flat datacenter storage. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 1--15.
[21]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
[22]
Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-Based GPU Memory Management for Deep Learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 891--905.
[23]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505--3506.
[24]
M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1--13.
[25]
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/3a37abdeefe1dab1b30f7c5c7e581b93-Paper.pdf
[26]
Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, and Simon See. 2019. Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772 (2019).
[27]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs/1909.08053 (2019). arXiv:1909.08053 http://arxiv.org/abs/1909.08053
[28]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]
[29]
Mohamed Wahib, Haoyu Zhang, Truong Thao Nguyen, Aleksandr Drozd, Jens Domke, Lingqi Zhang, Ryousei Takano, and Satoshi Matsuoka. 2020. Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA. arXiv:2008.11421 [cs.DC]
[30]
Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU Memory Management for Training Deep Neural Networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Vienna, Austria) (PPoPP '18). Association for Computing Machinery, New York, NY, USA, 41--53.
[31]
Minjie Wang, Chien chin Huang, and Jinyang Li. 2018. Unifying Data, Model and Hybrid Parallelism in Deep Learning via Tensor Tiling. arXiv:1805.04170 [cs.DC]
[32]
Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting Very Large Models Using Automatic Dataflow Graph Partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019 (Dresden, Germany) (EuroSys '19). Association for Computing Machinery, New York, NY, USA, Article 26, 17 pages.
[33]
Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting Very Large Models using Automatic Dataflow Graph Partitioning. Proceedings of the Fourteenth EuroSys Conference 2019 (Mar 2019).
[34]
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CoRR abs/1707.01083 (2017). arXiv:1707.01083 http://arxiv.org/abs/1707.01083



      Published In

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022
514 pages
ISBN: 9781450392815
DOI: 10.1145/3524059
This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

1. dynamic dataflow graph
      2. tensor partition
      3. tensor rematerialization

      Conference

      ICS '22

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%

      Article Metrics

• Downloads (Last 12 months): 674
• Downloads (Last 6 weeks): 69
Reflects downloads up to 20 Nov 2024

Cited By

• (2024) Accelerating Native Inference Model Performance in Edge Devices using TensorRT. 2024 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 1-7. DOI: 10.1109/RAICS61201.2024.10690032. Online publication date: 16-May-2024.
• (2023) DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 164-178. DOI: 10.1145/3582016.3582044. Online publication date: 25-Mar-2023.
• (2023) STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training. IEEE Transactions on Parallel and Distributed Systems, 34(8), 2403-2418. DOI: 10.1109/TPDS.2023.3266110. Online publication date: 1-Aug-2023.
• (2023) Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 156-166. DOI: 10.1109/IPDPS54959.2023.00025. Online publication date: May-2023.
• (2022) XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments. ACM Transactions on Architecture and Code Optimization, 20(1), 1-25. DOI: 10.1145/3568956. Online publication date: 16-Dec-2022.
