
DOI: 10.1145/3497776.3517766
research-article
Open access

Training of deep learning pipelines on memory-constrained GPUs via segmented fused-tiled execution

Published: 18 March 2022

Abstract

Training models with massive inputs is a significant challenge in the development of Deep Learning pipelines that process very large digital image datasets, as required by Whole Slide Imaging (WSI) in computational pathology and by the analysis of brain fMRI images in computational neuroscience. Graphics Processing Units (GPUs) are the primary workhorse for training and inference of Deep Learning models. To run inference or training on a neural network pipeline, state-of-the-art machine learning frameworks like PyTorch and TensorFlow currently require that the collective memory of the GPUs be larger than the size of the activations at every stage in the pipeline. Existing Deep Learning pipelines for these use cases have therefore been forced to adopt sub-optimal "patch-based" modeling approaches, in which each image is processed in small patches. In this paper, we present a solution to this problem that employs tiling in conjunction with checkpointing, thereby enabling arbitrarily large images to be processed directly, irrespective of the size of global memory on a GPU and the number of available GPUs. Experimental results using PyTorch demonstrate enhanced functionality and performance over existing frameworks.
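The tiling-plus-checkpointing idea can be illustrated with a minimal PyTorch sketch. This is not the paper's segmented fused-tiled implementation: the TiledConvStage module, the two-layer convolutional stage, and the tile and halo sizes are illustrative assumptions, and use_reentrant=False assumes a reasonably recent PyTorch. Each tile, extended by a halo that covers the stage's receptive field, is pushed through a checkpointed stage, so only one tile's activations are resident at a time and they are recomputed during the backward pass; peak activation memory is then governed by the tile size rather than the full image size.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TiledConvStage(nn.Module):
    """Illustrative sketch: apply a small convolutional stage to a large image
    tile by tile, checkpointing each tile so its intermediate activations are
    freed after the forward pass and recomputed during backward."""

    def __init__(self, tile=512, halo=4):
        super().__init__()
        self.tile, self.halo = tile, halo
        # Toy stage; the halo must cover its receptive-field radius (here 2).
        self.stage = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        rows = []
        for i in range(0, h, self.tile):
            cols = []
            for j in range(0, w, self.tile):
                # Slice the tile plus a halo so border pixels see enough context.
                i0, i1 = max(i - self.halo, 0), min(i + self.tile + self.halo, h)
                j0, j1 = max(j - self.halo, 0), min(j + self.tile + self.halo, w)
                patch = x[:, :, i0:i1, j0:j1]
                # Checkpointing: only this tile's activations are live at a time;
                # they are recomputed when gradients are needed.
                out = checkpoint(self.stage, patch, use_reentrant=False)
                # Crop the halo off so adjacent tiles stitch together exactly.
                th, tw = min(self.tile, h - i), min(self.tile, w - j)
                out = out[:, :, i - i0:i - i0 + th, j - j0:j - j0 + tw]
                cols.append(out)
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TiledConvStage().to(device)
    # A 2048x2048 RGB image; the full activation map of the stage is never
    # materialized all at once during the forward pass.
    img = torch.randn(1, 3, 2048, 2048, device=device, requires_grad=True)
    out = model(img)
    out.mean().backward()  # gradients flow tile by tile via recomputation
    print(out.shape, img.grad.shape)
```

Because the halo (4 pixels) exceeds the stage's receptive-field radius (2 pixels for two 3x3 convolutions), the stitched output matches what a single full-image pass would produce; the paper's approach extends this idea across multi-layer pipeline segments with fusion.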


Cited By

  • (2024) BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 576–586. https://doi.org/10.1145/3673038.3673046. Online publication date: 12 Aug 2024.



Published In

CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
March 2022
253 pages
ISBN: 9781450391832
DOI: 10.1145/3497776
This work is licensed under a Creative Commons Attribution International 4.0 License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Checkpointing
  2. DNN
  3. Fusion
  4. GPU
  5. Large image training
  6. Memory-constrained execution
  7. Tiling

Qualifiers

  • Research-article


Conference

CC '22


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 227
  • Downloads (last 6 weeks): 17
Reflects downloads up to 24 Nov 2024

