DOI: 10.5555/3014904.3014977

Optimizing memory efficiency for deep convolutional neural networks on GPUs

Published: 13 November 2016

Abstract

Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Because of their substantial compute and memory demands, however, they require significant execution time. The massively parallel computing capability of GPUs makes them one of the ideal platforms for accelerating CNNs, and a number of GPU-based CNN libraries have been developed. While existing work focuses mainly on the computational efficiency of CNNs, their memory efficiency has been largely overlooked. Yet CNNs have intricate data structures, and their memory behavior can have a significant impact on performance. In this work, we study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns. Experiments show the universal effect of our proposed optimizations on both single layers and whole networks, with speedups of up to 27.9x for a single layer and up to 5.6x for whole networks.
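The layout effect the abstract refers to can be made concrete with a minimal CUDA sketch. This is illustrative only, not code from the paper: the kernels, tensor sizes, and the use of the NCHW/CHWN layout names are assumptions. When threads are mapped along the batch dimension N, the CHWN layout places the elements touched by adjacent threads at adjacent addresses, so global-memory accesses coalesce; with NCHW and the same thread mapping, adjacent threads are C*H*W floats apart.

// A minimal sketch (not the paper's code): how tensor data layout changes
// global-memory coalescing for a simple per-element operation on a 4-D
// activation tensor. Layouts assumed: NCHW (batch outermost), CHWN (batch
// innermost). Threads are mapped along the batch dimension n in both kernels.
#include <cstdio>
#include <cuda_runtime.h>

// NCHW: element (n,c,h,w) sits at ((n*C + c)*H + h)*W + w, so adjacent
// threads (adjacent n) touch addresses C*H*W floats apart -> strided,
// uncoalesced accesses.
__global__ void scale_nchw(float* x, float a, int N, int C, int H, int W) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                x[((n * C + c) * H + h) * W + w] *= a;
}

// CHWN: element (n,c,h,w) sits at ((c*H + h)*W + w)*N + n, so adjacent
// threads touch adjacent addresses -> fully coalesced accesses.
__global__ void scale_chwn(float* x, float a, int N, int C, int H, int W) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                x[((c * H + h) * W + w) * N + n] *= a;
}

int main() {
    const int N = 128, C = 64, H = 32, W = 32;   // illustrative sizes
    size_t bytes = (size_t)N * C * H * W * sizeof(float);
    float* x;
    cudaMalloc(&x, bytes);
    cudaMemset(x, 0, bytes);
    dim3 block(128), grid((N + block.x - 1) / block.x);
    scale_nchw<<<grid, block>>>(x, 2.0f, N, C, H, W);  // strided traffic
    scale_chwn<<<grid, block>>>(x, 2.0f, N, C, H, W);  // coalesced traffic
    cudaDeviceSynchronize();
    cudaFree(x);
    printf("done\n");
    return 0;
}

Profiling the two kernels (for example with Nsight Compute) would typically show much higher achieved memory bandwidth for the CHWN version; this layout sensitivity across CNN layers and whole networks is what the paper studies and optimizes.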



Information

Published In

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN: 9781467388153
  • Conference Chair: John West


Publisher

IEEE Press


Author Tags

  1. GPU acceleration
  2. convolutional neural network
  3. data layout
  4. deep learning
  5. memory efficiency

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

SC '16 paper acceptance rate: 81 of 442 submissions (18%).
Overall acceptance rate: 1,516 of 6,373 submissions (24%).


Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Sep 2024


Cited By

  • (2024) Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 10.1145/3677333.3678153, pp. 58-67. Published online: 12 August 2024.
  • (2023) ALT: Breaking the Wall between Data Layout and Loop Optimizations for Deep Learning Compilation. Proceedings of the Eighteenth European Conference on Computer Systems, 10.1145/3552326.3587440, pp. 199-214. Published online: 8 May 2023.
  • (2022) A data-centric optimization framework for machine learning. Proceedings of the 36th ACM International Conference on Supercomputing, 10.1145/3524059.3532364, pp. 1-13. Published online: 28 June 2022.
  • (2021) Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators. ACM Transactions on Architecture and Code Optimization, 19(1), pp. 1-26, 10.1145/3485137. Published online: 6 December 2021.
  • (2021) Analytical characterization and design space exploration for optimization of CNNs. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 10.1145/3445814.3446759, pp. 928-942. Published online: 19 April 2021.
  • (2021) Optimized Deep Learning Object Recognition for Drones using Embedded GPU. 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 10.1109/ETFA45728.2021.9613590, pp. 1-7. Published online: 7 September 2021.
  • (2020) On the Limits of Parallelizing Convolutional Neural Networks on GPUs. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 10.1145/3350755.3400266, pp. 567-569. Published online: 6 July 2020.
  • (2019) TASO. Proceedings of the 27th ACM Symposium on Operating Systems Principles, 10.1145/3341301.3359630, pp. 47-62. Published online: 27 October 2019.
  • (2019) Demystifying Parallel and Distributed Deep Learning. ACM Computing Surveys, 52(4), pp. 1-43, 10.1145/3320060. Published online: 30 August 2019.
  • (2019) OpenCL vs. Proceedings of the International Workshop on OpenCL, 10.1145/3318170.3318172, pp. 1-11. Published online: 13 May 2019.
