DOI: 10.5555/3014904.3014977

Optimizing memory efficiency for deep convolutional neural networks on GPUs

Published: 13 November 2016

Abstract

Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-the-art recognition accuracy. Because of their substantial compute and memory demands, however, they require significant execution time. The massively parallel computing capability of GPUs makes them one of the ideal platforms for accelerating CNNs, and a number of GPU-based CNN libraries have been developed. While existing work focuses mainly on the computational efficiency of CNNs, their memory efficiency has been largely overlooked. Yet CNNs have intricate data structures, and their memory behavior can have a significant impact on performance. In this work, we study the memory efficiency of various CNN layers and reveal the performance implications of both data layouts and memory access patterns. Experiments show the universal effect of our proposed optimizations on both single layers and whole networks, with speedups of up to 27.9x for a single layer and up to 5.6x for whole networks.
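The layout effect the abstract refers to can be made concrete with a minimal CUDA sketch. This is illustrative only, not code from the paper: the kernels, tensor sizes, and the use of the NCHW/CHWN layout names are assumptions. When threads are mapped along the batch dimension N, the CHWN layout places the elements touched by adjacent threads at adjacent addresses, so global-memory accesses coalesce; with NCHW and the same thread mapping, adjacent threads are C*H*W floats apart.

// A minimal sketch (not the paper's code): how tensor data layout changes
// global-memory coalescing for a simple per-element operation on a 4-D
// activation tensor. Layouts assumed: NCHW (batch outermost), CHWN (batch
// innermost). Threads are mapped along the batch dimension n in both kernels.
#include <cstdio>
#include <cuda_runtime.h>

// NCHW: element (n,c,h,w) sits at ((n*C + c)*H + h)*W + w, so adjacent
// threads (adjacent n) touch addresses C*H*W floats apart -> strided,
// uncoalesced accesses.
__global__ void scale_nchw(float* x, float a, int N, int C, int H, int W) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                x[((n * C + c) * H + h) * W + w] *= a;
}

// CHWN: element (n,c,h,w) sits at ((c*H + h)*W + w)*N + n, so adjacent
// threads touch adjacent addresses -> fully coalesced accesses.
__global__ void scale_chwn(float* x, float a, int N, int C, int H, int W) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
            for (int w = 0; w < W; ++w)
                x[((c * H + h) * W + w) * N + n] *= a;
}

int main() {
    const int N = 128, C = 64, H = 32, W = 32;   // illustrative sizes
    size_t bytes = (size_t)N * C * H * W * sizeof(float);
    float* x;
    cudaMalloc(&x, bytes);
    cudaMemset(x, 0, bytes);
    dim3 block(128), grid((N + block.x - 1) / block.x);
    scale_nchw<<<grid, block>>>(x, 2.0f, N, C, H, W);  // strided traffic
    scale_chwn<<<grid, block>>>(x, 2.0f, N, C, H, W);  // coalesced traffic
    cudaDeviceSynchronize();
    cudaFree(x);
    printf("done\n");
    return 0;
}

Profiling the two kernels (for example with Nsight Compute) would typically show much higher achieved memory bandwidth for the CHWN version; this layout sensitivity across CNN layers and whole networks is what the paper studies and optimizes.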



Information

Published In

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN: 9781467388153
  • Conference Chair: John West


Publisher

IEEE Press


Author Tags

  1. GPU acceleration
  2. convolutional neural network
  3. data layout
  4. deep learning
  5. memory efficiency

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

SC '16 paper acceptance rate: 81 of 442 submissions (18%).
Overall acceptance rate: 1,516 of 6,373 submissions (24%).


Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Sep 2024


Cited By

  • (2024) Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 10.1145/3677333.3678153, pp. 58-67. Published online: 12 August 2024.
  • (2023) ALT: Breaking the Wall between Data Layout and Loop Optimizations for Deep Learning Compilation. Proceedings of the Eighteenth European Conference on Computer Systems, 10.1145/3552326.3587440, pp. 199-214. Published online: 8 May 2023.
  • (2022) A data-centric optimization framework for machine learning. Proceedings of the 36th ACM International Conference on Supercomputing, 10.1145/3524059.3532364, pp. 1-13. Published online: 28 June 2022.
  • (2021) Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators. ACM Transactions on Architecture and Code Optimization, 19(1), pp. 1-26, 10.1145/3485137. Published online: 6 December 2021.
  • (2021) Analytical characterization and design space exploration for optimization of CNNs. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 10.1145/3445814.3446759, pp. 928-942. Published online: 19 April 2021.
  • (2021) Optimized Deep Learning Object Recognition for Drones using Embedded GPU. 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 10.1109/ETFA45728.2021.9613590, pp. 1-7. Published online: 7 September 2021.
  • (2020) On the Limits of Parallelizing Convolutional Neural Networks on GPUs. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 10.1145/3350755.3400266, pp. 567-569. Published online: 6 July 2020.
  • (2019) TASO. Proceedings of the 27th ACM Symposium on Operating Systems Principles, 10.1145/3341301.3359630, pp. 47-62. Published online: 27 October 2019.
  • (2019) Demystifying Parallel and Distributed Deep Learning. ACM Computing Surveys, 52(4), pp. 1-43, 10.1145/3320060. Published online: 30 August 2019.
  • (2019) OpenCL vs. Proceedings of the International Workshop on OpenCL, 10.1145/3318170.3318172, pp. 1-11. Published online: 13 May 2019.
