Research Article
DOI: 10.1145/3472456.3472473

Optimizing Winograd-Based Convolution with Tensor Cores

Published: 05 October 2021

Abstract

Convolution is one of the most time-consuming parts of convolutional neural networks (CNNs). State-of-the-art CNNs use small, 3 × 3 filters. Recent work on Winograd convolution substantially reduces the computational complexity of convolution, making it much faster. However, existing implementations of Winograd convolution are limited to small tiles, i.e., F(4 × 4, 3 × 3) and F(2 × 2, 3 × 3), where 4 × 4 and 2 × 2 are the output tile sizes and 3 × 3 is the filter size, and to single-precision data. In this paper, we propose an optimized mixed-precision F(6 × 6, 3 × 3) Winograd convolution implementation for NVIDIA Ampere GPUs using Tensor Cores. Our experiments show that the accuracy of mixed-precision F(6 × 6, 3 × 3) Winograd convolution is sufficient for CNN inference. Our method achieves up to 15.71× and 2.41× speedups on an NVIDIA Ampere A100 over the state-of-the-art Winograd-based convolution and GEMM-based convolution in cuDNN 8.1.0, respectively. Moreover, we integrate our F(6 × 6, 3 × 3) Winograd convolution implementation into NVIDIA TensorRT, NVIDIA's C++ inference library for GPUs, as custom layer plugins, and build the complete VGG network from our custom Winograd convolution layers and the other layers TensorRT supports. The experiments show that the accuracy of the whole VGG network using our F(6 × 6, 3 × 3) Winograd convolution is 71.24%, while the accuracy of the same network computed in FP32 is 71.22%.
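
For context on the arithmetic savings behind these results: an F(m × m, r × r) Winograd convolution computes an m × m output tile with (m + r − 1)² element-wise multiplications after the input and filter transforms, versus m²r² for direct convolution. F(6 × 6, 3 × 3) therefore needs 8 × 8 = 64 multiplications per tile instead of 324 (a 5.06× reduction), where F(2 × 2, 3 × 3) saves only 36/16 = 2.25×. The trade-off is that larger tiles use transform matrices with a wider dynamic range of constants, which is what makes the mixed-precision accuracy question nontrivial. The following is a minimal 1D F(2, 3) sketch in NumPy, not the paper's implementation; it only illustrates the transform structure Y = Aᵀ[(Gg) ⊙ (Bᵀd)] that the paper's kernel evaluates in 2D with Tensor Cores:

    import numpy as np

    # Standard F(2, 3) Winograd transform matrices (1D case).
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=np.float64)   # input transform
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]])                    # filter transform
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=np.float64)    # output transform

    def winograd_f2_3(d, g):
        """Two outputs of a 3-tap valid correlation with 4 multiplies (direct: 6)."""
        U = G @ g    # transformed filter (precomputable, reused across tiles)
        V = BT @ d   # transformed input tile
        M = U * V    # element-wise multiply stage; this is what maps to GEMMs
        return AT @ M

    d = np.random.rand(4)                        # 4-element input tile
    g = np.random.rand(3)                        # 3-tap filter
    direct = np.array([d[0:3] @ g, d[1:4] @ g])  # reference: direct correlation
    assert np.allclose(winograd_f2_3(d, g), direct)

In 2D the same pattern becomes Y = Aᵀ[(GgGᵀ) ⊙ (BᵀdB)]A, and across many tiles and channels the element-wise stage turns into batched matrix multiplications, which is where FP16 Tensor Core GEMMs enter. The helper above is a hypothetical FP64 toy, intended only to make the transform structure concrete.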





Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021


Author Tags

  1. GPU
  2. Winograd
  3. convolution optimization
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 132
  • Downloads (last 6 weeks): 8

Reflects downloads up to 21 Nov 2024

Cited By
  • (2024) End-to-End Deployment of Winograd-Based DNNs on Edge GPU. Electronics 13(22), 4538. https://doi.org/10.3390/electronics13224538. Online publication date: 19-Nov-2024.
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, 241–251. https://doi.org/10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1072–1081. https://doi.org/10.1145/3673038.3673039. Online publication date: 12-Aug-2024.
  • (2024) Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs. ACM Transactions on Architecture and Code Optimization 21(1), 1–26. https://doi.org/10.1145/3632956. Online publication date: 19-Jan-2024.
  • (2023) Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems 34(1), 246–261. https://doi.org/10.1109/TPDS.2022.3217824. Online publication date: 1-Jan-2023.
  • (2022) WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14. https://doi.org/10.1109/SC41404.2022.00059. Online publication date: Nov-2022.
  • (2022) A Winograd-Based Highly-Parallel Convolution Engine for 8-bit CNN Acceleration. 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 395–398. https://doi.org/10.1109/AICAS54282.2022.9869911. Online publication date: 13-Jun-2022.
  • (2022) Optimizing Small Channel 3D Convolution on GPU with Tensor Core. Parallel Computing 113(C). https://doi.org/10.1016/j.parco.2022.102954. Online publication date: 1-Oct-2022.
