Research Article
DOI: 10.1145/3472456.3472473

Optimizing Winograd-Based Convolution with Tensor Cores

Published: 05 October 2021

Abstract

Convolution is one of the most time-consuming parts of convolutional neural networks (CNNs). State-of-the-art CNNs use small, 3 × 3 filters. Recent work on Winograd convolution substantially reduces the computational complexity of convolution, making it much faster. However, existing implementations of Winograd convolution are limited to small tiles, i.e., F(4 × 4, 3 × 3) and F(2 × 2, 3 × 3), where 4 × 4 and 2 × 2 are the output tile sizes and 3 × 3 is the filter size, and to single-precision data. In this paper, we propose an optimized mixed-precision F(6 × 6, 3 × 3) Winograd convolution implementation for NVIDIA Ampere GPUs using Tensor Cores. Our experiments show that the accuracy of mixed-precision F(6 × 6, 3 × 3) Winograd convolution is sufficient for CNN inference. Our method achieves up to 15.71× and 2.41× speedups on an NVIDIA Ampere A100 over the state-of-the-art Winograd-based convolution and GEMM-based convolution in cuDNN 8.1.0, respectively. Moreover, we integrate our F(6 × 6, 3 × 3) Winograd convolution implementation into NVIDIA TensorRT, NVIDIA's C++ inference library for GPUs, as custom layer plugins, and build the complete VGG network from our custom Winograd convolution layers and the other layers TensorRT supports. The experiments show that the accuracy of the whole VGG network using our F(6 × 6, 3 × 3) Winograd convolution is 71.24%, while the accuracy of the same network computed in FP32 is 71.22%.
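
For context on the arithmetic savings behind these results: an F(m × m, r × r) Winograd convolution computes an m × m output tile with (m + r − 1)² element-wise multiplications after the input and filter transforms, versus m²r² for direct convolution. F(6 × 6, 3 × 3) therefore needs 8 × 8 = 64 multiplications per tile instead of 324 (a 5.06× reduction), where F(2 × 2, 3 × 3) saves only 36/16 = 2.25×. The trade-off is that larger tiles use transform matrices with a wider dynamic range of constants, which is what makes the mixed-precision accuracy question nontrivial. The following is a minimal 1D F(2, 3) sketch in NumPy, not the paper's implementation; it only illustrates the transform structure Y = Aᵀ[(Gg) ⊙ (Bᵀd)] that the paper's kernel evaluates in 2D with Tensor Cores:

    import numpy as np

    # Standard F(2, 3) Winograd transform matrices (1D case).
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=np.float64)   # input transform
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]])                    # filter transform
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=np.float64)    # output transform

    def winograd_f2_3(d, g):
        """Two outputs of a 3-tap valid correlation with 4 multiplies (direct: 6)."""
        U = G @ g    # transformed filter (precomputable, reused across tiles)
        V = BT @ d   # transformed input tile
        M = U * V    # element-wise multiply stage; this is what maps to GEMMs
        return AT @ M

    d = np.random.rand(4)                        # 4-element input tile
    g = np.random.rand(3)                        # 3-tap filter
    direct = np.array([d[0:3] @ g, d[1:4] @ g])  # reference: direct correlation
    assert np.allclose(winograd_f2_3(d, g), direct)

In 2D the same pattern becomes Y = Aᵀ[(GgGᵀ) ⊙ (BᵀdB)]A, and across many tiles and channels the element-wise stage turns into batched matrix multiplications, which is where FP16 Tensor Core GEMMs enter. The helper above is a hypothetical FP64 toy, intended only to make the transform structure concrete.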





Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021


Author Tags

  1. GPU
  2. Winograd
  3. convolution optimization
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 132
  • Downloads (last 6 weeks): 8

Reflects downloads up to 21 Nov 2024

Cited By
  • (2024) End-to-End Deployment of Winograd-Based DNNs on Edge GPU. Electronics 13(22), 4538. https://doi.org/10.3390/electronics13224538. Online publication date: 19-Nov-2024.
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, 241–251. https://doi.org/10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) Im2col-Winograd: An Efficient and Flexible Fused-Winograd Convolution for NHWC Format on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1072–1081. https://doi.org/10.1145/3673038.3673039. Online publication date: 12-Aug-2024.
  • (2024) Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs. ACM Transactions on Architecture and Code Optimization 21(1), 1–26. https://doi.org/10.1145/3632956. Online publication date: 19-Jan-2024.
  • (2023) Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems 34(1), 246–261. https://doi.org/10.1109/TPDS.2022.3217824. Online publication date: 1-Jan-2023.
  • (2022) WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14. https://doi.org/10.1109/SC41404.2022.00059. Online publication date: Nov-2022.
  • (2022) A Winograd-Based Highly-Parallel Convolution Engine for 8-bit CNN Acceleration. 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 395–398. https://doi.org/10.1109/AICAS54282.2022.9869911. Online publication date: 13-Jun-2022.
  • (2022) Optimizing Small Channel 3D Convolution on GPU with Tensor Core. Parallel Computing 113(C). https://doi.org/10.1016/j.parco.2022.102954. Online publication date: 1-Oct-2022.
