End-to-End Deployment of Winograd-Based DNNs on Edge GPU
Figure 1. The three steps of the F(4,3) Winograd algorithm: (1) input and weight transformation, (2) element-wise matrix multiplication (EWMM) of the transformed matrices, and (3) inverse transformation to produce the spatial output feature maps. The numerical instability due to quantization is highlighted. (A minimal numerical sketch of these three steps follows the figure captions.)
Figure 2. Comparison of (a) the standard quantized Winograd transformation and (b) the quantized Winograd transformation that leverages trainable clipping factors to better exploit the quantized range.
Figure 3. Overview of the proposed Winograd-aware quantized training. The straight-through estimator (STE) is used to approximate the gradient of the quantization function. The trainable clipping factors c, α_ta, and α_tw are highlighted in red.
Figure 4. Input transformation kernel overview. The input volume is divided into sub-volumes, and each thread block is responsible for the transformation of one sub-volume.
Figure 5. Element-wise matrix multiplication kernel overview. The computation is organized in 6 × 6 GEMMs, each responsible for computing N_tiles × C_o output pixels in the Winograd domain.
Figure 6. Inverse transformation kernel overview. The Winograd tiles produced by the EWMM kernel are transformed back to the spatial domain. Each thread block is responsible for the computation of 4 × 4 × P_oc output pixels.
Figure 7. Numerical distributions of the transformed weights and activations for example layers of ResNet-20 on CIFAR-10. The values in the clipped range (green) contain sufficient information to maintain high accuracy with fully 8-bit Winograd.
Figure 8. Latency speedup of the custom Winograd F(4,3) kernels compared to cuDNN convolution on Tensor Cores (int8x32).
Figure 9. The latency contribution of each of the three steps of the Winograd F(4,3) algorithm. In each sub-figure, the spatial dimensions are fixed while the channel dimensions are varied.
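For reference, the three steps in Figure 1 can be written in a few lines of NumPy for a single 6 × 6 input tile and a single 3 × 3 filter. The B, G, and A matrices below are the standard Lavin–Gray choices for F(4,3) and are assumed here; the paper's kernels apply the same algorithm to quantized int8 data across many tiles and channels.

```python
import numpy as np

# Standard Lavin-Gray transform matrices for F(4x4, 3x3) (assumed; the paper's
# quantized kernels implement the same algorithm on int8 data).
BT = np.array([
    [4,  0, -5,  0, 1, 0],
    [0, -4, -4,  1, 1, 0],
    [0,  4, -4, -1, 1, 0],
    [0, -2, -1,  2, 1, 0],
    [0,  2, -1, -2, 1, 0],
    [0,  4,  0, -5, 0, 1],
], dtype=np.float64)

G = np.array([
    [ 1/4,     0,    0],
    [-1/6,  -1/6, -1/6],
    [-1/6,   1/6, -1/6],
    [1/24,  1/12,  1/6],
    [1/24, -1/12,  1/6],
    [   0,     0,    1],
], dtype=np.float64)

AT = np.array([
    [1, 1,  1, 1,  1, 0],
    [0, 1, -1, 2, -2, 0],
    [0, 1,  1, 4,  4, 0],
    [0, 1, -1, 8, -8, 1],
], dtype=np.float64)

def winograd_f4_3_tile(d, g):
    """One 6x6 input tile `d` and one 3x3 filter `g` -> 4x4 output tile."""
    U = G @ g @ G.T          # (1) weight transformation, 6x6
    V = BT @ d @ BT.T        # (1) input transformation, 6x6
    M = U * V                # (2) element-wise matrix multiplication (EWMM)
    return AT @ M @ AT.T     # (3) inverse (output) transformation, 4x4

# Sanity check against a direct 'valid' correlation on one tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((6, 6))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(4)]
                   for i in range(4)])
assert np.allclose(winograd_f4_3_tile(d, g), direct)
```

Over full feature maps, step (2) becomes, for each of the 36 positions of the 6 × 6 tile, a GEMM over the channel dimension, which is how the EWMM kernel in Figure 5 organizes the work.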
Abstract
1. Introduction
- We model the numerical error of quantized Winograd at training time, making the model aware of quantization errors and overflows in the Winograd domain.
- We introduce a trainable clipping factor for quantizing the transformed parameters in the Winograd domain, resulting in a 2.45× MAC operation reduction for ResNet-18 on ImageNet with only ∼1 p.p. accuracy degradation (a minimal sketch of such a quantizer follows this list).
- We design optimized 8-bit CUDA kernels for the F(4×4, 3×3) variant of the Winograd algorithm on an edge GPU, taking advantage of the efficient Tensor Cores to further speed up the quantized algorithm and achieving up to 3.41× latency reduction compared to the standard convolution algorithm.
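As a rough illustration of the second contribution, the sketch below shows a PACT-style fake-quantizer with a trainable clipping factor and a straight-through estimator, written in PyTorch. Module and parameter names (and the default clipping value) are illustrative, not the paper's implementation; in the paper, such factors (α_ta, α_tw) are applied to the transformed activations and weights in the Winograd domain rather than to a generic tensor.

```python
import torch
import torch.nn as nn


class TrainableClipQuant(nn.Module):
    """Hypothetical PACT-style symmetric fake-quantizer with a learnable clipping factor."""

    def __init__(self, bits: int = 8, alpha_init: float = 6.0):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1           # 127 for signed 8-bit
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clip to the learned range [-alpha, alpha]; gradients reach alpha through the clip.
        x_c = torch.minimum(torch.maximum(x, -self.alpha), self.alpha)
        scale = self.alpha / self.qmax
        x_q = torch.round(x_c / scale) * scale    # fake-quantized values
        # Straight-through estimator: forward uses x_q, backward treats rounding as identity.
        return x_c + (x_q - x_c).detach()


# Example: quantize a wide-range tensor and backpropagate into the clipping factor.
quant = TrainableClipQuant(bits=8, alpha_init=10.0)
v = torch.randn(16, 6, 6) * 20.0
loss = quant(v).pow(2).mean()
loss.backward()
print(quant.alpha.grad)   # non-zero: the clipping factor is trained with the network
```

The clipping factor is updated by backpropagation together with the network weights, so the quantization grid can adapt to the much wider value range produced by the Winograd transforms.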
2. Related Works
2.1. Post-Training Winograd-Based Quantized CNNs
2.2. Winograd-Aware Training (WAT)
2.3. Winograd Algorithm on GPU
3. Materials and Methods
3.1. Quantized Convolutional Algorithm
3.2. Winograd Algorithm
3.3. Clipping Factors in the Winograd Domain
3.3.1. Trainable Clipping Factors
3.4. Winograd F(4,3) Convolution on GPU
3.4.1. Input Transform
3.4.2. Element-Wise Matrix Multiplication
3.4.3. Output Transform
4. Experiments
4.1. Quantized Winograd with Clipping Factors
4.2. Effect of Clipping Factors
4.3. Winograd GPU Kernel Speedup
4.4. Contribution of Each Step to the Latency
4.5. Layer-Wise Latency Comparison
5. Conclusions and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CNN | Convolutional Neural Network |
| EWMM | Element-Wise Matrix Multiplication |
| PTQ | Post-Training Quantization |
| QAT | Quantization-Aware Training |
| MAC | Multiply and Accumulate |
| WAT | Winograd-Aware Training |
| FP | Floating Point |
| STE | Straight-Through Estimator |
| GEMM | General Matrix Multiply |
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Implementing the Tensorflow Deep Learning Framework on Qualcomm’s Low-Power DSP. 2020. Available online: https://www.edge-ai-vision.com/2017/07/implementing-the-tensorflow-deep-learning-framework-on-qualcomms-low-power-dsp-a-presentation-from-google/ (accessed on 23 October 2024).
- Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv 2018, arXiv:1805.06085.
- Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160.
- Khalil, K.; Eldash, O.; Kumar, A.; Bayoumi, M. Designing Novel AAD Pooling in Hardware for a Convolutional Neural Network Accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2022, 30, 303–314.
- Nvidia Jetson Nano Developer KIT. 2019. Available online: https://cdn.sparkfun.com/assets/0/7/f/9/d/jetson-nano-devkit-datasheet-updates-us-v3.pdf (accessed on 17 July 2024).
- Snapdragon 8 Gen 1 Mobile Platform. 2021. Available online: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/snapdragon-8-gen-1-mobile-platform-product-brief.pdf (accessed on 23 October 2024).
- Google Coral. 2020. Available online: https://coral.ai/static/files/Coral-M2-Dual-EdgeTPU-datasheet.pdf (accessed on 23 October 2024).
- Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
- Maji, P.; Mundy, A.; Dasika, G.; Beu, J.; Mattina, M.; Mullins, R. Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs. arXiv 2019, arXiv:1903.01521.
- Alam, S.A.; Anderson, A.; Barabasz, B.; Gregg, D. Winograd Convolution for Deep Neural Networks: Efficient Point Selection. arXiv 2022, arXiv:2201.10369.
- Barabasz, B.; Anderson, A.; Soodhalter, K.M.; Gregg, D. Error analysis and improving the accuracy of Winograd convolution for deep neural networks. ACM Trans. Math. Softw. (TOMS) 2020, 46, 1–33.
- Kim, M.; Park, C.; Kim, S.; Hong, T.; Ro, W.W. Efficient Dilated-Winograd Convolutional Neural Networks. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2711–2715.
- Jiang, J.; Chen, X.; Tsui, C.Y. A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters. arXiv 2021, arXiv:2102.13272.
- Yang, C.; Wang, Y.; Wang, X.; Geng, L. WRA: A 2.2-to-6.3 TOPS highly unified dynamically reconfigurable accelerator using a novel Winograd decomposition algorithm for convolutional neural networks. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 66, 3480–3493.
- Liu, X.; Pool, J.; Han, S.; Dally, W.J. Efficient Sparse-Winograd Convolutional Neural Networks. arXiv 2018, arXiv:1802.06367.
- Yang, T.; Liao, Y.; Shi, J.; Liang, Y.; Jing, N.; Jiang, L. A Winograd-Based CNN Accelerator with a Fine-Grained Regular Sparsity Pattern. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–4 September 2020; pp. 254–261.
- Fernandez-Marques, J.; Whatmough, P.; Mundy, A.; Mattina, M. Searching for Winograd-aware Quantized Networks. In Proceedings of the Machine Learning and Systems, Austin, TX, USA, 2–4 March 2020; Dhillon, I., Papailiopoulos, D., Sze, V., Eds.; Volume 2, pp. 14–29.
- Mori, P.; Frickenstein, L.; Sampath, S.B.; Thoma, M.; Fasfous, N.; Vemparala, M.R.; Frickenstein, A.; Unger, C.; Stechele, W.; Mueller-Gritschneder, D.; et al. Wino Vidi Vici: Conquering Numerical Instability of 8-Bit Winograd Convolution for Accurate Inference Acceleration on Edge. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 53–62.
- Li, G.; Jia, Z.; Feng, X.; Wang, Y. Lowino: Towards efficient low-precision winograd convolutions on modern cpus. In Proceedings of the 50th International Conference on Parallel Processing, Lemont, IL, USA, 9–12 August 2021; pp. 1–11.
- Chikin, V.; Kryzhanovskiy, V. Channel Balancing for Accurate Quantization of Winograd Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Barabasz, B. Quantaized winograd/toom-cook convolution for dnns: Beyond canonical polynomials base. arXiv 2020, arXiv:2004.11077.
- Andri, R.; Bussolino, B.; Cipolletta, A.; Cavigelli, L.; Wang, Z. Going Further with Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4 × 4 Tiles. In Proceedings of the 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Chicago, IL, USA, 1–5 October 2022; pp. 582–598.
- Castro, R.L.; Andrade, D.; Fraguela, B.B. OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA. Mathematics 2021, 9, 2033.
- Liu, J.; Yang, D.; Lai, J. Optimizing Winograd-Based Convolution with Tensor Cores. In Proceedings of the 50th International Conference on Parallel Processing (ICPP ’21), New York, NY, USA, 9–12 August 2021.
- Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432.
- Nvidia. cuBLAS. 2019. Available online: https://docs.nvidia.com/cuda/cublas/index.html (accessed on 23 October 2024).
- Nvidia. Tensor Cores. 2019. Available online: https://www.nvidia.com/en-us/data-center/tensor-cores/ (accessed on 23 October 2024).
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252.
- Cordts, M.; Omran, M.; Ramos, S.; Scharwächter, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset. In Proceedings of the CVPR Workshop on the Future of Datasets in Vision, Boston, MA, USA, 7–12 June 2015.
- Nvidia. Jetpack. 2024. Available online: https://developer.nvidia.com/embedded/jetpack (accessed on 4 November 2024).
| Algorithm |  | Weight Memory | Reduction: Theoretical (S) | Reduction: ResNet-18 |
|---|---|---|---|---|
| F(2,3) | 4× | 1.78× | 2.25× | 1.76× |
| F(3,3) | 36× | 2.78× | 3.24× | 2.05× |
| F(4,3) | 100× | 4× | 4× | 2.45× |
| F(6,3) | 156.25× | 7.1× | 5.06× | 2.24× |
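The Weight Memory and theoretical Reduction (S) columns in the table above follow directly from the tile sizes: an F(m,3) output tile replaces m² · 3² spatial multiplications with (m + 3 − 1)² Winograd-domain multiplications, while each transformed 3 × 3 filter grows to (m + 3 − 1)² values. A quick check (these are the usual formulae for Winograd F(m,r); the ResNet-18 column additionally depends on which layers of the network are Winograd-eligible):

```python
# Reproduce the weight-memory overhead and theoretical MAC reduction for F(m,3).
for m in (2, 3, 4, 6):
    t = m + 3 - 1                           # Winograd tile size (m + r - 1, r = 3)
    weight_memory = t * t / 9               # transformed vs. spatial 3x3 weights
    reduction = (m * m * 9) / (t * t)       # multiplications saved per output tile
    print(f"F({m},3): weight memory {weight_memory:.2f}x, reduction {reduction:.2f}x")
# F(2,3): 1.78x, 2.25x   F(3,3): 2.78x, 3.24x   F(4,3): 4.00x, 4.00x   F(6,3): 7.11x, 5.06x
```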
| Dataset | Model | Method | Bits | QAT/WAT | Winograd Algorithm | Clipping | Saving | Top-1 [%] |
|---|---|---|---|---|---|---|---|---|
| Cifar-10 [32] | ResNet-20 [1] | Conv [1] | 32 | ✗ | - | - | - | 91.61 |
| | | QConv [5] | 8 | ✓ | - | - | - | 91.39 |
| | | WinoQConv | 8 | ✗ | F(4,3) | ✗ | 3.4× | 35.36 |
| | | Ours | 8 | ✗ | F(4,3) | ✓ | 3.4× | 82.11 |
| | | | | ✓ | F(4,3) | ✗ | 3.4× | 89.69 |
| | | | | ✓ | F(4,3) | ✓ | 3.4× | 90.89 |
| | VGG-9 [31] | QConv | 8 | ✓ | - | - | - | 93.11 |
| | | WinoQConv | 8 | ✓ | F(4,3) | ✗ | 3.84× | 88.97 |
| | | Ours | 8 | ✓ | F(4,3) | ✓ | 3.84× | 92.29 |
| Imagenet [33] | ResNet-18 [1] | Conv [1] | 32 | ✗ | - | - | - | 71.00 |
| | | QConv [5] | 8 | ✓ | - | - | - | 70.54 |
| | | WinoQConv | 8 | ✗ | F(4,3) | ✗ | 2.45× | 5.45 |
| | | Ours | 8 | ✓ | F(4,3) | ✗ | 2.45× | 65.71 |
| | | | | ✓ | F(4,3) | ✓ | 2.45× | 69.14 |
| CityScapes [34] | DeepLabV3+ [3] | QConv [5] | 8 | ✓ | - | - | - | 67.82 |
| | | Ours | 8 | ✓ | F(4,3) | ✓ | 2.56× | 66.57 |
| Name | C_i | C_o | W | H | int8x4 [ms] | int8x32 [ms] | Ours [ms] | Best [ms] | Speedup [×] |
|---|---|---|---|---|---|---|---|---|---|
| conv2_block1_2 | 64 | 64 | 512 | 256 | 25.84 | 5.08 | 13.76 | 5.08 | 1.00 |
| conv2_block2_1 | 64 | 64 | 512 | 256 | 25.84 | 5.08 | 13.76 | 5.08 | 1.00 |
| conv2_block2_2 | 64 | 64 | 512 | 256 | 25.84 | 5.08 | 13.76 | 5.08 | 1.00 |
| conv3_block1_2 | 128 | 128 | 256 | 128 | 25.13 | 4.79 | 7.05 | 4.79 | 1.00 |
| conv3_block2_1 | 128 | 128 | 256 | 128 | 25.13 | 4.79 | 7.05 | 4.79 | 1.00 |
| conv3_block2_2 | 128 | 128 | 256 | 128 | 25.13 | 4.79 | 7.05 | 4.79 | 1.00 |
| conv4_block1_2 | 256 | 256 | 128 | 64 | 24.80 | 4.57 | 3.96 | 3.96 | 1.15 |
| conv4_block2_1 | 256 | 256 | 128 | 64 | 24.80 | 4.57 | 3.96 | 3.96 | 1.15 |
| conv4_block2_2 | 256 | 256 | 128 | 64 | 24.80 | 4.57 | 3.96 | 3.96 | 1.15 |
| conv5_block1_1 | 256 | 512 | 128 | 64 | 49.54 | 9.07 | 6.22 | 6.22 | 1.46 |
| conv5_block1_2 | 512 | 512 | 128 | 64 | 98.33 | 17.82 | 8.49 | 8.49 | 2.10 |
| conv5_block2_1 | 512 | 512 | 128 | 64 | 98.33 | 17.82 | 8.49 | 8.49 | 2.10 |
| conv5_block2_2 | 512 | 512 | 128 | 64 | 98.33 | 17.82 | 8.49 | 8.49 | 2.10 |
| Total | | | | | 571.84 | 105.85 | 106.00 | 73.18 | 1.44 |
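The Best column and the overall speedup at the bottom of the table can be reproduced by picking, for each layer, the faster of the cuDNN int8x32 path and the custom Winograd kernels; this reading is an assumption, but it matches every row. A small check, with the per-layer latencies copied from the table:

```python
# (cuDNN int8x32, custom Winograd) latencies in ms, one tuple per layer.
layers = [(5.08, 13.76)] * 3 + [(4.79, 7.05)] * 3 + [(4.57, 3.96)] * 3 \
       + [(9.07, 6.22)] + [(17.82, 8.49)] * 3
best = [min(cudnn, wino) for cudnn, wino in layers]
total_cudnn = sum(cudnn for cudnn, _ in layers)
total_best = sum(best)
print(f"int8x32: {total_cudnn:.2f} ms, best-of-both: {total_best:.2f} ms, "
      f"speedup: {total_cudnn / total_best:.2f}x")
# 105.85 ms, 73.18 ms, ~1.45x (the table reports 1.44x; the small gap is rounding)
```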
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mori, P.; Rahman, M.S.; Frickenstein, L.; Sampath, S.B.; Thoma, M.; Fasfous, N.; Vemparala, M.R.; Frickenstein, A.; Stechele, W.; Passerone, C. End-to-End Deployment of Winograd-Based DNNs on Edge GPU. Electronics 2024, 13, 4538. https://doi.org/10.3390/electronics13224538