
Design and Implementation of Deep Learning 2D Convolutions on Modern CPUs

Published: 04 October 2023

Abstract

This article presents a new method for accelerating the execution of convolution layers in Deep Neural Networks. It provides the theoretical background needed to design and implement convolution layers efficiently on x86/x64 CPUs, based on the target layer parameters, the quantization level, and the hardware architecture; the approach is general and can also be applied to other processor families, e.g., Arm. The proposed method achieves high speedups over the state of the art, the Intel oneDNN library, by applying compiler optimizations such as vectorization, register blocking, and loop tiling more efficiently, using an analytical modelling approach to derive the optimization parameters. A thorough experimental evaluation was carried out on two Intel CPU platforms for DenseNet-121, ResNet-50, and SqueezeNet (112 different convolution layers in total), for both FP32 and int8 (quantized) input/output tensors. The experimental results show that the convolution layers of these models execute from 1.1x up to 7.2x faster.
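To make the optimizations named in the abstract concrete, the C sketch below shows a direct FP32 2D convolution (NCHW layout, stride 1, no padding) with loop tiling over output channels and output width, a register-blocked accumulator for a small block of outputs, and a unit-stride inner loop that a compiler can auto-vectorize. This is a minimal sketch under assumed parameters, not the paper's kernel: the layout, the function name conv2d_f32, and the tile sizes OC_TILE and W_BLOCK are illustrative placeholders, whereas the paper derives such parameters analytically from the layer shape, the quantization level, and the target CPU.

/* A minimal sketch (not the paper's actual kernel): direct FP32 2D convolution
 * in NCHW layout, stride 1, no padding, with loop tiling, register blocking,
 * and a vectorization-friendly inner loop.
 * OC_TILE and W_BLOCK are illustrative placeholders, not analytically derived. */
#include <stddef.h>

#define OC_TILE 8    /* output channels per tile (assumed value) */
#define W_BLOCK 16   /* output-width block held in registers (assumed value) */

void conv2d_f32(const float *in, const float *w, float *out,
                int IC, int OC, int IH, int IW, int KH, int KW)
{
    const int OH = IH - KH + 1;   /* "valid" convolution output height */
    const int OW = IW - KW + 1;   /* "valid" convolution output width  */

    for (int oc0 = 0; oc0 < OC; oc0 += OC_TILE)           /* tile: output channels */
        for (int oh = 0; oh < OH; ++oh)
            for (int ow0 = 0; ow0 < OW; ow0 += W_BLOCK) {  /* tile: output width */
                const int ocmax = oc0 + OC_TILE < OC ? oc0 + OC_TILE : OC;
                const int wlen  = ow0 + W_BLOCK < OW ? W_BLOCK : OW - ow0;

                for (int oc = oc0; oc < ocmax; ++oc) {
                    /* register blocking: this block of outputs stays live in
                     * registers across the whole ic/kh/kw reduction */
                    float acc[W_BLOCK] = {0};

                    for (int ic = 0; ic < IC; ++ic)
                        for (int kh = 0; kh < KH; ++kh) {
                            const float *irow =
                                in + ((size_t)ic * IH + oh + kh) * IW + ow0;
                            const float *wrow =
                                w + (((size_t)oc * IC + ic) * KH + kh) * KW;
                            for (int kw = 0; kw < KW; ++kw) {
                                const float wv = wrow[kw];
                                /* unit-stride update: the natural target for
                                 * SIMD vectorization (SSE/AVX/AVX-512) */
                                for (int i = 0; i < wlen; ++i)
                                    acc[i] += irow[i + kw] * wv;
                            }
                        }

                    float *orow = out + ((size_t)oc * OH + oh) * OW + ow0;
                    for (int i = 0; i < wlen; ++i)
                        orow[i] = acc[i];
                }
            }
}

In this sketch the choice of W_BLOCK trades register pressure against reuse of the loaded weight value, and OC_TILE controls how often input rows are re-read; the paper's contribution is precisely an analytical model for choosing such parameters per layer and per CPU rather than fixing them by hand.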


Cited By

• (2024) Register Blocking: An Analytical Modelling Approach for Affine Loop Kernels. Proceedings of the 21st ACM International Conference on Computing Frontiers, pp. 71-79. https://doi.org/10.1145/3649153.3649194. Online publication date: 7-May-2024.


      Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 12, Dec. 2023, 311 pages

      Publisher

      IEEE Press

      Qualifiers

      • Research-article
