
Design and Implementation of Deep Learning 2D Convolutions on Modern CPUs

Published: 04 October 2023

Abstract

This article presents a new method for accelerating the execution of convolution layers in Deep Neural Networks. It provides the theoretical background needed to design and implement convolution layers efficiently on x86/x64 CPUs, based on the target layer parameters, the quantization level, and the hardware architecture; the approach is general and can also be applied to other processor families, e.g., Arm. The proposed method achieves high speedups over the state of the art, the Intel oneDNN library, by applying compiler optimizations such as vectorization, register blocking, and loop tiling more efficiently, using an analytical modelling approach to derive the optimization parameters. A thorough experimental evaluation was carried out on two Intel CPU platforms for DenseNet-121, ResNet-50, and SqueezeNet (112 different convolution layers in total), for both FP32 and int8 (quantized) input/output tensors. The experimental results show that the convolution layers of these models execute from 1.1x up to 7.2x faster.
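To make the optimizations named in the abstract concrete, the C sketch below shows a direct FP32 2D convolution (NCHW layout, stride 1, no padding) with loop tiling over output channels and output width, a register-blocked accumulator for a small block of outputs, and a unit-stride inner loop that a compiler can auto-vectorize. This is a minimal sketch under assumed parameters, not the paper's kernel: the layout, the function name conv2d_f32, and the tile sizes OC_TILE and W_BLOCK are illustrative placeholders, whereas the paper derives such parameters analytically from the layer shape, the quantization level, and the target CPU.

/* A minimal sketch (not the paper's actual kernel): direct FP32 2D convolution
 * in NCHW layout, stride 1, no padding, with loop tiling, register blocking,
 * and a vectorization-friendly inner loop.
 * OC_TILE and W_BLOCK are illustrative placeholders, not analytically derived. */
#include <stddef.h>

#define OC_TILE 8    /* output channels per tile (assumed value) */
#define W_BLOCK 16   /* output-width block held in registers (assumed value) */

void conv2d_f32(const float *in, const float *w, float *out,
                int IC, int OC, int IH, int IW, int KH, int KW)
{
    const int OH = IH - KH + 1;   /* "valid" convolution output height */
    const int OW = IW - KW + 1;   /* "valid" convolution output width  */

    for (int oc0 = 0; oc0 < OC; oc0 += OC_TILE)           /* tile: output channels */
        for (int oh = 0; oh < OH; ++oh)
            for (int ow0 = 0; ow0 < OW; ow0 += W_BLOCK) {  /* tile: output width */
                const int ocmax = oc0 + OC_TILE < OC ? oc0 + OC_TILE : OC;
                const int wlen  = ow0 + W_BLOCK < OW ? W_BLOCK : OW - ow0;

                for (int oc = oc0; oc < ocmax; ++oc) {
                    /* register blocking: this block of outputs stays live in
                     * registers across the whole ic/kh/kw reduction */
                    float acc[W_BLOCK] = {0};

                    for (int ic = 0; ic < IC; ++ic)
                        for (int kh = 0; kh < KH; ++kh) {
                            const float *irow =
                                in + ((size_t)ic * IH + oh + kh) * IW + ow0;
                            const float *wrow =
                                w + (((size_t)oc * IC + ic) * KH + kh) * KW;
                            for (int kw = 0; kw < KW; ++kw) {
                                const float wv = wrow[kw];
                                /* unit-stride update: the natural target for
                                 * SIMD vectorization (SSE/AVX/AVX-512) */
                                for (int i = 0; i < wlen; ++i)
                                    acc[i] += irow[i + kw] * wv;
                            }
                        }

                    float *orow = out + ((size_t)oc * OH + oh) * OW + ow0;
                    for (int i = 0; i < wlen; ++i)
                        orow[i] = acc[i];
                }
            }
}

In this sketch the choice of W_BLOCK trades register pressure against reuse of the loaded weight value, and OC_TILE controls how often input rows are re-read; the paper's contribution is precisely an analytical model for choosing such parameters per layer and per CPU rather than fixing them by hand.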


Cited By

• (2024) Register Blocking: An Analytical Modelling Approach for Affine Loop Kernels. Proceedings of the 21st ACM International Conference on Computing Frontiers, pp. 71-79. https://doi.org/10.1145/3649153.3649194. Online publication date: 7-May-2024.


      Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 12, Dec. 2023, 311 pages

      Publisher

      IEEE Press

      Qualifiers

      • Research-article
