DOI: 10.1145/3559009.3569678
poster

Improving Convolution via Cache Hierarchy Tiling and Reduced Packing

Published: 27 January 2023

Abstract

Convolution is one of the most computationally intensive operations in machine learning models, and it is usually computed with the well-known Im2Col + BLAS method. This work proposes a novel convolution algorithm that improves upon Im2Col + BLAS by introducing (a) CSA: a convolution-specific 3D cache-blocking analysis that focuses on tile reuse across the cache hierarchy, (b) CSO: a macro-kernel that follows CSA to compute the convolution by tiling it, (c) a specialized micro-kernel that seeks to achieve peak hardware performance, and (d) packing routines for the input tensor and filters that bridge the gap between the tiling and the micro-kernel. Our approach speeds up end-to-end machine learning model inference by up to 26% on x86 and 21% on POWER10 architectures.
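For readers unfamiliar with the baseline, the following is a minimal sketch of the Im2Col + GEMM approach that the abstract compares against, with a generic cache-blocked matrix multiply standing in for the BLAS call. It is not the paper's CSA/CSO method or its micro-kernel. The function names, the nested-list tensor layout (channels x height x width), the stride of 1 with no padding, and the `tile` parameter are all assumptions made for illustration.

```python
def im2col(x, kh, kw):
    """Unfold a C x H x W input (nested lists) into a (C*kh*kw) x (OH*OW) matrix."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    oh, ow = h - kh + 1, w - kw + 1  # assumed: stride 1, no padding
    cols = []
    for ch in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, filter-row, filter-col) position.
                cols.append([x[ch][r + i][s + j]
                             for r in range(oh) for s in range(ow)])
    return cols, oh, ow


def blocked_matmul(a, b, tile=2):
    """Generic cache-blocked GEMM (out = a @ b) over tile x tile blocks."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, n, tile):
                # Work on one block so its operands can stay cache-resident.
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, n)):
                            out[i][j] += a_ip * b[p][j]
    return out


def conv2d_im2col(x, filters, tile=2):
    """Convolution (cross-correlation) as filter matrix times im2col matrix."""
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    cols, oh, ow = im2col(x, kh, kw)
    # Flatten each C x kh x kw filter into one row, matching the im2col row order.
    fmat = [[v for ch in f for row in ch for v in row] for f in filters]
    flat = blocked_matmul(fmat, cols, tile)
    # Reshape each output row of length OH*OW back into OH x OW.
    return [[row[r * ow:(r + 1) * ow] for r in range(oh)] for row in flat]


def conv2d_direct(x, filters):
    """Reference direct convolution, used only to check the result."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    oh, ow = h - kh + 1, w - kw + 1
    return [[[sum(f[ch][i][j] * x[ch][r + i][s + j]
                  for ch in range(c) for i in range(kh) for j in range(kw))
              for s in range(ow)] for r in range(oh)] for f in filters]
```

Note that `im2col` materializes every input element up to `kh * kw` times before the multiply starts; that replicated buffer is the kind of packing overhead the "reduced packing" in the title refers to avoiding.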

Cited By

  • autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (2024), 1-15. DOI: 10.1109/SC41406.2024.00027. Online publication date: 17-Nov-2024
  • Energy-Aware Tile Size Selection for Affine Programs on GPUs. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (2024), 13-27. DOI: 10.1109/CGO57630.2024.10444795. Online publication date: 2-Mar-2024

Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022
569 pages
ISBN: 9781450398688
DOI: 10.1145/3559009
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache blocking
  2. convolution
  3. data transfer
  4. packing

Qualifiers

  • Poster

Conference

PACT '22

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
