poster

Improving Convolution via Cache Hierarchy Tiling and Reduced Packing

Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, Guido Araujo

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques

Pages 538 - 539

https://doi.org/10.1145/3559009.3569678

Published: 27 January 2023

Abstract

Convolution is one of the most computationally intensive operations in machine learning models, and it is usually computed with the well-known Im2Col + BLAS method. This work proposes a novel convolution algorithm that improves upon Im2Col + BLAS by introducing (a) CSA, a convolution-specific 3D cache-blocking analysis that focuses on tile reuse across the cache hierarchy; (b) CSO, a macro-kernel that follows CSA to compute the convolution by tiling it; (c) a specialized micro-kernel that seeks to achieve peak hardware performance; and (d) packing routines for the input tensor and filters that bridge the gap between tiling and the micro-kernel. Our approach speeds up end-to-end machine learning model inference by up to 26% on x86 and 21% on POWER10 architectures.
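For readers unfamiliar with the baseline, the following is a minimal pure-Python sketch of the Im2Col + GEMM idea, with a row-tiled variant that unfolds only one slab of the input at a time. It is an illustration only and is not the authors' CSA, CSO, micro-kernel, or packing routines. All function names, the [C][H][W] and [M][C][KH][KW] layouts, the stride-1/no-padding restriction, and the `tile_rows` parameter are assumptions made for this example; the triple-loop `matmul` merely stands in for a BLAS GEMM call.

```python
def direct_conv2d(x, w):
    """Reference direct convolution (stride 1, no padding).
    x: input [C][H][W]; w: filters [M][C][KH][KW]; returns [M][OH][OW]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):
        for oh in range(OH):
            for ow in range(OW):
                acc = 0
                for c in range(C):
                    for kh in range(KH):
                        for kw in range(KW):
                            acc += w[m][c][kh][kw] * x[c][oh + kh][ow + kw]
                out[m][oh][ow] = acc
    return out


def im2col(x, KH, KW):
    """Unfold x into a (C*KH*KW) x (OH*OW) matrix: one column per output pixel."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    OH, OW = H - KH + 1, W - KW + 1
    cols = [[x[c][oh + kh][ow + kw] for oh in range(OH) for ow in range(OW)]
            for c in range(C) for kh in range(KH) for kw in range(KW)]
    return cols, OH, OW


def matmul(a, b):
    """Plain matrix product; a stand-in for a BLAS GEMM call."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def im2col_conv2d(x, w):
    """Convolution as one matrix product over the fully unfolded input."""
    M, C, KH, KW = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    cols, OH, OW = im2col(x, KH, KW)
    # Flatten each filter into a row, in the same (c, kh, kw) order as im2col.
    wmat = [[w[m][c][kh][kw]
             for c in range(C) for kh in range(KH) for kw in range(KW)]
            for m in range(M)]
    flat = matmul(wmat, cols)  # M x (OH*OW)
    return [[row[oh * OW:(oh + 1) * OW] for oh in range(OH)] for row in flat]


def tiled_im2col_conv2d(x, w, tile_rows):
    """Hypothetical tiling: unfold only the input rows one output tile needs."""
    KH = len(w[0][0])
    OH = len(x[0]) - KH + 1
    out = [[] for _ in w]
    for oh0 in range(0, OH, tile_rows):
        oh1 = min(oh0 + tile_rows, OH)
        # Output rows oh0..oh1-1 read input rows oh0..oh1+KH-2.
        slab = [chan[oh0:oh1 + KH - 1] for chan in x]
        part = im2col_conv2d(slab, w)
        for m in range(len(w)):
            out[m].extend(part[m])
    return out
```

The tiled variant produces the same output as the untiled one while its unfolded buffer holds only `tile_rows` output rows at a time, which is the general reason tiling can reduce packing overhead and keep working sets cache-resident. How tile sizes should be chosen for a given cache hierarchy is the subject of the poster and is not modeled here.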

Cited By

Wu D, Meng J, Zhu W, Deng M, Wang X, Luo T, Wahib M, Wei Y (2024). autoGEMM: Pushing the Limits of Irregular Matrix Multiplication on Arm Architectures. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. Online publication date: 17-Nov-2024. https://dl.acm.org/doi/10.1109/SC41406.2024.00027

Jayaweera M, Kong M, Wang Y, Kaeli D, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). Energy-Aware Tile Size Selection for Affine Programs on GPUs. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 13-27. Online publication date: 2-Mar-2024. https://dl.acm.org/doi/10.1109/CGO57630.2024.10444795


Information & Contributors

Information

Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques

October 2022

569 pages

ISBN:9781450398688

DOI:10.1145/3559009

General Chair: Andreas Kloeckner (University of Illinois)
Program Chair: José Moreira (IBM)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Poster

Conference

PACT '22

Sponsor:

SIGARCH

PACT '22: International Conference on Parallel Architectures and Compilation Techniques

October 8 - 12, 2022

Chicago, Illinois

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Total citations: 2
Total downloads: 87
Downloads (last 12 months): 28
Downloads (last 6 weeks): 7

Reflects downloads up to 23 Nov 2024
