DOI: 10.1145/3559009.3569674
Open access

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

Published: 27 January 2023

Abstract

The Convolutional Neural Network (CNN) kernel is a fundamental building block of deep learning and dominates the computational cost of deep-learning pipelines for image analysis. The synthesis of high-performance GPU kernels for CNNs is therefore of considerable interest. The current state of the art in optimizing CNN kernels is auto-tuning search using AutoTVM/Ansor, which has been shown to achieve higher performance than vendor libraries as well as polyhedral compilers. A primary reason general-purpose optimizing compilers fail to deliver high-performance code for key kernels like CNN is the difficulty of accurate performance modeling, which is needed to choose effectively among alternative transformations and/or parameter values such as tile sizes. In this paper we ask whether a domain-specific compiler customized for the important CNN kernel can be more effective. Our results show that it can be very effective, enabling even higher performance of the generated GPU code for CNNs than auto-tuning with TVM/Ansor. Further, we demonstrate the effectiveness of a performance modeling approach that integrates analytical modeling of data-movement volume with machine learning for offline training, enabling much more rapid code optimization than the TVM/Ansor approach, which constructs a machine learning model online to guide the auto-tuning search.
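
To make the contrast with online auto-tuning concrete, here is a minimal, hedged sketch of the general idea described in the abstract: an offline-trained regression model scores candidate tile-size configurations, with an analytical estimate of data-movement volume as one of its features. The layer shape, the volume formula, the stand-in training data, and the choice of scikit-learn's GradientBoostingRegressor are all illustrative assumptions, not the paper's actual model or code.

```python
# Minimal sketch (not the authors' implementation): rank candidate CNN tile
# sizes with an offline-trained model whose features include an analytical
# estimate of data-movement volume. All shapes and formulas are assumptions.
from itertools import product

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def data_movement_volume(tx, ty, tk, H=56, W=56, C=64, K=64, R=3, S=3):
    """Rough analytical estimate (in elements) of global-memory traffic for a
    convolution layer tiled into tx x ty x tk output tiles (assumed formula)."""
    inputs = (tx + R - 1) * (ty + S - 1) * C      # input halo read per tile
    weights = R * S * C * tk                      # filter slice read per tile
    outputs = tx * ty * tk                        # output tile written once
    num_tiles = (H / tx) * (W / ty) * (K / tk)    # tiles covering the output
    return num_tiles * (inputs + weights + outputs)


# Offline phase (done once): fit a model on hypothetical measurements.
rng = np.random.default_rng(0)
train_cfgs = list(product([2, 4, 7, 8, 14], repeat=3))
X_train = np.array([[tx, ty, tk, data_movement_volume(tx, ty, tk)]
                    for tx, ty, tk in train_cfgs])
# Stand-in for measured kernel times: analytical volume plus noise.
y_train = X_train[:, 3] * (1.0 + 0.1 * rng.standard_normal(len(X_train)))
model = GradientBoostingRegressor().fit(X_train, y_train)

# Compile time: score unseen candidates cheaply, keep the predicted-fastest few.
cand_cfgs = list(product([2, 4, 8, 16], repeat=3))
X_cand = np.array([[tx, ty, tk, data_movement_volume(tx, ty, tk)]
                   for tx, ty, tk in cand_cfgs])
top5 = np.argsort(model.predict(X_cand))[:5]
print([cand_cfgs[i] for i in top5])   # best tile-size guesses by predicted cost
```

The relevant design point, per the abstract, is that such a model is trained once offline, so at compile time only cheap predictions are needed, rather than constructing a cost model online while repeatedly measuring candidate kernels on the GPU as TVM/Ansor does.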


Cited By

  • (2024) Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 78-90. https://doi.org/10.1145/3656019.3676945 (online publication date: 14-Oct-2024)
  • (2024) Creating intelligent cyberinfrastructure for democratizing AI. AI Magazine 45(1), 22-28. https://doi.org/10.1002/aaai.12166 (online publication date: 10-Mar-2024)

Information

Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022
569 pages
ISBN:9781450398688
DOI:10.1145/3559009
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CNN
  2. GPU
  3. design space exploration
  4. performance modeling
  5. tile size optimization

Qualifiers

  • Research-article

Conference

PACT '22

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%

Article Metrics

  • Downloads (last 12 months): 230
  • Downloads (last 6 weeks): 31
Reflects downloads up to 15 Feb 2025

