DOI: 10.1145/3460945.3464955
Research article (Public Access)

Predictive data locality optimization for higher-order tensor computations

Published: 20 June 2021

Abstract

Automating locality optimization is still an open problem for compiler writers. Compiler-based approaches, guided by analytical cost models, have achieved some success in matching high-performance libraries on a restricted set of computations such as general matrix multiply (GEMM). Library-based approaches, on the other hand, raise scalability concerns: recent developments in convolutional neural networks have produced an explosion of models, each with a differing combination of parameters, and manually tuning each of these configurations can take many development months. Further, these operations are called many times during machine learning training, which necessitates highly optimized implementations. 2D convolution operators are unusual in that they consist of 7-deep loop nests with different loops carrying reuse for different tensors, making the problem of identifying an optimal loop ordering hard. We devise a machine learning-based compiler that learns a regression model correlating performance with loop order. We integrate this model with other traditional compiler analyses for transformations such as loop unrolling and vectorization, relying on the Multi-Level Intermediate Representation (MLIR) compiler framework. We achieve average speedups of 1.67x and 1.41x over oneDNN for 2D convolution forward and weight-update kernels, respectively. We also reach 0.88x and 0.96x the performance of oneDNN’s best-performing implementation, which applies additional data layout transformations.
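
To make the abstract's central observation concrete, here is a minimal sketch (not taken from the paper) of the canonical 7-deep loop nest for a 2D convolution forward pass. The function and tensor names (conv2d_forward, I, Wt, O) and the NCHW/KCRS layouts are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def conv2d_forward(I, Wt, stride=1):
    """Naive 2D convolution: input I[N,C,H,W], weights Wt[K,C,R,S] -> output O[N,K,P,Q]."""
    N, C, H, W = I.shape
    K, _, R, S = Wt.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    O = np.zeros((N, K, P, Q), dtype=I.dtype)
    # Different loops carry reuse for different tensors:
    #   n        reuses the weights Wt (Wt is independent of n)
    #   k        reuses the input I (I is independent of k)
    #   c, r, s  reuse the output O (reduction loops; O is independent of them)
    #   p, q     reuse Wt, and overlapping windows partially reuse I
    for n in range(N):                      # batch
        for k in range(K):                  # output channels
            for c in range(C):              # input channels (reduction)
                for p in range(P):          # output rows
                    for q in range(Q):      # output columns
                        for r in range(R):  # filter rows (reduction)
                            for s in range(S):  # filter columns (reduction)
                                O[n, k, p, q] += (I[n, c, p*stride + r, q*stride + s]
                                                  * Wt[k, c, r, s])
    return O
```

Because the only cross-iteration dependence is the associative accumulation into O, all seven loops are freely permutable, giving 7! = 5040 candidate orderings, each with a different locality profile for I, Wt, and O. The sketch below illustrates the abstract's high-level idea of scoring orderings with a learned regression model; the featurization (one-hot loop positions) and the model interface are assumptions for illustration, not the paper's actual design.

```python
from itertools import permutations
import numpy as np

LOOPS = ("n", "k", "c", "p", "q", "r", "s")

def featurize(order):
    # One-hot encode which loop occupies each nest depth (hypothetical featurization).
    return [1.0 if order[depth] == loop else 0.0
            for depth in range(len(LOOPS)) for loop in LOOPS]

def best_order(model):
    # model: any regressor with a predict() method (e.g. scikit-learn), trained
    # offline on (loop ordering, measured performance) pairs.
    candidates = list(permutations(LOOPS))      # 7! = 5040 orderings
    predicted = model.predict(np.array([featurize(o) for o in candidates]))
    return candidates[int(np.argmax(predicted))]  # ordering predicted fastest
```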

References

[1]
[n.d.]. oneDNN. https://github.com/oneapi-src/oneDNN. Accessed: 2021-04-01.
[2]
Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and et al. 2019. Learning to Optimize Halide with Tree Search and Random Programs. ACM Trans. Graph., 38, 4 (2019), Article 121, July, 12 pages. issn:0730-0301 https://doi.org/10.1145/3306346.3322967
[3]
F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I. Williams. 2006. Using machine learning to focus iterative optimization. In International Symposium on Code Generation and Optimization (CGO’06). 295–305. https://doi.org/10.1109/CGO.2006.37
[4]
L. Almagor, Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven W. Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman. 2004. Finding Effective Compilation Sequences. In Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES ’04). Association for Computing Machinery, New York, NY, USA. 231–239. isbn:1581138067 https://doi.org/10.1145/997163.997196
[5]
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. Code2Vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang., 3, POPL (2019), Article 40, Jan., 29 pages. issn:2475-1421 https://doi.org/10.1145/3290353
[6]
Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018. A Survey on Compiler Autotuning Using Machine Learning. ACM Comput. Surv., 51, 5 (2018), Article 96, Sept., 42 pages. issn:0360-0300 https://doi.org/10.1145/3197978
[7]
A. H. Ashouri, G. Mariani, G. Palermo, and C. Silvano. 2014. A Bayesian network approach for compiler auto-tuning for embedded processors. In 2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia). 90–97. https://doi.org/10.1109/ESTIMedia.2014.6962349
[8]
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. arXiv preprint arXiv:1805.08166.
[9]
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end deep learning of optimization heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 219–232.
[10]
M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE, 93, 2 (2005), Feb, 216–231. issn:1558-2256 https://doi.org/10.1109/JPROC.2004.840301
[11]
Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V. Bonilla, John Thomson, Christopher K. I. Williams, and Michael F. P. O’Boyle. 2011. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. Int. J. Parallel Program., 39, 3 (2011), 296–327. http://dblp.uni-trier.de/db/journals/ijpp/ijpp39.html#FursinKMCTNYMZCa11
[12]
Justin Gottschlich, Armando Solar-Lezama, Nesime Tatbul, Michael Carbin, Martin Rinard, Regina Barzilay, Saman Amarasinghe, Joshua B Tenenbaum, and Tim Mattson. 2018. The three pillars of machine programming. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 69–80.
[13]
Ameer Haj-Ali, Nesreen K Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. NeuroVectorizer: end-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 242–255.
[14]
Q. Huang, A. Haj-Ali, W. Moses, J. Xiang, I. Stoica, K. Asanovic, and J. Wawrzynek. 2019. AutoPhase: Compiler Phase-Ordering for HLS with Deep Reinforcement Learning. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 308–308. https://doi.org/10.1109/FCCM.2019.00049
[15]
C. Kartsaklis, O. Hernandez, C. Hsu, T. Ilsche, W. Joubert, and R. L. Graham. 2012. HERCULES: A Pattern Driven Code Transformation System. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum. 574–583. https://doi.org/10.1109/IPDPSW.2012.69
[16]
P. A. Kulkarni, D. B. Whalley, G. S. Tyson, and J. W. Davidson. 2006. Exhaustive optimization phase order space exploration. In International Symposium on Code Generation and Optimization (CGO’06). 306–318. https://doi.org/10.1109/CGO.2006.15
[17]
Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore’s Law. arXiv preprint arXiv:2002.11054.
[18]
Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, and P Sadayappan. 2021. Analytical Characterization and Design Space Exploration for Optimization of CNNs. arXiv preprint arXiv:2101.09808.
[19]
José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel, and Manuela Veloso. 2000. SPIRAL: Automatic Implementation of Signal Processing Algorithms. In High Performance Extreme Computing (HPEC).
[20]
Tharindu R. Patabandi, Anand Venkat, Rajkishore Barik, and Mary Hall. 2021. SWIRL++: Evaluating Performance Models to Guide Code Transformation in Convolutional Neural Networks. In Languages and Compilers for Parallel Computing, Santosh Pande and Vivek Sarkar (Eds.). Springer International Publishing, Cham. 108–126. isbn:978-3-030-72789-5 https://doi.org/10.1007/978-3-030-72789-5_9
[21]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. SIGPLAN Not., 48, 6 (2013), June, 519–530. issn:0362-1340 https://doi.org/10.1145/2499370.2462176
[22]
T. M. Smith, R. v. d. Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. V. Zee. 2014. Anatomy of High-Performance Many-Threaded Matrix Multiplication. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1049–1059. issn:1530-2075 https://doi.org/10.1109/IPDPS.2014.110
[23]
Y. N. Srikant and Priti Shankar. 2007. The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition (2nd ed.). CRC Press, Inc., USA. isbn:142004382X
[24]
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’11). Association for Computing Machinery, New York, NY, USA. 117–128. isbn:9781450307437 https://doi.org/10.1145/1989493.1989508
[25]
Kamen Yotov, Xiaoming Li, Gang Ren, María Jesús Garzarán, David A. Padua, Keshav Pingali, and Paul Stodghill. 2005. Is Search Really Necessary to Generate High-Performance BLAS? Proc. IEEE, 93, 2 (2005), 358–386. https://doi.org/10.1109/JPROC.2004.840444
[26]
Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 859–873.

Cited By

  • Design and Implementation of Deep Learning 2D Convolutions on Modern CPUs. IEEE Transactions on Parallel and Distributed Systems 34, 12 (Dec. 2023), 3104–3116. https://doi.org/10.1109/TPDS.2023.3322037

Published In

cover image ACM Conferences
MAPS 2021: Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming
June 2021
52 pages
ISBN:9781450384674
DOI:10.1145/3460945

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compilers
  2. Convolutional neural networks
  3. Loop transformations
  4. Machine learning

Conference

PLDI '21

Article Metrics

  • Downloads (Last 12 months): 99
  • Downloads (Last 6 weeks): 15

Reflects downloads up to 09 Nov 2024
