Research Article | Open Access | Just Accepted

(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms

Online AM: 22 May 2024

Abstract

Data-parallel computations, such as linear algebra routines (BLAS) and stencil computations, constitute one of the most relevant classes in parallel computing, e.g., due to their importance for deep learning. Efficiently de-composing such computations for the memory and core hierarchies of modern architectures and re-composing the computed intermediate results back to the final result – we say (de/re)-composition for short – is key to achieving high performance for these computations on, e.g., GPUs and CPUs. Current high-level approaches to generating data-parallel code are often restricted to a particular subclass of data-parallel computations and architectures (e.g., linear algebra routines on GPUs only, or stencil computations only), and/or they rely on a user-guided optimization process to find a well-performing (de/re)-composition of computations, which is complex and error-prone for the user.
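
To make (de/re)-composition concrete, the following minimal Python sketch (not the paper's actual code generator) de-composes a dot product into contiguous tiles, computes the tiles in parallel on a flat pool of CPU threads standing in for a core hierarchy, and re-composes the partial results; the names decompose and tiled_dot and the tile parameter are illustrative assumptions, not the paper's interface.

    # Minimal (de/re)-composition sketch: tile -> compute in parallel -> recombine.
    from concurrent.futures import ThreadPoolExecutor

    def dot(xs, ys):
        # scalar computation applied to one de-composed part of the input
        return sum(x * y for x, y in zip(xs, ys))

    def decompose(xs, tile):
        # de-composition: split the input into contiguous tiles (blocking)
        return [xs[i:i + tile] for i in range(0, len(xs), tile)]

    def tiled_dot(xs, ys, tile=4):
        x_tiles, y_tiles = decompose(xs, tile), decompose(ys, tile)
        with ThreadPoolExecutor() as pool:   # tiles computed in parallel
            partials = list(pool.map(dot, x_tiles, y_tiles))
        return sum(partials)                 # re-composition of partial results

    xs = list(range(10))
    assert tiled_dot(xs, xs) == dot(xs, xs)  # agrees with the un-tiled version

Because addition is associative, any tiling of the reduction yields the same final result, which is what makes such a de-composition correct regardless of the chosen tile size.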
We formally introduce a systematic (de/re)-composition approach based on the algebraic formalism of Multi-Dimensional Homomorphisms (MDHs). Our approach is designed to be general enough to apply to a wide range of data-parallel computations and to various kinds of target parallel architectures. To efficiently target the deep and complex memory and core hierarchies of contemporary architectures, we exploit our (de/re)-composition approach for a correct-by-construction, parametrized cache-blocking and parallelization strategy. We show that our approach is powerful enough to express, in the same formalism, the (de/re)-composition strategies of different classes of state-of-the-art approaches (scheduling-based, polyhedral, etc.), and we demonstrate that the parameters of our strategies enable systematically generating code that can be fully automatically optimized (auto-tuned) for the particular target architecture and the characteristics of the input and output data (e.g., their sizes and memory layouts). In particular, our experiments confirm that via auto-tuning, we achieve higher performance than state-of-the-art approaches, including hand-optimized solutions provided by vendors (such as NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN), on real-world data sets and for a variety of data-parallel computations, including linear algebra routines, stencil and quantum chemistry computations, data mining algorithms, and computations that have recently gained high attention due to their relevance for deep learning.
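
As a rough illustration of the formalism, matrix multiplication can be characterized in MDH style by a scalar function (multiplication) and per-dimension combine operators: concatenation in the two output dimensions i and j, and addition in the reduction dimension k. The sketch below expresses this with the tile sizes exposed as parameters of the kind an auto-tuner would select per target architecture; the function md_hom_matmul and its tiles parameter are hypothetical names for this example, and the matrix extents are assumed divisible by the tile sizes.

    # MDH-style, parametrized (de/re)-composition of matrix multiplication;
    # the tile sizes (TI, TJ, TK) are the tunable parameters.
    import numpy as np

    def md_hom_matmul(A, B, tiles=(2, 2, 2)):
        TI, TJ, TK = tiles
        I, K = A.shape
        J = B.shape[1]
        C = np.zeros((I, J))
        for i in range(0, I, TI):            # concatenation (++) in output dim i
            for j in range(0, J, TJ):        # concatenation (++) in output dim j
                for k in range(0, K, TK):    # addition (+) in reduction dim k
                    # tile-local computation; '+=' re-composes the partial
                    # results along the reduction dimension
                    C[i:i+TI, j:j+TJ] += A[i:i+TI, k:k+TK] @ B[k:k+TK, j:j+TJ]
        return C

    A, B = np.random.rand(4, 6), np.random.rand(6, 4)
    assert np.allclose(md_hom_matmul(A, B), A @ B)

A naive auto-tuner would simply time md_hom_matmul over a set of candidate tiles values on the target machine and keep the fastest, which is the role the paper assigns to fully automatic optimization (auto-tuning) of the strategy parameters.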



Published In

ACM Transactions on Programming Languages and Systems (Just Accepted)
EISSN: 1558-4593
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 22 May 2024
Accepted: 16 April 2024
Revised: 21 February 2024
Received: 17 October 2023

Author Tags

  1. code generation
  2. data parallelism
  3. auto-tuning
  4. GPU
  5. CPU
  6. OpenMP
  7. CUDA
  8. OpenCL
  9. linear algebra
  10. stencil computations
  11. quantum chemistry
  12. data mining
  13. deep learning

Qualifiers

  • Research-article

