Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Declarative Loop Tactics for Domain-specific Optimization

Published: 26 December 2019 Publication History

Abstract

Increasingly complex hardware makes the design of effective compilers difficult. To reduce this problem, we introduce Declarative Loop Tactics, which is a novel framework of composable program transformations based on an internal tree-like program representation of a polyhedral compiler. The framework is based on a declarative C++ API built around easy-to-program matchers and builders, which provide the foundation to develop loop optimization strategies. Using our matchers and builders, we express computational patterns and core building blocks, such as loop tiling, fusion, and data-layout transformations, and compose them into algorithm-specific optimizations. Declarative Loop Tactics (Loop Tactics for short) can be applied to many domains. For two of them, stencils and linear algebra, we show how developers can express sophisticated domain-specific optimizations as a set of composable transformations or calls to optimized libraries. By allowing developers to add highly customized optimizations for a given computational pattern, we expect our approach to reduce the need for DSLs and to extend the range of optimizations that can be performed by a current general-purpose compiler.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265--283.
[2]
Varun Agrawal, Abhiroop Dabral, Tapti Palit, Yongming Shen, and Michael Ferdman. 2015. Architectural support for dynamic linking. ACM SIGARCH Comput. Archit. News, Vol. 43. ACM, 691--702.
[3]
Manuel Arenaz, Juan Touriño, and Ramon Doallo. 2008. XARK: An extensible framework for automatic recognition of computational kernels. ACM Trans. Prog. Lang. Syst. 30, 6 (2008), 32.
[4]
Lénaïc Bagnères, Oleksandr Zinenko, Stéphane Huot, and Cédric Bastoul. 2016. Opening polyhedral compiler’s black box. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’16). ACM, New York, NY, 128--138.
[5]
William Blume, Rudolf Eigenmann, Keith Faigin, John Grout, Jay Hoeflinger, David Padua, Paul Petersen, William Pottenger, Lawrence Rauchwerger, Peng Tu, and Stephen Weatherford. 1995. Polaris: Improving the effectiveness of parallelizing compilers. In Languages and Compilers for Parallel Computing, Keshav Pingali, Utpal Banerjee, David Gelernter, Alex Nicolau, and David Padua (Eds.). Springer Berlin, 141--154.
[6]
U. Bondhugula, S. Dash, O. Gunluk, and L. Renganarayanan. 2010. A model for fusion and code motion in an automatic parallelizing compiler. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). 343--352.
[7]
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical and fully automatic polyhedral program optimization system. In Proceedings of the ACM SIGPLAN Symposium on Programming Language Design and Implementation (PLDI’08).
[8]
Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Not. 43, 6 (2008), 101--113.
[9]
Martin Bravenboer, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Visser. 2008. Stratego/XT 0.17. A language and toolset for program transformation. Sci. Comput. Prog. 72, 1 (2008), 52--70.
[10]
Chun Chen, Jacqueline Chame, and Mary Hall. 2008. CHiLL: A Framework for Composing High-level Loop Transformations. Technical Report. Citeseer. USC Computer Science.
[11]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578--594.
[12]
B. Di Martino and G. Iannello. 1996. PAP recognizer: A tool for automatic recognition of parallelizable patterns. In Proceedings of the 4th Workshop on Program Comprehension (WPC’96) 164--174.
[13]
Sebastien Donadio, James Brodman, Thomas Roeder, Kamen Yotov, Denis Barthou, Albert Cohen, María Jesús Garzarán, David Padua, and Keshav Pingali. 2006. A language for the compact representation of multiple program versions. In Languages and Compilers for Parallel Computing, Eduard Ayguadé, Gerald Baumgartner, J. Ramanujam, and P. Sadayappan (Eds.). Springer Berlin, 136--151.
[14]
Paul Feautrier and Christian Lengauer. 2011. Polyhedron Model. Springer, Boston, MA, 1581--1592.
[15]
Roman Gareev, Tobias Grosser, and Michael Kruse. 2018. High-performance generalized tensor operations: A compiler-oriented approach. ACM Trans. Archit. Code Optim. 15, 3, Article 34 (Sept. 2018), 27 pages.
[16]
Michael P. Gerlek, Eric Stoltz, and Michael Wolfe. 1995. Beyond induction variables: Detecting and classifying sequences using a demand-driven SSA form. ACM Trans. Prog. Lang. Syst. 17, 1 (Jan. 1995), 85--122.
[17]
Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. 2006. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. Int. J. Parallel Prog. 34, 3 (June 2006), 261--317.
[18]
Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid hexagonal/classical tiling for GPUs. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’14). ACM, New York, NY, Article 66, 10 pages.
[19]
Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly-Performing polyhedral optimizations on a low-level intermediate representation. Parallel Proc. Lett. 22, 4 (2012), 1250010.
[20]
Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop. 2015. Optimistic delinearization of parametrically sized arrays. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS’15). ACM, New York, NY, 351--360.
[21]
Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly-polyhedral optimization in LLVM. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT’11), Vol. 2011. 1.
[22]
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam, and P. Sadayappan. 2011. Data layout transformation for stencil computations on short-vector SIMD architectures. In Proceedings of the 20th International Conference on Compiler Construction: Part of the Joint European Conferences on Theory and Practice of Software (CC’11/ETAPS’11). Springer-Verlag, 225--245. http://dl.acm.org/citation.cfm?id=1987237.1987255.
[23]
Wayne Kelly and William Pugh. 1998. A Framework for Unifying Reordering Transformations. Technical Report. University of Maryland.
[24]
Christoph W. Kessler. 1996. Pattern-driven automatic parallelization. Sci. Prog. 5, 3 (Aug. 1996), 251--274.
[25]
C. W. Kessler and C. H. Smith. 1999. The SPARAMAT approach to automatic comprehension of sparse matrix computations. In Proceedings of the 7th International Workshop on Program Comprehension. 200--207.
[26]
Malik Khan, Protonu Basu, Gabe Rudy, Mary Hall, Chun Chen, and Jacqueline Chame. 2013. A script-based autotuning compiler system to generate high-performance CUDA code. ACM Trans. Archit. Code Optim. 9, 4, Article 31 (Jan. 2013), 25 pages.
[27]
Martin Kong and Louis-Noël Pouchet. 2019. Model-driven transformations for multi-and many-core CPUs. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 469--484.
[28]
W. Kozaczynski, J. Ning, and A. Engberts. 1992. Program concept recognition and transformation. IEEE Trans. Softw. Eng. 18, 12 (Dec. 1992), 1065--1075.
[29]
Michael Kruse and Hal Finkel. 2018. A proposal for loop-transformation pragmas. In Evolving OpenMP for Evolving Architectures, Bronis R. de Supinski, Pedro Valero-Lara, Xavier Martorell, Sergi Mateo Bellido, and Jesus Labarta (Eds.). Springer International Publishing, Cham, 37--52.
[30]
Sang-Ik Lee, Troy A. Johnson, and Rudolf Eigenmann. 2004. Cetus—An extensible compiler infrastructure for source-to-source transformation. In Languages and Compilers for Parallel Computing, Lawrence Rauchwerger (Ed.). Springer Berlin, 539--553.
[31]
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Orti. 2016. Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw. 43, 2, Article 12 (Aug. 2016), 18 pages.
[32]
S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. 2018. NVIDIA tensor core programmability, performance precision. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’18). 522--531.
[33]
Robert Metzger and Zhaofang Wen. 2000. Automatic Algorithm Recognition and Replacement: A New Approach to Program Optimization. The MIT Press.
[34]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. SIGARCH Comput. Archit. News 43, 1 (Mar. 2015), 429--443.
[35]
Shlomit S. Pinter and Ron Y. Pinter. 1994. Program optimization and parallelization using idioms. ACM Trans. Prog. Lang. Syst. 16, 3 (May 1994), 305--327.
[36]
Sebastian Pop, Albert Cohen, Cédric Bastoul, Sylvain Girbal, Georges-André Silber, and Nicolas Vasilache. 2006. GRAPHITE: Polyhedral analyses and optimizations for GCC. In Proceedings of the GCC Developers Summit. Citeseer, 2006.
[37]
Bill Pottenger and Rudolf Eigenmann. 1995. Idiom recognition in the Polaris parallelizing compiler. In Proceedings of the 9th International Conference on Supercomputing (ICS’95). ACM, New York, NY, 444--448.
[38]
William Pugh and David Wonnacott. 1994. Static analysis of upper and lower bounds on dependences and parallelism. ACM Trans. Prog. Lang. Syst. 16, 4 (1994), 1248--1278.
[39]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Not., Vol. 48. ACM, 519--530.
[40]
Gabe Rudy. 2010. CUDA-CHiLL: A Programming Language Interface for GPGPU Optimizations and Code Generation. Ph.D. Dissertation. School of Computing, University of Utah.
[41]
Amin Shafiee Sarvestani, Erik Hansson, and Christoph Kessler. 2013. Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. Int. J. Parallel Prog. 41, 6 (1 Dec. 2013), 806--824.
[42]
Gordon D. Smith. 1985. Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford University Press.
[43]
Toshio Suganuma, Hideaki Komatsu, and Toshio Nakatani. 1996. Detection and global optimization of reduction operations for distributed parallel machines. In Proceedings of the 10th International Conference on Supercomputing (ICS’96). ACM, New York, NY, 18--25.
[44]
Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Trans. Embed. Comput. Syst. 13, 4s, Article 134 (Apr. 2014), 25 pages.
[45]
Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: Generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th International Conference on Generative Programming: Concepts 8 Experiences (GPCE’13). ACM, New York, NY, 145--154.
[46]
Arvind K. Sujeeth, Tiark Rompf, Kevin J. Brown, HyoukJoong Lee, Hassan Chafi, Victoria Popic, Michael Wu, Aleksandar Prokopec, Vojin Jovanovic, Martin Odersky, and Kunle Olukotun. 2013. Composition and reuse with compiled domain-specific languages. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP’13), Giuseppe Castagna (Ed.). Springer Berlin, 52--78.
[47]
Allen Taflove and Susan C. Hagness. 2005. Computational Electrodynamics: The Finite-difference Time-domain Method. Artech House.
[48]
Thiago S. F. X. Teixeira, Corinne Ancourt, David Padua, and William Gropp. 2019. Locus: A system and a language for program optimization. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’19). IEEE Press, Piscataway, NJ, 217--228. http://dl.acm.org/citation.cfm?id=3314872.3314898.
[49]
Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR abs/1802.04730 (2018). arxiv:1802.04730 http://arxiv.org/abs/1802.04730.
[50]
Sven Verdoolaege. 2010. Isl: An integer set library for the polyhedral model. In Proceedings of the 3rd International Congress Conference on Mathematical Software (ICMS’10). Springer, Berlin, 299--302. http://dl.acm.org/citation.cfm?id=1888390.1888455.
[51]
Sven Verdoolaege. 2011. Counting affine calculator and applications. In Proceedings of the 1st International Workshop on Polyhedral Compilation Techniques (IMPACT’11).
[52]
Sven Verdoolaege and Tobias Grosser. 2012. Polyhedral extraction tool. In Proceedings of the 2nd International Workshop on Polyhedral Compilation Techniques (IMPACT’12).
[53]
Sven Verdoolaege, Serge Guelton, Tobias Grosser, and Albert Cohen. 2014. Schedule trees. In Proceedings of the 4th Workshop on Polyhedral Compilation Techniques (IMPACT’14, Associated with HiPEAC’14). 9.
[54]
Sven Verdoolaege and Alexandre Isoard. 2017. Consecutivity in the isl polyhedral scheduler. (2017).
[55]
Sven Verdoolaege and Gerda Janssens. 2017. Scheduling for PPCG. Report CW 706. Department of Computer Science, KU Leuven, Leuven, Belgium.
[56]
Qing Yi. 2012. POET: A scripting language for applying parameterized source-to-source program transformations. Softw. Pract. Exper. 42, 6 (June 2012), 675--706.
[57]
Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye. 2012. Alphaz: A system for design space exploration in the polyhedral model. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. Springer, 17--31.
[58]
Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako, Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling. In Proceedings of the 27th International Conference on Compiler Construction. ACM, 3--13.

Cited By

View all
  • (2024)Substitution of kernel functions based on pattern matching on schedule treesWorkshop Proceedings of the 53rd International Conference on Parallel Processing10.1145/3677333.3678152(48-57)Online publication date: 12-Aug-2024
  • (2022)Source Matching and Rewriting for MLIR Using String-Based AutomataACM Transactions on Architecture and Code Optimization10.1145/357128320:2(1-26)Online publication date: 18-Nov-2022
  • (2021)LoopOptProceedings of the 24th International Workshop on Software and Compilers for Embedded Systems10.1145/3493229.3493301(11-16)Online publication date: 1-Nov-2021
  • Show More Cited By

Index Terms

  1. Declarative Loop Tactics for Domain-specific Optimization

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 4
    December 2019
    572 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3366460
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 December 2019
    Accepted: 01 November 2019
    Revised: 01 October 2019
    Received: 01 August 2019
    Published in TACO Volume 16, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Loop tactics
    2. declarative loop optimizations
    3. polyhedral model

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)280
    • Downloads (Last 6 weeks)35
    Reflects downloads up to 25 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Substitution of kernel functions based on pattern matching on schedule treesWorkshop Proceedings of the 53rd International Conference on Parallel Processing10.1145/3677333.3678152(48-57)Online publication date: 12-Aug-2024
    • (2022)Source Matching and Rewriting for MLIR Using String-Based AutomataACM Transactions on Architecture and Code Optimization10.1145/357128320:2(1-26)Online publication date: 18-Nov-2022
    • (2021)LoopOptProceedings of the 24th International Workshop on Software and Compilers for Embedded Systems10.1145/3493229.3493301(11-16)Online publication date: 1-Nov-2021
    • (2021)Polyhedral-Based Compilation Framework for In-Memory Neural Network AcceleratorsACM Journal on Emerging Technologies in Computing Systems10.1145/346984718:1(1-23)Online publication date: 29-Sep-2021
    • (2021)Polygeist: Raising C to Polyhedral MLIRProceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques10.1109/PACT52795.2021.00011(45-59)Online publication date: 26-Sep-2021
    • (2021)Progressive raising in multi-level IRProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370332(15-26)Online publication date: 27-Feb-2021
    • (2021)UNITProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370330(77-89)Online publication date: 27-Feb-2021
    • (2021)MLIRProceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO51591.2021.9370308(2-14)Online publication date: 27-Feb-2021
    • (2020)TDO-CIMProceedings of the 23rd Conference on Design, Automation and Test in Europe10.5555/3408352.3408716(1602-1605)Online publication date: 9-Mar-2020
    • (2020)TDO-CIM: Transparent Detection and Offloading for Computation In-memory2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)10.23919/DATE48585.2020.9116464(1602-1605)Online publication date: Mar-2020
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media