Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Schedule Synthesis for Halide Pipelines through Reuse Analysis

Published: 18 April 2019 Publication History

Abstract

Efficient code generation for image processing applications continues to pose a challenge in a domain where high performance is often necessary to meet real-time constraints. The inherently complex structure found in most image-processing pipelines, the plethora of transformations that can be applied to optimize the performance of an implementation, as well as the interaction of these optimizations with locality, redundant computation and parallelism, can be indentified as the key reasons behind this issue. Recent domain-specific languages (DSL) such as the Halide DSL and compiler attempt to encourage high-level design-space exploration to facilitate the optimization process. We propose a novel optimization strategy that aims to maximize producer-consumer locality by exploiting reuse in image-processing pipelines. We implement our analysis as a tool that can be used alongside the Halide DSL to automatically generate schedules for pipelines implemented in Halide and test it on a variety of benchmarks. Experimental results on three different multi-core architectures show an average performance improvement of 40% over the Halide Auto-Scheduler and 75% over a state-of-the art approach that targets the PolyMage DSL.

References

[1]
Andrew Adams, Eino-Ville Talvala, Sung Hee Park, David E. Jacobs, Boris Ajdin, Natasha Gelfand, Jennifer Dolson, Daniel Vaquero, Jongmin Baek, Marius Tico, Hendrik P. A. Lensch, Wojciech Matusik, Kari Pulli, Mark Horowitz, and Marc Levoy. 2010. The Frankencamera: An experimental platform for computational photography. In ACM SIGGRAPH 2010 Papers (SIGGRAPH’10). ACM, New York, NY, Article 29, 12 pages.
[2]
Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, NY, 303--316.
[3]
Jiawen Chen, Sylvain Paris, and Frédo Durand. 2007. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph. 26 (2007), 103.
[4]
D. Cociorva, J. W. Wilkins, C. Lam, G. Baumgartner, J. Ramanujam, and P. Sadayappan. 2001. Loop optimization for a class of memory-constrained computations. In Proceedings of the 15th International Conference on Supercomputing (ICS’01). ACM, New York, NY, 103--113.
[5]
J. Darbon, A. Cunha, T. F. Chan, S. Osher, and G. J. Jensen. 2008. Fast nonlocal filtering applied to electron cryomicroscopy. In Proceedings of the 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. 1331--1334.
[6]
Zeev Farbman, Raanan Fattal, and Dani Lischinski. 2011. Convolution pyramids. In Proceedings of the 2011 SIGGRAPH Asia Conference (SA’11). ACM, New York, NY.
[7]
Halide. 2018. Halide GitHub Repository (MIT License). Retrieved from https://github.com/halide/Halide (commit 402171e7a4dfacb0bd93297cbdfb600a325fe745).
[8]
Chris Harris and Mike Stephens. 1988. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference. 147--151.
[9]
Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS’12). ACM, New York, NY, 311--320.
[10]
Abhinav Jangda and Uday Bondhugula. 2018. An effective fusion and tile size model for optimizing image processing pipelines. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18). ACM, New York, NY, 261--275.
[11]
Ken Kennedy and Kathryn S. McKinley. 1994. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Languages and Compilers for Parallel Computing, Utpal Banerjee, David Gelernter, Alex Nicolau, and David Padua (Eds.). Springer, Berlin, 301--320.
[12]
J. Kim, J. K. Lee, and K. M. Lee. 2016. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 1646--1654.
[13]
Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. SIGPLAN Not. 42, 6 (Jun. 2007), 235--244.
[14]
Naraig Manjikian and Tarek S. Abdelrahman. 1997. Fusion of loops for parallelism and locality. IEEE Trans. Parallel Distrib. Syst. 8, 2 (Feb. 1997), 193--209.
[15]
Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4, Article 83 (Jul. 2016), 11 pages.
[16]
Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic optimization for image processing pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15). ACM, New York, NY, 429--443.
[17]
Catherine Olschanowsky, Michelle Mills Strout, Stephen Guzik, John Loffeld, and Jeffrey Hittinger. 2014. A study on balancing parallelism, data locality, and recomputation in existing PDE solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). IEEE Press, Los Alamitos, CA, 793--804.
[18]
Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. 2011. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. In ACM SIGGRAPH 2011 Papers (SIGGRAPH’11). ACM, New York, NY, Article 68, 12 pages.
[19]
Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos. 2008. Iterative optimization in the polyhedral model: Part Ii, multidimensional time. SIGPLAN Not. 43, 6 (Jun. 2008), 90--100.
[20]
PolyMage project. 2016. PolyMage Repository (Apache 2.0 License 2016). Retrieved from https://bitbucket.org/udayb/polymage (commit 0ff0b46456605a5579db09c6ef98cb247dd2131d).
[21]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’13). ACM, New York, NY, 519--530.
[22]
C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. 2011. Fast cost-volume filtering for visual correspondence and beyond. In Proceedings of the Annual Conference on Computer Vision and Pattern Recognition (CVPR’11). 3017--3024.
[23]
Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers. 2018. Loop transformations leveraging hardware prefetching. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO’18). ACM, New York, NY, 254--264.
[24]
Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. http://arxiv.org/abs/1802.04730
[25]
Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI’91). ACM, New York, NY, 30--44.
[26]
Jingling Xue. 2005. Aggressive loop fusion for improving locality and parallelism. In Proceedings of the 3rd International Conference on Parallel and Distributed Processing and Applications (ISPA’05). Springer-Verlag, Berlin, 224--238.
[27]
Qing Yi and Ken Kennedy. 2004. Improving memory hierarchy performance through combined loop interchange and multi-level fusion. Int. J. High Perf. Comput. Appl. 18, 2 (2004), 237--253.
[28]
Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 207--218.

Cited By

View all
  • (2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
  • (2022)ACM Transactions on Graphics10.1145/3528223.353012541:4(1-24)Online publication date: 22-Jul-2022
  • (2022)A graph neural network-based performance model for deep learning applicationsProceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming10.1145/3520312.3534863(11-20)Online publication date: 13-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 2
June 2019
317 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3325131
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2019
Accepted: 01 January 2019
Revised: 01 December 2018
Received: 01 July 2018
Published in TACO Volume 16, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Halide
  2. Loop optimizations
  3. compiler optimizations
  4. image-processing pipelines

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)172
  • Downloads (Last 6 weeks)23
Reflects downloads up to 23 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for HalideIEEE Access10.1109/ACCESS.2023.334566012(7563-7583)Online publication date: 2024
  • (2022)ACM Transactions on Graphics10.1145/3528223.353012541:4(1-24)Online publication date: 22-Jul-2022
  • (2022)A graph neural network-based performance model for deep learning applicationsProceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming10.1145/3520312.3534863(11-20)Online publication date: 13-Jun-2022
  • (2021)Efficient automatic scheduling of imaging and vision pipelines for the GPUProceedings of the ACM on Programming Languages10.1145/34854865:OOPSLA(1-28)Online publication date: 20-Oct-2021
  • (2021)Thallo – Scheduling for High-Performance Large-Scale Non-Linear Least-Squares SolversACM Transactions on Graphics10.1145/345398640:5(1-14)Online publication date: 24-Sep-2021
  • (2020)Schedule Synthesis for Halide Pipelines on GPUsACM Transactions on Architecture and Code Optimization10.1145/340611717:3(1-25)Online publication date: 3-Aug-2020
  • (2019)Learning to optimize halide with tree search and random programsACM Transactions on Graphics10.1145/3306346.332296738:4(1-12)Online publication date: 12-Jul-2019

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media