research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Authors:

Jonathan Ragan-Kelley,

Connelly Barnes,

Saman AmarasingheAuthors Info & Claims

PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 519 - 530

https://doi.org/10.1145/2491956.2462176

Published: 16 June 2013 Publication History

Abstract

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.

We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

References

[1]

A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010.

Digital Library

[2]

J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009.

Digital Library

[3]

M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.

[4]

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004.

Digital Library

[5]

J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007.

Digital Library

[6]

CoreImage. Apple CoreImage programming guide, 2006.

[7]

J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009.

Digital Library

[8]

C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.

[9]

K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006.

Digital Library

[10]

M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005.

Digital Library

[11]

M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.

Digital Library

[12]

P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.

[13]

J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012.

Digital Library

[14]

IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.

[15]

S. Kamil, C. Chan, L. Oliker, J. Shalf, and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.

[16]

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007.

Digital Library

[17]

J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.

[18]

R. Moore. Interval Analysis. 1966.

[19]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010.

Digital Library

[20]

OpenMP. OpenMP. http://openmp.org/.

[21]

S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009.

Digital Library

[22]

S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011.

Digital Library

[23]

PixelBender. Adobe PixelBender reference, 2010.

[24]

H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991.

Digital Library

[25]

M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.

[26]

J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012.

Digital Library

[27]

M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994.

Digital Library

[28]

Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011.

Digital Library

[29]

W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002.

Digital Library

[30]

P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989.

Digital Library

[31]

X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.

Cited By

Blaszynski PPalkowski MBielecki WPoliwoda M(2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
https://doi.org/10.3390/math12050728
Prasad ARajendra SRajan KGovindarajan RBondhugula UWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695958
Liu YXue YCheng YMa LMiao ZXue JHuang JWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695955(505-521)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695955
Show More Cited By

Index Terms

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
PLDI '13

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. ...
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights
- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
Abstract
GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
PolyMage: Automatic Optimization for Image Processing Pipelines
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

This paper presents the design and implementation of PolyMage, a domain-specific language and compiler for image processing pipelines. An image processing pipeline can be viewed as a graph of interconnected stages which process images successively. Each ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2013

546 pages

ISBN:9781450320146

DOI:10.1145/2491956

General Chair:
Hans-J. Boehm
HP Labs
,
Program Chair:
Cormac Flanagan
University of California, Santa Cruz

ACM SIGPLAN Notices Volume 48, Issue 6
PLDI '13
June 2013
515 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2499370
Issue’s Table of Contents

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PLDI '13

Sponsor:

SIGPLAN

PLDI '13: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 16 - 19, 2013

Washington, Seattle, USA

Acceptance Rates

PLDI '13 Paper Acceptance Rate 46 of 267 submissions, 17%;

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

923
Total Citations
View Citations
4,197
Total Downloads

Downloads (Last 12 months)654
Downloads (Last 6 weeks)111

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Blaszynski PPalkowski MBielecki WPoliwoda M(2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
https://doi.org/10.3390/math12050728
Prasad ARajendra SRajan KGovindarajan RBondhugula UWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695958
Liu YXue YCheng YMa LMiao ZXue JHuang JWitchel EArpaci-Dusseau ARossbach CKeeton K(2024)Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695955(505-521)Online publication date: 4-Nov-2024
https://dl.acm.org/doi/10.1145/3694715.3695955
Kim CLi PMohan AButt ASampson ANigam R(2024)Unifying Static and Dynamic Intermediate Languages for Accelerator GeneratorsProceedings of the ACM on Programming Languages10.1145/36897908:OOPSLA2(2242-2267)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689790
Zhou CQian BGo GZhang QLi SJiang Y(2024)PolyJuice: Detecting Mis-compilation Bugs in Tensor Compilers with Equality Saturation Based RewritingProceedings of the ACM on Programming Languages10.1145/36897578:OOPSLA2(1309-1335)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689757
Root AYan BLiu PGyurgyik CBik AKjolstad F(2024)Compilation of Shape Operators on Sparse ArraysProceedings of the ACM on Programming Languages10.1145/36897528:OOPSLA2(1162-1188)Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689752
Lin JIkarashi YBernstein GMcCann J(2024)UFO Instruction Graphs Are Machine KnittableACM Transactions on Graphics10.1145/368794843:6(1-22)Online publication date: 19-Dec-2024
https://dl.acm.org/doi/10.1145/3687948
Rasch A(2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
https://dl.acm.org/doi/10.1145/3665643
Zhang GHsu OKjolstad F(2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656426
Chen HZhang NXiang SZeng ZDai MZhang Z(2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656401
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents