Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2491956.2462176acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Published: 16 June 2013 Publication History

Abstract

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values.
We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD, and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.

References

[1]
A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph., 29(4), 2010.
[2]
J. Ansel, C. Chan, Y. L.Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In ACM Programming Language Design and Implementation, 2009.
[3]
M. Aubry, S. Paris, S. W. Hasinoff, J. Kautz, and F. Durand. Fast and robust pyramid-based image processing. Technical Report MIT-CSAILTR- 2011-049, Massachusetts Institute of Technology, 2011.
[4]
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH, 2004.
[5]
J. Chen, S. Paris, and F. Durand. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph., 26(3), 2007.
[6]
CoreImage. Apple CoreImage programming guide, 2006.
[7]
J. L. T. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti. High-performance SIMT code generation in an active visual effects library. In Conf. on Computing Frontiers, 2009.
[8]
C. Elliott. Functional image synthesis. In Proceedings of Bridges, 2001.
[9]
K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: programming the memory hierarchy. In ACM/IEEE conference on Supercomputing, 2006.
[10]
M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, 2005.
[11]
M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In International Conf. on Architectural Support for Programming Languages and Operating Systems, 2002.
[12]
P. L. Guernic, A. Benveniste, P. Bournai, and T. Gautier. Signal -- A data flow-oriented language for signal processing. IEEE Transactionsn on Acoustics, Speech and Signal Processing, 34(2):362--374, 1986.
[13]
J. Holewinski, L. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS, 2012.
[14]
IPP. Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/.
[15]
S. Kamil, C. Chan, L. Oliker, J. Shalf, and S.Williams. An auto-tuning framework for parallel multicore stencil computations. In IPDPS, 2010.
[16]
S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, 2007.
[17]
J. Meng and K. Skadron. A performance study for iterative stencil loops on gpus with ghost zone optimizations. In IJPP, 2011.
[18]
R. Moore. Interval Analysis. 1966.
[19]
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 3.5-d blocking optimization for stencil computations on modern cpus and gpus. In Supercomputing, 2010.
[20]
OpenMP. OpenMP. http://openmp.org/.
[21]
S. Paris, P. Kornprobst, J. Tumblin, and F. Durand. Bilateral filtering: Theory and applications. Foundations and Trends in Computer Graphics and Vision, 2009.
[22]
S. Paris, S. W. Hasinoff, and J. Kautz. Local Laplacian filters: Edgeaware image processing with a Laplacian pyramid. ACM Trans. Graph., 30(4), 2011.
[23]
PixelBender. Adobe PixelBender reference, 2010.
[24]
H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. Thesis, Carnegie Mellon University, 1991.
[25]
M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE, volume 93, 2005.
[26]
J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph., 31(4), 2012.
[27]
M. A. Shantzis. A model for efficient and flexible image computing. In ACM SIGGRAPH, 1994.
[28]
Y. Tang, R. Chowdhury, B. Kuszmaul, C.-K. Luk, and C. Leiserson. The Pochoir stencil compiler. In SPAA, 2011.
[29]
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, 2002.
[30]
P.-S. Tseng. A Parallelizing Compiler for Disributed Memory Parallel Computers. PhD thesis, Carnegie Mellon University, 1989.
[31]
X. Zhou, J.-P. Giacalone, M. J. Garzarán, R. H. Kuhn, Y. Ni, and D. Padua. Hierarchical overlapped tiling.

Cited By

View all
  • (2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
  • (2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
  • (2024)Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695955(505-521)Online publication date: 4-Nov-2024
  • Show More Cited By

Index Terms

  1. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2013
    546 pages
    ISBN:9781450320146
    DOI:10.1145/2491956
    • cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 48, Issue 6
      PLDI '13
      June 2013
      515 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2499370
      Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 June 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. autotuning
    2. compiler
    3. domain specific language
    4. gpu
    5. image processing
    6. locality
    7. optimization
    8. parallelism
    9. redundant computation
    10. vectorization

    Qualifiers

    • Research-article

    Conference

    PLDI '13
    Sponsor:

    Acceptance Rates

    PLDI '13 Paper Acceptance Rate 46 of 267 submissions, 17%;
    Overall Acceptance Rate 406 of 2,067 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)654
    • Downloads (Last 6 weeks)111
    Reflects downloads up to 20 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Efficiency of Various Tiling Strategies for the Zuker Algorithm OptimizationMathematics10.3390/math1205072812:5(728)Online publication date: 29-Feb-2024
    • (2024)SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree InferenceProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695958(488-504)Online publication date: 4-Nov-2024
    • (2024)Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695955(505-521)Online publication date: 4-Nov-2024
    • (2024)Unifying Static and Dynamic Intermediate Languages for Accelerator GeneratorsProceedings of the ACM on Programming Languages10.1145/36897908:OOPSLA2(2242-2267)Online publication date: 8-Oct-2024
    • (2024)PolyJuice: Detecting Mis-compilation Bugs in Tensor Compilers with Equality Saturation Based RewritingProceedings of the ACM on Programming Languages10.1145/36897578:OOPSLA2(1309-1335)Online publication date: 8-Oct-2024
    • (2024)Compilation of Shape Operators on Sparse ArraysProceedings of the ACM on Programming Languages10.1145/36897528:OOPSLA2(1162-1188)Online publication date: 8-Oct-2024
    • (2024)UFO Instruction Graphs Are Machine KnittableACM Transactions on Graphics10.1145/368794843:6(1-22)Online publication date: 19-Dec-2024
    • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
    • (2024)Compilation of Modular and General Sparse WorkspacesProceedings of the ACM on Programming Languages10.1145/36564268:PLDI(1213-1238)Online publication date: 20-Jun-2024
    • (2024)Allo: A Programming Model for Composable Accelerator DesignProceedings of the ACM on Programming Languages10.1145/36564018:PLDI(593-620)Online publication date: 20-Jun-2024
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media