Darkroom: compiling high-level image processing code into hardware pipelines

Authors: John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Artem Vasilyev, Pat Hanrahan

ACM Transactions on Graphics (TOG), Volume 33, Issue 4

Article No.: 144, Pages 1 - 11

https://doi.org/10.1145/2601097.2601174

Published: 27 July 2014

Abstract

Specialized image signal processors (ISPs) exploit the structure of image processing pipelines to minimize memory bandwidth using the architectural pattern of line-buffering, where all intermediate data between stages is stored in small on-chip buffers. This provides high energy efficiency, allowing long pipelines with tera-op/sec. image processing in battery-powered devices, but traditionally requires painstaking manual design in hardware. Based on this pattern, we present Darkroom, a language and compiler for image processing. The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM. We formulate the problem of optimally scheduling line-buffered pipelines to minimize buffering as an integer linear program. Finally, given an optimally scheduled pipeline, Darkroom synthesizes hardware descriptions for ASIC or FPGA, or fast CPU code. We evaluate Darkroom implementations of a range of applications, including a camera pipeline, low-level feature detection algorithms, and deblurring. For many applications, we demonstrate gigapixel/sec. performance in under 0.5 mm² of ASIC silicon at 250 mW (simulated on a 45 nm foundry process), real-time 1080p/60 video processing using a fraction of the resources of a modern FPGA, and tens of megapixels/sec. of throughput on a quad-core x86 processor.
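The line-buffering pattern described in the abstract can be illustrated with a small software sketch. The Python below is not Darkroom code and does not reflect the compiler's output; it is a minimal illustration, assuming a hypothetical two-stage pipeline of 3-row vertical averaging stencils. The names `vertical_mean`, `line_buffered_stage`, `pipeline`, and `reference` are invented for this example. Each stage keeps only the last three rows it has seen, so the intermediate image between the two stages is never stored as a whole frame.

```python
from collections import deque

def vertical_mean(rows):
    """Element-wise mean of a small stack of rows (a vertical stencil)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def line_buffered_stage(stream, taps=3):
    """Consume rows one at a time, holding only `taps` rows at once."""
    buf = deque(maxlen=taps)   # the line buffer: a few rows, never a full frame
    for row in stream:
        buf.append(row)        # the oldest row is evicted automatically
        if len(buf) == taps:   # enough history to evaluate the stencil
            yield vertical_mean(buf)

def pipeline(image_rows):
    """Two chained stages; the intermediate image is never materialized."""
    return line_buffered_stage(line_buffered_stage(iter(image_rows)))

def reference(image_rows):
    """Same computation with whole-frame intermediates, for comparison."""
    def stage(img):
        return [vertical_mean(img[i:i + 3]) for i in range(len(img) - 2)]
    return stage(stage(list(image_rows)))

# Hypothetical 8x4 test image whose pixel value is its raster index.
image = [[float(r * 4 + c) for c in range(4)] for r in range(8)]
streamed = list(pipeline(image))
print(streamed == reference(image))  # both schedules compute the same rows
```

In this sketch the working set per stage is three rows regardless of image height, which is the property that lets a hardware pipeline keep intermediates in small on-chip buffers instead of DRAM. How Darkroom sizes and schedules those buffers is covered by the paper's integer linear program and is not modeled here.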

Supplementary Material

ZIP File (a144-hegarty.zip): Supplemental material. 374.13 MB

MP4 File (a144-sidebyside.mp4): 25.31 MB

Published In

ACM Transactions on Graphics Volume 33, Issue 4

July 2014

1366 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/2601097

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
