Article

Compiling for stream processing

Authors:

William J. Dally,

Peter MattsonAuthors Info & Claims

PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques

Pages 33 - 42

https://doi.org/10.1145/1152154.1152164

Published: 16 September 2006 Publication History

Abstract

This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. Our compiler uses information about the program structure and estimates of kernel and memory operation execution times to overlap kernel execution with memory transfers, maximizing performance, and to optimize use of scarce on-chip memory, significantly reducing external memory bandwidth. Our compiler applies optimizations such as strip-mining, loop unrolling, and software pipelining, at the level of kernels and stream memory operations. We evaluate the performance of our compiler on a suite of media and scientific benchmarks. Our results show that compiler management of on-chip storage reduces external memory bandwidth by 35% to 93% and reduces execution time by 23% to 72% compared to cachelike LRU management of the same storage. We show that strip-mining stream applications enables producer-consumer locality to be captured in on-chip storage reducing external bandwidth by 50% to 80%. We also evaluate the sensitivity of performance to the scheduling methods used and to critical resources. Overall, our compiler is able to overlap memory operations and manage local storage so that 78% to 96% of program execution time is spent in running computational kernels.

References

[1]

The tendra open source project. http://www.tendra.org.

[2]

T. Adam, K. Chandy, and J. Dickson. A comparison of list scheduling for parallel processing systems. J. Communications of the ACM, vol. 17, no. 17, pages 685--690, December 1974.

Digital Library

[3]

J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das. Evaluating the imagine stream architecture. In Proceedings of the Annual International Symposium on Computer Architecture, June 2004.

Digital Library

[4]

ClearSpeed. CSX600 datasheet. http://www.clearspeed.com/downloads/CSX600Processor.pdf, 2005.

[5]

T. F. Coleman and Y. Li. An interior trust region approach for nonlinear minimization subject to bounds. SIAM Journal on Optimization, pages 418--445, 1996.

[6]

T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1989.

Digital Library

[7]

W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J. H. Ahn, N. Jayasena, U. Kapasi, A. Das, J. Gummaraju, and I. Buck. Merrimac: Supercomputing with streams. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 35, Phoenix, Arizona, 2003.

Digital Library

[8]

M. Erez, J. H. Ahn, A. Garg, W. J. Dally, and E. Darve. Analysis and performance results of a molecular modeling application on merrimac. In Proceedings of the SC'04 Conference, Pittsburgh, Pennsylvaniva, November, 2004.

Digital Library

[9]

J. Fabri. Automatic storage optimization. In Proceedings of the ACM SIGPLAN 1979 Symposium on Compiler Construction, pages 83--91, 1979.

Digital Library

[10]

J. Gergov. Algorithms for compile-time memory optimization. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 907--908, 1999.

Digital Library

[11]

D. R. Kaiser. Loop Optimization Techniques on Multi-Issue Architectures. PhD thesis, University of Michigan, 1994.

Digital Library

[12]

T. Kanade, A. Yoshida, K. Oda, H. Kano, and M. Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In Proceedings of the 15th Computer Vision and Pattern Recognition Conference, pages 196--202, San Francisco, CA, June 18--20, 1996.

Digital Library

[13]

H. Kasahara and S. Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Transactions on Computers, no. 11, pages 1023--1029, November 1984.

[14]

B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang. Imagine: Media processing with streams. IEEE Micro, pages 35--46, Mar/Apr 2001.

Digital Library

[15]

M. Lam. Software pipelining: An effective scheduling technique for vliw machines. In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI), pages 318--328, 1988.

Digital Library

[16]

P. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, 2001.

Digital Library

[17]

P. Mattson, W. J. Dally, S. Rixner, U. J. Kapasi, and J. D. Owens. Communication scheduling. In Proceedings of the 9th international conference on Architectural support for programming languages and operating systems(ASPLOS), volume 35, pages 82--92, Cambridge, MA, November 2000.

Digital Library

[18]

J. D. Owens. Computer Graphics on a Stream Architecture. PhD thesis, Stanford University, 2002.

[19]

D. Pham, S. A. M. Bollinger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D.Stasiak, M. S. an M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first generation cell processor. In ISSCC Digest of Technical Papers, San Francisco, CA, Feb, 2005.

[20]

S. Rixner. Stream Processor Architecture. Kluwer Academic Publishers, Boston, MA, 2001.

Digital Library

[21]

D. van der Spoel et al. Gromacs user manual version 3.1. http://www.gromacs.org, 2001.

[22]

M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing 89, pages 655--664, New York, NY, November 1989.

Digital Library

Cited By

Zhang NLacouture RSohn GMure PZhang QKjolstad FOlukotun K(2024)The Dataflow Abstract Machine Simulator Framework2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00046(532-547)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00046
Nguyen QSanchez D(2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071026
Nguyen QSanchez D(2020)Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO50266.2020.00056(596-608)Online publication date: Oct-2020
https://doi.org/10.1109/MICRO50266.2020.00056
Show More Cited By

Index Terms

Compiling for stream processing
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Specialized application languages

Recommendations

Exploiting loop-dependent stream reuse for stream processors
PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted ...
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we ...
Exploiting the reuse supplied by loop-dependent stream references for stream processors

Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques

September 2006

308 pages

ISBN:159593264X

DOI:10.1145/1152154

General Chair:
Erik Altman
IBM Research, USA
,
Program Chairs:
Kevin Skadron
University of Virginia, USA
,
Ben Zorn
Microsoft Research, USA

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 September 2006

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

PACT06

Sponsor:

ACM

PACT06: 2006 International Conference on Parallel Architectures and Compilation Techniques

September 16 - 20, 2006

Washington, Seattle, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

70
Total Citations
View Citations
1,963
Total Downloads

Downloads (Last 12 months)29
Downloads (Last 6 weeks)4

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang NLacouture RSohn GMure PZhang QKjolstad FOlukotun K(2024)The Dataflow Abstract Machine Simulator Framework2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00046(532-547)Online publication date: 29-Jun-2024
https://doi.org/10.1109/ISCA59077.2024.00046
Nguyen QSanchez D(2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10071026
Nguyen QSanchez D(2020)Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO50266.2020.00056(596-608)Online publication date: Oct-2020
https://doi.org/10.1109/MICRO50266.2020.00056
Sano KNakahara H(2018)Hardware AlgorithmsPrinciples and Structures of FPGAs10.1007/978-981-13-0824-6_6(137-177)Online publication date: 4-Sep-2018
https://doi.org/10.1007/978-981-13-0824-6_6
Jiang GLi ZWang FWei SNapoli CSalapura VFranke HHou R(2015)Scheduling stream programs with improving arithmetic unit usage on NoC-based VLIW multi-core architecturesProceedings of the 12th ACM International Conference on Computing Frontiers10.1145/2742854.2742872(1-8)Online publication date: 6-May-2015
https://dl.acm.org/doi/10.1145/2742854.2742872
Kruliš MBednárek DFalt ZYaghob JZavoral F(2015)Towards Semi-automated Parallelization of Data Stream ProcessingIntelligent Distributed Computing IX10.1007/978-3-319-25017-5_22(235-245)Online publication date: 18-Oct-2015
https://doi.org/10.1007/978-3-319-25017-5_22
Falt ZBednárek DKruliš MYaghob JZavoral FPlale BRipeanu MCappello FXu D(2014)BobolangProceedings of the 23rd international symposium on High-performance parallel and distributed computing10.1145/2600212.2600711(311-314)Online publication date: 23-Jun-2014
https://dl.acm.org/doi/10.1145/2600212.2600711
Jiang GLi ZWang FWei S(2014)A High-Utilization Scheduling Schemeof Stream Programs on ClusteredVLIW Stream ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2013.8025:4(840-850)Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1109/TPDS.2013.80
Zhang YLi GSun CZhou HWang F(2014)Exploiting the Inter-cluster Record Reuse for Stream ProcessorsProceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS)10.1109/HPCC.2014.159(916-921)Online publication date: 20-Aug-2014
https://dl.acm.org/doi/10.1109/HPCC.2014.159
Navarro AAsenjo RCorbera FDios AZapata E(2014)A case study of different task implementations for multioutput stages in non-trivial parallel pipeline applicationsParallel Computing10.1016/j.parco.2014.05.00340:8(374-393)Online publication date: Aug-2014
https://doi.org/10.1016/j.parco.2014.05.003
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents