article

Free access

Software pipelining

Authors:

Vicki H. Allan,

Reese B. Jones,

Randall M. Lee,

Stephen J. AllanAuthors Info & Claims

ACM Computing Surveys (CSUR), Volume 27, Issue 3

Pages 367 - 432

https://doi.org/10.1145/212094.212131

Published: 01 September 1995 Publication History

Abstract

Utilizing parallelism at the instruction level is an important way to improve performance. Because the time spent in loop execution dominates total execution time, a large body of optimizations focuses on decreasing the time to execute each iteration. Software pipelining is a technique that reforms the loop so that a faster execution rate is realized. Iterations are executed in overlapped fashion to increase parallelism.

Let {ABC}ⁿ represent a loop containing operations A, B, C that is executed n times. Although the operations of a single iteration can be parallelized, more parallelism may be achieved if the entire loop is considered rather than a single iteration. The software pipelining transformation utilizes the fact that a loop {ABC}ⁿ is equivalent to A{BCA}ⁿ⁻¹BC. Although the operations contained in the loop do not change, the operations are from different iterations of the original loop.

Various algorithms for software pipelining exist. A comparison of the alternative methods for software pipelining is presented. The relationships between the methods are explored and possibilities for improvement highlighted.

References

[1]

AHo, A V, SETm, R, AND ULLMAN, J D 1988 Compilers. Principles, Technzques, and Tools. Addison-Wesley, Reading, MA

[2]

AIKEN, A. 1988 Compaction-based parallehzation. Corne}l Umvermty, Dept of Computer Science, Ph D. Thesis. Ithaca, NY

[3]

AIKEN, A. AND NiCOL^U, A 1988a. A development enmronment for horizontal mmrocode IEEE Trans. Softu'. Eng. 14, 5 (May), 584 594

[4]

AIKEN, A AND NICOLAU, A. 1988b Optimal loop parallehzatlon in Proceedings o/the SIGPLAN '88 Conference on Programrnzng Language Design and Implementatton (Atlanta, CA, June), 308 317

[5]

AIKEN, A. AND NICOLAU, A. 1988e. Perfect pipehnmg: A new loop optimization techmque. In Proceedings of the 1988 European Symposzum on Programming, Springer Verlag Lecture Notes in Computer Sczence, #300 (Atlanta, CA, March), 221 235

[6]

AIKEN, A. ANB NICOLAU, A. 1990 A realistic resource-constrained software pipehnmg algorithm Tech. Rep. RJ 7743, IBM Research Division, San Jose, CA, October

[7]

ALLAN, V. H. m'~D O'NEILL, M.R. 1994. Software pipehning: A genetic algorithm approach. In PACT 94 (Montreal, Canada, Aug. 23 26). North-Holland, 2Mnsterdam, 311 314.

[8]

AI,LAN, V. H, RAJAGOPALAN, M, AND LEE, R M. 1993. Software plpelining- Petri net pacemaker. In Working Conference on Architectures and Cornp~latmn Techniques for Fine and Medium Gram Parallelism (Orlando, FL, Jan 0 22). North-Holland, Amsterdam, }5-26

[9]

BANERJEE, U., SHEN, S., KUCK, D J., AND TOWLE, R. A 1979 Time and parallel precentor bounds for Fortran-like loops. IEEE Trans Comput C-28, 9 (Sept), 660 670

[10]

BEAT5~, S.J. 1991. Instruction scheduling using genetic algorithms. Colorado State University, Fort Collins, CO, Ph.D Thesis.

[11]

BECK, G. R., YEN, D W L., ~ND ANDERSON, T L 1993 The Cydra 5 minicomputer: Architecture and implementation. In ft. Supercomput (May), 143-180

[12]

BRETEIgNiTZ. M., Jig. 1991. Architecture synthesis of }ugh-performance application-specific processors. Carnegie Mellon University, Pittsburgh, PA, Ph D Thesis.

[13]

CHANDY, K. M. AND KESSELMAN, C. 1991. Parallel programming in 2001. IEEE Softw. 8, 6 (Nov.), 11 20.

[14]

DAVIDSON, E. S. 1971. The design and control of pipelined function generators. In Proceedings of the 1971 Internatmnal IEEE Conference on Systems, Networks, and Computers' (Oaxtepec, Mexico, Jan.), 19-21.

[15]

DEHNERT, J. C., HSU, P. Y.-T., ANn BRATT, J. P. 1989. Overlapped loop support in the Cydra-5. In Proceedings of the Thtrd International Conference on Architectural Support for Programruing Languages and Operating Systems (Boston, MA, April), 26 38.

[16]

DEHNERT, J. C. AND TOWLE, R.A. 1993. Compiling foc the Cydra-5. In d. Supercomput. (May), 181 228.

[17]

EBCiO~LU, K. 1987. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the 20th Microprogramming Workshop (MICRO-20) (Colorado Springs, CO, Dec.), 69-79.

[18]

EBCIO~LU, K. AND NAKATANI, T. 1990. A new compilation technique for parallel(zing loops with unpredictable branches on a VLIW architecture. In Languages and Compilers for Parallel Computing, D. Gelernter, Ed., MIT Press, Cambridge, MA, 213-229.

[19]

FERRANTE, J., OTTENSTEIN, K. J., AND WARREN, J. n. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July), 319-349.

[20]

FISHER, J. 1981. Trace scheduling: A technique for global microcode compaction. {EEE Trans. Comput. C-30, 7 (July), 478 490.

[21]

FREUDENBERGER, S. M., GROSS, T. R., AND LOWNEY, P.G. 1994. Avoidance and suppression of compensation code in a trace scheduler ACM Trans. Program. Lang. Syst. 16, 4 (July), 1156 1214.

[22]

GAO, G. R., WONG, W.-B., AND NING, Q. 1991a. A timed Petri-net model for fine-grain loop scheduling. In Proceedings of the ACM SIG- PLAN '91 Conference on Programming Language Design and lmplementatmn (June 26 28), 204-218.

[23]

GAO, G. R., WONO, W.-B., AND NING, Q. 1991b. A timed Petri-net model for fine-grain loop scheduling. Tech. Rep. ACAPS Technical Memo 18, School of Computer Science, McGill University, Montreal, Canada, H3A 2A7, January.

[24]

GOViNDARAJAN, R., ALTMAN, E. R., AND GAG, G. R. 1994. Minimizing register requirements under resource constrained rate-optimal software pipelining. In Proceedings of the 27th Annual International Symposium on Microarehitecture, to appear.

[25]

Hsu, P. Y. T. 1986. Highly concurrent scalar processing. University of Illinois, Urbana- Champaign, Urbana-Champaign, IL, Ph.D. Thesis.

[26]

Hsu, P. Y. T. AND DAVIDSON, E.S. 1986. Highly concurrent scalar processing. In Proceedings of the Thirteenth Annual International Symposium on Computer Architecture, 386-395.

[27]

HUFF, R. A. 1993. Lifetime-sensitive modulo scheduling. In Conference Record of SIGPLAN Programming Language and Design Implementation (Albuquerque, N.M., June). ACM, New York, 258-267.

[28]

JONES, R.B. 1991. Constrained software pipelining. Utah State University, Dept. of Computer Science, Logan, UT, Master's Thesis, Sept.

[29]

JONES, R. B. AND ALLAN, V. H. 1990. Software pipelining: A comparison and improvement. In Proceedings of the 23rd International Symposium and Workshop on Mzcroprogrammmg and Mlcroarchztecture (MICRO-23) (Orlando, FL, Nov. 27-29). IEEE Computer Society Press, 46-56.

[30]

KUCK, D. J. AND PADUA, D.A. 1979. High-speed muir(processors and their compilers. In Proceedings of the 1979 International Conference on Parallel Processing, 5-16.

[31]

LAM, M. S. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (Atlanta, GA, June), 318-328.

[32]

LAM, M. S. 1987. A systolic array optimizing compiler. Carnegie Mellon University, Dept. of Computer Science, Pittsburgh, PA, Ph.D. Thesis.

[33]

MAHLKE, S. A., LIN, D. C., CHEN, W. Y., HANK, R. E., AND BRINGMANN, a. h. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25) (Portland, OR, Dec. i 4), 45-54.

[34]

MATETI, P. ANU DEO, N. 1976. On algorithms for enumerating all circuits of a graph. SlAM J. Comput. 5, 1, 990-999.

[35]

NAKATANL T. AND EBCIOSLU, K. 1990. Using a lookahead window in compaction-based parallel(zing compfier. In Proceedings of the 23rd Microprogramming Workshop (MICRO-23) (Orlando, FL, Nov.). IEEE Computer Society Press, 57 68.

[36]

NAKATANI, T. AND EBCIO~LU, K. 1989. "Combining'' as a compilation technique for VLIW architectures. In Proceedings of the 22nd Microprogramming Workshop (MICRO-22) (Dublin, Ireland, Aug.), ACM, New York, 43 55.

[37]

NICOLAU, A. AND POTASMAN, R. 1990. An environment for the development of microcode for pipelined architectures. In Proceedings of the 23rd Symposium and Workshop on Microprogramming and Microarchitecture (Orlando, FL, Nov.), 69-79.

[38]

O'NEILL, M. R. 1994. Software pipelining with stochastic search algorithms. Utah State University, Dept. of Computer Science, Logan, UT, Master's Thesis.

[39]

RAJAGOPALAN, M. AND ALLAN, V.H. 1994. Specification of software pipelinmg using Petri nets Int. J. Parallel Process 3, 22, 279-307.

[40]

RAU, B. R 1994 Iterative modulo scheduhng' An algorithm for software plpelined loops In Proceedmgs of Micro-27, The 27th Annual Internatmnal Symposium on Mzcroarchttecture (San Jose, CA, Nov 29-Dee. 2). ACM, New York, 63 74

[41]

RAU, B. R. AND FISHER, J.A. 1993. Instructmnlevel parallel processing: History, overview, and perspective, d. Supercomputmg 7, 9-50.

[42]

RA~r, B. R., LEE, M, TiRUMALAi, P. P., AND SCltLmXlSKER, M. S 1992. Register allocation for modulo scheduled loops' Strategies, algorithms, and heuristics, in Proceedings of the ACM SIGPLAN '92 Conference on Programmmg Language Deszgn and {mplementatmn (San Francisco, CA, June), ACM, New York, 283 299.

[43]

RAiL B. R., SCHLANSKER, M S, AND TIRUMALAI, P. P 1992. Code generation schema for modulo scheduled loops. In Proceedings of Micro-25, The 25th Annual International Symposzum on Microarchztecture (Dec.). IEEE Computer Society Press, 158-169

[44]

RAU, B. R., YEN, D. W L., YEN, W., AND TOWLE, R.A. 1989. The Cydra 5 departmental supercomputer: Demgn philosophies, deciaons, and trade-offs IEEE Comput. (Jan.), 12-25.

[45]

RAU, B. R. AND GLAESER, C. D. 1981 Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing In Proceedings of the Fourteenth Annual Workshop of Microprogramming (Oct.), 183 198

[46]

RAU, B. R., GLAESER, C. D., AND GREENWALT, E. M. 1982. Architectural support for the efficient generation of code for horizontal architectures. In Proceedings of the Symposzum on Architectural Support ~or Programmzng Languages and Operatzng Systems (March), 96-99.

[47]

SMrrH, H. F. 1987. Data Structures Form and Function Harcourt Brace Jovanovich, San Diego, CA

[48]

Su, B., DING, S., WANG, J., AND XIA, J. 1987. GURPR--A method for global software pipelining. In Proceedings of the 20th Mmroprogramming Workshop (MICRO-20) (Colorado Springs, CO, Dec ), 97 105.

[49]

Su, B, D~NG, S., AND XIA, J. 1986. URPR An extension of URCR for software pipehnmg. In Proceedings of the 19th M~croprogramnnng Workshop (MICRO-19) (New York, Oct.), 104-108.

[50]

TAR JAN, R. 1972. Depth-first search and linear graph algorithms. SIAM d. Comput. 1, 2 (June), 146-160

[51]

TmRNAN, J. C. 1970. An efficient search algorithm to find the elementary circuits of a graph. Commun. ACM 13, 1, 722-726.

[52]

TmUMALAi, P., LEE, M., AND SCHLANSK~R, M. S. 1990. Parallelization of loops with exits on pipelined architectures. In Proceedings of SuperComputzng '90 (Amsterdam, Nov.), ACM, New York, 200-212

[53]

T()KORO, M, TAMURA, E., TAKASE, K., AND TAMARU, K. 1977. An approach to microprogTam optimization considering resource occupancy and instruction fornmts. In Proceedings of the lOth Annual Workshop on Mwroprogrammmg (Niagara Falls, NY, Nov.), 92-108.

[54]

VEGDAHL, S. R. 1992. A dynamic-programming technique for compacting loops. In Proceedings the 25th Annual Internattonal Symposiura on Microarchltecture (MICRO-25) (Port}and, OR, Dec ). IEEE Computer Society Press, 180-188

[55]

VEGDAHL, S.R. 1982. Local code generation and compaction in optimizing microcode compilers. CarnegSe-Mellon University, Dept. of Computer Science, Pittsburgh, PA, Ph.D. Thesis

[56]

WARTER, N. J., HAAB, O. E., AND BOCKHAUS, J. W. 1992 Enhanced modulo scheduling for loops with conditional branches In Proceedings of the 25th Annual Internatmnal Symposzum on Microarchitecture (MICRO-25) (Portland, OR, Dec. I 4), IEEE Computer Society Press, 170 179

[57]

WARTER, N. J., M^HLKE, S. A, Hwu, W. M W., AND RAU, B. R 1993. Reverse if-conversion, in PLDI (Albuquerque, NM, June). ACM, New York, 290 299

[58]

WOLFE, M.J. 1990. Tzny--A Loop Restructuring Tool, User Manual Oregon Graduate Institute of Science and Technology, Beaverton, OR.

[59]

WOLFE, M J 1989 Optzm~zmg Supercompders for Supercomputers MIT Press, Cambridge, MA.

[60]

Wool), W.G. 1979. Global optimization of'microprograms through modular control constructs. In Proceedings of the 12th Mzcroprogrammzng Workshop IMICRO-12) (Hershey, PA, Nov.), I 6.

[61]

ZAKY, A. M. 1989. Effiemnt static scheduhng of loops on synchronous multiproeessors. Ohio State University, Dept. of Computer and Information Science, Columbus, OH, Ph.D. Thesis.

[62]

ZIMA, H. AND CHAPMAN, B. 1991. Supercompzlers for Parallel and Vector Computers ACM, New York.

Cited By

Arai MFukumoto NMurai H(2024)Introducing software pipelining for the A64FX processor into LLVMProceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops10.1145/3636480.3637093(1-6)Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3636480.3637093
Lin ZMiao YXu GLi CSaarikivi OMaleki SYang F(2024)Efficient Schedule Construction for Distributed Execution of Large DNN ModelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.346691335:12(2375-2391)Online publication date: Dec-2024
https://doi.org/10.1109/TPDS.2024.3466913
Lin ZMiao YXu GLi CSaarikivi OMaleki SYang F(2024)Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00067(803-816)Online publication date: 2-Mar-2024
https://doi.org/10.1109/HPCA57654.2024.00067
Show More Cited By

Index Terms

Software pipelining
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Concurrent programming languages

Recommendations

Single-dimension software pipelining for multidimensional loops

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. This paper proposes a three-step approach, called single-dimension software pipelining (SSP), to software pipeline ...
Software pipelining: a comparison and improvement
MICRO 23: Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture

Software pipelining can significantly increase the execution rate of loops. Each of the four major software pipelining algorithms takes a different approach to software pipelining. This paper discusses each method and explores some of the similarities ...
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
LCTES '01: Proceedings of the ACM SIGPLAN workshop on Languages, compilers and tools for embedded systems

The TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To achieve peak performance, this VLIW architecture relies heavily on software pipelining. Traditionally, software pipelining has been restricted to regular (FOR) ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys

ACM Computing Surveys Volume 27, Issue 3

Sept. 1995

166 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/212094

Issue’s Table of Contents

Copyright © 1995 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 1995

Published in CSUR Volume 27, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

282
Total Citations
View Citations
3,891
Total Downloads

Downloads (Last 12 months)389
Downloads (Last 6 weeks)38

Reflects downloads up to 10 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Arai MFukumoto NMurai H(2024)Introducing software pipelining for the A64FX processor into LLVMProceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops10.1145/3636480.3637093(1-6)Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3636480.3637093
Lin ZMiao YXu GLi CSaarikivi OMaleki SYang F(2024)Efficient Schedule Construction for Distributed Execution of Large DNN ModelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.346691335:12(2375-2391)Online publication date: Dec-2024
https://doi.org/10.1109/TPDS.2024.3466913
Lin ZMiao YXu GLi CSaarikivi OMaleki SYang F(2024)Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00067(803-816)Online publication date: 2-Mar-2024
https://doi.org/10.1109/HPCA57654.2024.00067
Birath BErnstsson ATinnerholm JKessler C(2024)High-Level Programming of FPGA-Accelerated Systems with Parallel PatternsInternational Journal of Parallel Programming10.1007/s10766-024-00770-352:4(253-273)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1007/s10766-024-00770-3
Shi BDai TZou SWei XLuo G(2024)ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRAEuro-Par 2024: Parallel Processing10.1007/978-3-031-69577-3_5(61-76)Online publication date: 26-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-69577-3_5
Raskar SMonsalve Diaz JApplencourt TKumaran KGao GVieira MCardellini VDi Marco ATuma P(2023)Implementation of Dataflow Software Pipelining for Codelet ModelProceedings of the 2023 ACM/SPEC International Conference on Performance Engineering10.1145/3578244.3583734(161-172)Online publication date: 15-Apr-2023
https://dl.acm.org/doi/10.1145/3578244.3583734
Brignone GLazarescu MLavagno L(2023)A DSP shared is a DSP earned: HLS Task-Level Multi-Pumping for High-Performance Low-Resource Designs2023 IEEE 41st International Conference on Computer Design (ICCD)10.1109/ICCD58817.2023.00089(551-557)Online publication date: 6-Nov-2023
https://doi.org/10.1109/ICCD58817.2023.00089
Ebcioglu KSan I(2022)Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS CloudACM Transactions on Reconfigurable Technology and Systems10.1145/350769815:4(1-42)Online publication date: 8-Aug-2022
https://dl.acm.org/doi/10.1145/3507698
Yuan BZhu JMan XMa ZYin SWei SLiu L(2022)Dynamic-II Pipeline: Compiling Loops With Irregular Branches on Static-Scheduling CGRAIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.312134641:9(2929-2942)Online publication date: Sep-2022
https://doi.org/10.1109/TCAD.2021.3121346
Yang JLong XMa ZYang C(2022)An Automatic Pipeline Parallel Acceleration Framework for Neural Network Models on Heterogeneous Computing Platforms2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI)10.1109/PRAI55851.2022.9904009(1154-1158)Online publication date: 19-Aug-2022
https://doi.org/10.1109/PRAI55851.2022.9904009
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents