Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

Software pipelining

Published: 01 September 1995 Publication History

Abstract

Utilizing parallelism at the instruction level is an important way to improve performance. Because the time spent in loop execution dominates total execution time, a large body of optimizations focuses on decreasing the time to execute each iteration. Software pipelining is a technique that reforms the loop so that a faster execution rate is realized. Iterations are executed in overlapped fashion to increase parallelism.
Let {ABC}n represent a loop containing operations A, B, C that is executed n times. Although the operations of a single iteration can be parallelized, more parallelism may be achieved if the entire loop is considered rather than a single iteration. The software pipelining transformation utilizes the fact that a loop {ABC}n is equivalent to A{BCA}n−1BC. Although the operations contained in the loop do not change, the operations are from different iterations of the original loop.
Various algorithms for software pipelining exist. A comparison of the alternative methods for software pipelining is presented. The relationships between the methods are explored and possibilities for improvement highlighted.

References

[1]
AHo, A V, SETm, R, AND ULLMAN, J D 1988 Compilers. Principles, Technzques, and Tools. Addison-Wesley, Reading, MA
[2]
AIKEN, A. 1988 Compaction-based parallehzation. Corne}l Umvermty, Dept of Computer Science, Ph D. Thesis. Ithaca, NY
[3]
AIKEN, A. AND NiCOL^U, A 1988a. A development enmronment for horizontal mmrocode IEEE Trans. Softu'. Eng. 14, 5 (May), 584 594
[4]
AIKEN, A AND NICOLAU, A. 1988b Optimal loop parallehzatlon in Proceedings o/the SIGPLAN '88 Conference on Programrnzng Language Design and Implementatton (Atlanta, CA, June), 308 317
[5]
AIKEN, A. AND NICOLAU, A. 1988e. Perfect pipehnmg: A new loop optimization techmque. In Proceedings of the 1988 European Symposzum on Programming, Springer Verlag Lecture Notes in Computer Sczence, #300 (Atlanta, CA, March), 221 235
[6]
AIKEN, A. ANB NICOLAU, A. 1990 A realistic resource-constrained software pipehnmg algorithm Tech. Rep. RJ 7743, IBM Research Division, San Jose, CA, October
[7]
ALLAN, V. H. m'~D O'NEILL, M.R. 1994. Software pipehning: A genetic algorithm approach. In PACT 94 (Montreal, Canada, Aug. 23 26). North-Holland, 2Mnsterdam, 311 314.
[8]
AI,LAN, V. H, RAJAGOPALAN, M, AND LEE, R M. 1993. Software plpelining- Petri net pacemaker. In Working Conference on Architectures and Cornp~latmn Techniques for Fine and Medium Gram Parallelism (Orlando, FL, Jan 0 22). North-Holland, Amsterdam, }5-26
[9]
BANERJEE, U., SHEN, S., KUCK, D J., AND TOWLE, R. A 1979 Time and parallel precentor bounds for Fortran-like loops. IEEE Trans Comput C-28, 9 (Sept), 660 670
[10]
BEAT5~, S.J. 1991. Instruction scheduling using genetic algorithms. Colorado State University, Fort Collins, CO, Ph.D Thesis.
[11]
BECK, G. R., YEN, D W L., ~ND ANDERSON, T L 1993 The Cydra 5 minicomputer: Architecture and implementation. In ft. Supercomput (May), 143-180
[12]
BRETEIgNiTZ. M., Jig. 1991. Architecture synthesis of }ugh-performance application-specific processors. Carnegie Mellon University, Pittsburgh, PA, Ph D Thesis.
[13]
CHANDY, K. M. AND KESSELMAN, C. 1991. Parallel programming in 2001. IEEE Softw. 8, 6 (Nov.), 11 20.
[14]
DAVIDSON, E. S. 1971. The design and control of pipelined function generators. In Proceedings of the 1971 Internatmnal IEEE Conference on Systems, Networks, and Computers' (Oaxtepec, Mexico, Jan.), 19-21.
[15]
DEHNERT, J. C., HSU, P. Y.-T., ANn BRATT, J. P. 1989. Overlapped loop support in the Cydra-5. In Proceedings of the Thtrd International Conference on Architectural Support for Programruing Languages and Operating Systems (Boston, MA, April), 26 38.
[16]
DEHNERT, J. C. AND TOWLE, R.A. 1993. Compiling foc the Cydra-5. In d. Supercomput. (May), 181 228.
[17]
EBCiO~LU, K. 1987. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the 20th Microprogramming Workshop (MICRO-20) (Colorado Springs, CO, Dec.), 69-79.
[18]
EBCIO~LU, K. AND NAKATANI, T. 1990. A new compilation technique for parallel(zing loops with unpredictable branches on a VLIW architecture. In Languages and Compilers for Parallel Computing, D. Gelernter, Ed., MIT Press, Cambridge, MA, 213-229.
[19]
FERRANTE, J., OTTENSTEIN, K. J., AND WARREN, J. n. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July), 319-349.
[20]
FISHER, J. 1981. Trace scheduling: A technique for global microcode compaction. {EEE Trans. Comput. C-30, 7 (July), 478 490.
[21]
FREUDENBERGER, S. M., GROSS, T. R., AND LOWNEY, P.G. 1994. Avoidance and suppression of compensation code in a trace scheduler ACM Trans. Program. Lang. Syst. 16, 4 (July), 1156 1214.
[22]
GAO, G. R., WONG, W.-B., AND NING, Q. 1991a. A timed Petri-net model for fine-grain loop scheduling. In Proceedings of the ACM SIG- PLAN '91 Conference on Programming Language Design and lmplementatmn (June 26 28), 204-218.
[23]
GAO, G. R., WONO, W.-B., AND NING, Q. 1991b. A timed Petri-net model for fine-grain loop scheduling. Tech. Rep. ACAPS Technical Memo 18, School of Computer Science, McGill University, Montreal, Canada, H3A 2A7, January.
[24]
GOViNDARAJAN, R., ALTMAN, E. R., AND GAG, G. R. 1994. Minimizing register requirements under resource constrained rate-optimal software pipelining. In Proceedings of the 27th Annual International Symposium on Microarehitecture, to appear.
[25]
Hsu, P. Y. T. 1986. Highly concurrent scalar processing. University of Illinois, Urbana- Champaign, Urbana-Champaign, IL, Ph.D. Thesis.
[26]
Hsu, P. Y. T. AND DAVIDSON, E.S. 1986. Highly concurrent scalar processing. In Proceedings of the Thirteenth Annual International Symposium on Computer Architecture, 386-395.
[27]
HUFF, R. A. 1993. Lifetime-sensitive modulo scheduling. In Conference Record of SIGPLAN Programming Language and Design Implementation (Albuquerque, N.M., June). ACM, New York, 258-267.
[28]
JONES, R.B. 1991. Constrained software pipelining. Utah State University, Dept. of Computer Science, Logan, UT, Master's Thesis, Sept.
[29]
JONES, R. B. AND ALLAN, V. H. 1990. Software pipelining: A comparison and improvement. In Proceedings of the 23rd International Symposium and Workshop on Mzcroprogrammmg and Mlcroarchztecture (MICRO-23) (Orlando, FL, Nov. 27-29). IEEE Computer Society Press, 46-56.
[30]
KUCK, D. J. AND PADUA, D.A. 1979. High-speed muir(processors and their compilers. In Proceedings of the 1979 International Conference on Parallel Processing, 5-16.
[31]
LAM, M. S. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (Atlanta, GA, June), 318-328.
[32]
LAM, M. S. 1987. A systolic array optimizing compiler. Carnegie Mellon University, Dept. of Computer Science, Pittsburgh, PA, Ph.D. Thesis.
[33]
MAHLKE, S. A., LIN, D. C., CHEN, W. Y., HANK, R. E., AND BRINGMANN, a. h. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25) (Portland, OR, Dec. i 4), 45-54.
[34]
MATETI, P. ANU DEO, N. 1976. On algorithms for enumerating all circuits of a graph. SlAM J. Comput. 5, 1, 990-999.
[35]
NAKATANL T. AND EBCIOSLU, K. 1990. Using a lookahead window in compaction-based parallel(zing compfier. In Proceedings of the 23rd Microprogramming Workshop (MICRO-23) (Orlando, FL, Nov.). IEEE Computer Society Press, 57 68.
[36]
NAKATANI, T. AND EBCIO~LU, K. 1989. "Combining'' as a compilation technique for VLIW architectures. In Proceedings of the 22nd Microprogramming Workshop (MICRO-22) (Dublin, Ireland, Aug.), ACM, New York, 43 55.
[37]
NICOLAU, A. AND POTASMAN, R. 1990. An environment for the development of microcode for pipelined architectures. In Proceedings of the 23rd Symposium and Workshop on Microprogramming and Microarchitecture (Orlando, FL, Nov.), 69-79.
[38]
O'NEILL, M. R. 1994. Software pipelining with stochastic search algorithms. Utah State University, Dept. of Computer Science, Logan, UT, Master's Thesis.
[39]
RAJAGOPALAN, M. AND ALLAN, V.H. 1994. Specification of software pipelinmg using Petri nets Int. J. Parallel Process 3, 22, 279-307.
[40]
RAU, B. R 1994 Iterative modulo scheduhng' An algorithm for software plpelined loops In Proceedmgs of Micro-27, The 27th Annual Internatmnal Symposium on Mzcroarchttecture (San Jose, CA, Nov 29-Dee. 2). ACM, New York, 63 74
[41]
RAU, B. R. AND FISHER, J.A. 1993. Instructmnlevel parallel processing: History, overview, and perspective, d. Supercomputmg 7, 9-50.
[42]
RA~r, B. R., LEE, M, TiRUMALAi, P. P., AND SCltLmXlSKER, M. S 1992. Register allocation for modulo scheduled loops' Strategies, algorithms, and heuristics, in Proceedings of the ACM SIGPLAN '92 Conference on Programmmg Language Deszgn and {mplementatmn (San Francisco, CA, June), ACM, New York, 283 299.
[43]
RAiL B. R., SCHLANSKER, M S, AND TIRUMALAI, P. P 1992. Code generation schema for modulo scheduled loops. In Proceedings of Micro-25, The 25th Annual International Symposzum on Microarchztecture (Dec.). IEEE Computer Society Press, 158-169
[44]
RAU, B. R., YEN, D. W L., YEN, W., AND TOWLE, R.A. 1989. The Cydra 5 departmental supercomputer: Demgn philosophies, deciaons, and trade-offs IEEE Comput. (Jan.), 12-25.
[45]
RAU, B. R. AND GLAESER, C. D. 1981 Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing In Proceedings of the Fourteenth Annual Workshop of Microprogramming (Oct.), 183 198
[46]
RAU, B. R., GLAESER, C. D., AND GREENWALT, E. M. 1982. Architectural support for the efficient generation of code for horizontal architectures. In Proceedings of the Symposzum on Architectural Support ~or Programmzng Languages and Operatzng Systems (March), 96-99.
[47]
SMrrH, H. F. 1987. Data Structures Form and Function Harcourt Brace Jovanovich, San Diego, CA
[48]
Su, B., DING, S., WANG, J., AND XIA, J. 1987. GURPR--A method for global software pipelining. In Proceedings of the 20th Mmroprogramming Workshop (MICRO-20) (Colorado Springs, CO, Dec ), 97 105.
[49]
Su, B, D~NG, S., AND XIA, J. 1986. URPR An extension of URCR for software pipehnmg. In Proceedings of the 19th M~croprogramnnng Workshop (MICRO-19) (New York, Oct.), 104-108.
[50]
TAR JAN, R. 1972. Depth-first search and linear graph algorithms. SIAM d. Comput. 1, 2 (June), 146-160
[51]
TmRNAN, J. C. 1970. An efficient search algorithm to find the elementary circuits of a graph. Commun. ACM 13, 1, 722-726.
[52]
TmUMALAi, P., LEE, M., AND SCHLANSK~R, M. S. 1990. Parallelization of loops with exits on pipelined architectures. In Proceedings of SuperComputzng '90 (Amsterdam, Nov.), ACM, New York, 200-212
[53]
T()KORO, M, TAMURA, E., TAKASE, K., AND TAMARU, K. 1977. An approach to microprogTam optimization considering resource occupancy and instruction fornmts. In Proceedings of the lOth Annual Workshop on Mwroprogrammmg (Niagara Falls, NY, Nov.), 92-108.
[54]
VEGDAHL, S. R. 1992. A dynamic-programming technique for compacting loops. In Proceedings the 25th Annual Internattonal Symposiura on Microarchltecture (MICRO-25) (Port}and, OR, Dec ). IEEE Computer Society Press, 180-188
[55]
VEGDAHL, S.R. 1982. Local code generation and compaction in optimizing microcode compilers. CarnegSe-Mellon University, Dept. of Computer Science, Pittsburgh, PA, Ph.D. Thesis
[56]
WARTER, N. J., HAAB, O. E., AND BOCKHAUS, J. W. 1992 Enhanced modulo scheduling for loops with conditional branches In Proceedings of the 25th Annual Internatmnal Symposzum on Microarchitecture (MICRO-25) (Portland, OR, Dec. I 4), IEEE Computer Society Press, 170 179
[57]
WARTER, N. J., M^HLKE, S. A, Hwu, W. M W., AND RAU, B. R 1993. Reverse if-conversion, in PLDI (Albuquerque, NM, June). ACM, New York, 290 299
[58]
WOLFE, M.J. 1990. Tzny--A Loop Restructuring Tool, User Manual Oregon Graduate Institute of Science and Technology, Beaverton, OR.
[59]
WOLFE, M J 1989 Optzm~zmg Supercompders for Supercomputers MIT Press, Cambridge, MA.
[60]
Wool), W.G. 1979. Global optimization of'microprograms through modular control constructs. In Proceedings of the 12th Mzcroprogrammzng Workshop IMICRO-12) (Hershey, PA, Nov.), I 6.
[61]
ZAKY, A. M. 1989. Effiemnt static scheduhng of loops on synchronous multiproeessors. Ohio State University, Dept. of Computer and Information Science, Columbus, OH, Ph.D. Thesis.
[62]
ZIMA, H. AND CHAPMAN, B. 1991. Supercompzlers for Parallel and Vector Computers ACM, New York.

Cited By

View all
  • (2024)Introducing software pipelining for the A64FX processor into LLVMProceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops10.1145/3636480.3637093(1-6)Online publication date: 11-Jan-2024
  • (2024)Efficient Schedule Construction for Distributed Execution of Large DNN ModelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.346691335:12(2375-2391)Online publication date: Dec-2024
  • (2024)Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00067(803-816)Online publication date: 2-Mar-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 27, Issue 3
Sept. 1995
166 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/212094
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 1995
Published in CSUR Volume 27, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. instruction level parallelism
  2. loop reconstruction
  3. optimization
  4. software pipelining

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)389
  • Downloads (Last 6 weeks)38
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Introducing software pipelining for the A64FX processor into LLVMProceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops10.1145/3636480.3637093(1-6)Online publication date: 11-Jan-2024
  • (2024)Efficient Schedule Construction for Distributed Execution of Large DNN ModelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.346691335:12(2375-2391)Online publication date: Dec-2024
  • (2024)Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA57654.2024.00067(803-816)Online publication date: 2-Mar-2024
  • (2024)High-Level Programming of FPGA-Accelerated Systems with Parallel PatternsInternational Journal of Parallel Programming10.1007/s10766-024-00770-352:4(253-273)Online publication date: 1-Aug-2024
  • (2024)ImageMap: Enabling Efficient Mapping from Image Processing DSL to CGRAEuro-Par 2024: Parallel Processing10.1007/978-3-031-69577-3_5(61-76)Online publication date: 26-Aug-2024
  • (2023)Implementation of Dataflow Software Pipelining for Codelet ModelProceedings of the 2023 ACM/SPEC International Conference on Performance Engineering10.1145/3578244.3583734(161-172)Online publication date: 15-Apr-2023
  • (2023)A DSP shared is a DSP earned: HLS Task-Level Multi-Pumping for High-Performance Low-Resource Designs2023 IEEE 41st International Conference on Computer Design (ICCD)10.1109/ICCD58817.2023.00089(551-557)Online publication date: 6-Nov-2023
  • (2022)Highly Parallel Multi-FPGA System Compilation from Sequential C/C++ Code in the AWS CloudACM Transactions on Reconfigurable Technology and Systems10.1145/350769815:4(1-42)Online publication date: 8-Aug-2022
  • (2022)Dynamic-II Pipeline: Compiling Loops With Irregular Branches on Static-Scheduling CGRAIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.312134641:9(2929-2942)Online publication date: Sep-2022
  • (2022)An Automatic Pipeline Parallel Acceleration Framework for Neural Network Models on Heterogeneous Computing Platforms2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI)10.1109/PRAI55851.2022.9904009(1154-1158)Online publication date: 19-Aug-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media