Nothing Special   »   [go: up one dir, main page]

Skip to main content

Modulo Scheduling

  • Chapter
  • First Online:
Instruction Level Parallelism

Abstract

Loop parallelization, and particularly the parallelization of innermost loops, is the most critical aspect of any parallelizing compiler. Trace scheduling can be applied to loops, but has the disadvantage that loops must be unrolled to exploit parallelism between loop iterations, which can lead to code explosion and in general still does not exploit all of the parallelism available. In this chapter we discuss modulo scheduling, which was the first technique to address the scheduling of loops (both within and across iterations) directly and is still very widely used in practice. The original modulo scheduling technique applies only to loops where the loop body is a single basic block; we also present extensions to modulo scheduling that allow the technique to be applied to more general loops.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

eBook
USD 15.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 64.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 99.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    In practice this means that we set the unrolling factor such that it is an even divisor of the number of iterations of the loop, if known.

  2. 2.

    The recurrences in the dependence graph can be enumerated using the algorithms in [Tie70, MD76]. RecII can also be computed in polynomial time via, for instance, the Bellman-Ford algorithm [Bel58, CLR90].

  3. 3.

    We follow the predicate notion proposed in [WBHS92].

  4. 4.

    The longer unrolled loop body may result in different register allocation that increases register pressure.

Bibliography

  1. T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685–690, 1974.

    Article  MATH  Google Scholar 

  2. A. Aletá, J. M. Codina, J. Sánchez, and A. González. Graph-partitioning based instruction scheduling for clustered processors. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 150–159, Austin, TX, 2001.

    Google Scholar 

  3. E. R. Altman, R. Govindrajan, and G. R. Rao. Scheduling and mapping: software pipelining in the presence of hazards. In Proceedings of the SIGPLAN ’95 Conference on Programming Language Design and Implementation, 1995.

    Google Scholar 

  4. V. H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367–432, 1995.

    Google Scholar 

  5. J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January 1983.

    Google Scholar 

  6. R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.

    MathSciNet  MATH  Google Scholar 

  7. A. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Computer, 14(9):18–27, September 1981.

    Article  Google Scholar 

  8. J. M. Codina, J. Llosa, and A. Gonzalez. A comparative study of modulo scheduling techniques. In Proceedings of the 16th ACM International Conference on Supercomputing, UNC, Universitat Politechnica de Catalunya, June 2002.

    Google Scholar 

  9. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990.

    MATH  Google Scholar 

  10. C. Ding, S. Carr, and P. H. Sweany. Modulo scheduling with cache reuse information. In Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 1079–1083, 1997.

    Google Scholar 

  11. B. Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Montreal, Canada, August 1994.

    Google Scholar 

  12. J. C. Dehnert, P. Y. T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pages 26–38, Boston, MA, 1989.

    Google Scholar 

  13. A. E. Eichenberger and Edward S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. pages 338–349, 1995.

    Google Scholar 

  14. J. R. Ellis. Bulldog: A compiler for VLIW architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific). PhD thesis, Dept. of Computer Science, Yale University, February 1985.

    Google Scholar 

  15. P. Feautrier. Array expansion. In Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, July 1988.

    Google Scholar 

  16. P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23–52, February 1991.

    Article  MATH  Google Scholar 

  17. P. Feautrier. Fine-grain scheduling under resource constraints. In Proceedings of the Seventh Annual Workshop on Languages, Compilers and Compilers for Parallel Computers, Ithaca, NY, August 1994.

    Google Scholar 

  18. M. M. Fernandes, J. Llosa, and N. Topham. Distributed modulo scheduling. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, page 130, 1999.

    Google Scholar 

  19. R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th International Symposium of Microarchitecture, pages 85–94, San Jose, CA, 1994.

    Google Scholar 

  20. A. De Gloria, P. Faraboschi, and M. Olivieri. A non-deterministic scheduler for a software pipelining compiler. In Proceedings of the 25th International Symposium of Microarchitecture, pages 41–44, 1992.

    Google Scholar 

  21. F. Gasperoni and U. Schwiegelshohn. Generating close to optimum loop schedules on parallel processors. Parallel Processing Letters, 4(4):391–403, 1994.

    Article  MathSciNet  Google Scholar 

  22. P. S. Hsu. Highly Concurrent Scalar Processing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.

    Google Scholar 

  23. T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, 9(6):841–848, 1961.

    Article  MathSciNet  Google Scholar 

  24. R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the SIGPLAN ’93 Conference on Programming Language Design and Implementation, pages 258–267, Albuquerque, NM, June 1993.

    Google Scholar 

  25. D. Jacobs, J. Prins, P. Siegel, and K. Wilson. Monte carlo techniques in code optimization. In Proceedings of the 15th Workshop on Microprogramming, pages 143–148, 1982.

    Google Scholar 

  26. A. Kejariwal and A. Nicolau. Modulo scheduling and loop pipelining. In D. Padua, editor, Encyclopedia of Parallel Computing, pages 1158–1173. Springer, 2011.

    Google Scholar 

  27. P. Kogge. The microprogramming of pipelined processors. In Proceedings of the Fourth International Symposium on Computer Architecture, pages 63–69, March 1977.

    Google Scholar 

  28. P. M. Kogge. The Architecture of Pipelined Computers. Hemisphere Publishing Corporation, 1981.

    MATH  Google Scholar 

  29. M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.

    Google Scholar 

  30. E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.

    MATH  Google Scholar 

  31. J. Llosa and S. M. Freudenberger. Reduced code size modulo scheduling in the absence of hardware support. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 99–110, Istanbul, Turkey, 2002.

    Google Scholar 

  32. D. M. Lavery and W. W. Hwu. Unrolling-based optimizations for modulo scheduling. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 327–337, Ann Arbor, MI, 1995.

    Google Scholar 

  33. D. M. Lavery and W. W. Hwu. Modulo scheduling of loops in control-intensive non-numeric programs. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 126–137, Paris, France, 1996.

    Google Scholar 

  34. S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, and W.-M. W. Hwu. Compiler code transformations for superscalar-based high performance systems. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 808–817, 1992.

    Google Scholar 

  35. P. Mateti and N. Deo. On algorithms for enumerating all circuits of a graph. SIAM Journal of Computing, 5(1):90–99, 1976.

    Article  MathSciNet  MATH  Google Scholar 

  36. M. C. Merten and W. W. Hwu. Modulo schedule buffers. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 138–149, Austin, TX, 2001.

    Google Scholar 

  37. D. Milicev and Z. Jovanovic. Predicated software pipelining technique for loops with conditions. In Proceedings of the 12th International Parallel Processing Symposium, Orlando, FL, March 1998.

    Google Scholar 

  38. E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 103–114, Dallas, TX, 1998.

    Google Scholar 

  39. J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. In Proceedings of the Third International Symposium on Computer Architecture, pages 159–164, 1976.

    Google Scholar 

  40. J. Park and M. Schlansker. On predicated execution. Technical Report 58–91, Hewlett Packard Laboratories, 1991.

    Google Scholar 

  41. B. R. Rau. Data flow and dependence analysis for instruction level parallelism. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, pages 236–250, 1992.

    Google Scholar 

  42. B. R. Rau. Iterative modulo scheduling. Technical Report HPL-94-115, Hewlett Packard Laboratories, November 1995.

    Google Scholar 

  43. C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez. Optimal scheduling strategies in a multiprocessor system. IEEE Transactions on Computers, C-21(2):137–146, February 1972.

    Article  MathSciNet  MATH  Google Scholar 

  44. C. R. Reeves. Modern Heuristic techniques for Combinatorial Problems. John Wiley and Sons, New York, NY, 1993.

    MATH  Google Scholar 

  45. B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183–198, Chatham, MA, December 1981.

    Google Scholar 

  46. J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In Proceedings of the SIGPLAN ’96 Conference on Programming Language Design and Implementation, pages 1–11, Philadelphia, PA, 1996.

    Google Scholar 

  47. B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schema for modulo scheduled loops. In Proceedings of the 25th International Symposium of Microarchitecture, pages 158–169, Portland, OR, 1992.

    Google Scholar 

  48. B. R. Rau, W. L. Yen, W. Yen, and A. Towle. The Cydra 5 departmental supercomputer: design philosophies, decisions and trade-offs. IEEE Computer, 22(1):12–35, January 1989.

    Article  Google Scholar 

  49. J. Sánchez and A. González. Cache sensitive modulo scheduling. In Proceedings of the 30th International Symposium of Microarchitecture, pages 338–348, Research Triangle Park, NC, 1997.

    Google Scholar 

  50. J. Sánchez and A. González. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Proceedings of the 2000 International Conference on Parallel Processing, page 555, 2000.

    Google Scholar 

  51. J. Sánchez and A. González. Instruction scheduling for clustered VLIW architectures. In Proceedings of the 13th International Symposium on System synthesis, pages 41–46, Madrid, Spain, 2000.

    Google Scholar 

  52. J. Sánchez and A. González. Modulo scheduling for a fully-distributed clustered VLIW architecture. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 124–133, Monterey, CA, 2000.

    Google Scholar 

  53. J. Sánchez and A. González. Clustered modulo scheduling in a VLIW architecture with distributed cache. Journal of Instruction Level Parallelism, 3, 2001.

    Google Scholar 

  54. J. C. Tiernan. An efficient search algorithm to find the elementary circuits of a graph. Communications of the ACM, 13(12):722–726, 1970.

    Article  MathSciNet  MATH  Google Scholar 

  55. P. Tirumalai, M. Lee, and M. Schlansker. Parallelization of loops with exits on pipelined architectures. In Proceedings of the 1990 Conference on Supercomputing, pages 200–212, 1990.

    Google Scholar 

  56. R. F. Touzeau. A Fortran compiler for the FPS-164 scientific computer. In Proceedings of the 1984 SIGPLAN Symposium on Compiler construction, pages 48–57, 1984.

    Google Scholar 

  57. M. Tokoro, T. Takizuka, E. Tamura, and I. Yamaura. A technique of global optimization of microprograms. In Proceedings of the 11th Annual Workshop on Microprogramming, pages 41–50, Pacific Grove, CA, 1978.

    Google Scholar 

  58. N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramaniam. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th International Symposium of Microarchitecture, Portland, OR, December 1992.

    Google Scholar 

  59. J. Wang and C. Eisenbeis. Decomposed software pipelining. Technical Report RR-1838, INRIA-Rocquencourt, France, January 1993.

    Google Scholar 

  60. G. Wood. Global optimization of microprograms through modular control constructs. In Proceedings of the 12th Workshop on Microprogramming, pages 1–6, 1979.

    Google Scholar 

  61. N. J. Warter and N. Partamian. Modulo scheduling with multiple mutiple initiation intervals. In Proceedings of the 28th International Symposium of Microarchitecture, Ann Arbor, MI, November 1995.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer-Verlag US

About this chapter

Cite this chapter

Aiken, A., Banerjee, U., Kejariwal, A., Nicolau, A. (2016). Modulo Scheduling. In: Instruction Level Parallelism. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7797-7_6

Download citation

  • DOI: https://doi.org/10.1007/978-1-4899-7797-7_6

  • Published:

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4899-7795-3

  • Online ISBN: 978-1-4899-7797-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics