Modulo Scheduling

Alex Aiken⁵,
Utpal Banerjee⁶,
Arun Kejariwal⁷ &
…
Alexandru Nicolau⁸

744 Accesses
3 Altmetric

Abstract

Loop parallelization, and particularly the parallelization of innermost loops, is the most critical aspect of any parallelizing compiler. Trace scheduling can be applied to loops, but has the disadvantage that loops must be unrolled to exploit parallelism between loop iterations, which can lead to code explosion and in general still does not exploit all of the parallelism available. In this chapter we discuss modulo scheduling, which was the first technique to address the scheduling of loops (both within and across iterations) directly and is still very widely used in practice. The original modulo scheduling technique applies only to loops where the loop body is a single basic block; we also present extensions to modulo scheduling that allow the technique to be applied to more general loops.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

eBook: USD 15.99; Price excludes VAT (USA)

Softcover Book: USD 64.99; Price excludes VAT (USA)

Hardcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
In practice this means that we set the unrolling factor such that it is an even divisor of the number of iterations of the loop, if known.
2.
The recurrences in the dependence graph can be enumerated using the algorithms in [Tie70, MD76]. RecII can also be computed in polynomial time via, for instance, the Bellman-Ford algorithm [Bel58, CLR90].
3.
We follow the predicate notion proposed in [WBHS92].
4.
The longer unrolled loop body may result in different register allocation that increases register pressure.

Bibliography

T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685–690, 1974.
Article MATH Google Scholar
A. Aletá, J. M. Codina, J. Sánchez, and A. González. Graph-partitioning based instruction scheduling for clustered processors. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 150–159, Austin, TX, 2001.
Google Scholar
E. R. Altman, R. Govindrajan, and G. R. Rao. Scheduling and mapping: software pipelining in the presence of hazards. In Proceedings of the SIGPLAN ’95 Conference on Programming Language Design and Implementation, 1995.
Google Scholar
V. H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367–432, 1995.
Google Scholar
J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January 1983.
Google Scholar
R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
MathSciNet MATH Google Scholar
A. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Computer, 14(9):18–27, September 1981.
Article Google Scholar
J. M. Codina, J. Llosa, and A. Gonzalez. A comparative study of modulo scheduling techniques. In Proceedings of the 16th ACM International Conference on Supercomputing, UNC, Universitat Politechnica de Catalunya, June 2002.
Google Scholar
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990.
MATH Google Scholar
C. Ding, S. Carr, and P. H. Sweany. Modulo scheduling with cache reuse information. In Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 1079–1083, 1997.
Google Scholar
B. Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Montreal, Canada, August 1994.
Google Scholar
J. C. Dehnert, P. Y. T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pages 26–38, Boston, MA, 1989.
Google Scholar
A. E. Eichenberger and Edward S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. pages 338–349, 1995.
Google Scholar
J. R. Ellis. Bulldog: A compiler for VLIW architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific). PhD thesis, Dept. of Computer Science, Yale University, February 1985.
Google Scholar
P. Feautrier. Array expansion. In Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, July 1988.
Google Scholar
P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23–52, February 1991.
Article MATH Google Scholar
P. Feautrier. Fine-grain scheduling under resource constraints. In Proceedings of the Seventh Annual Workshop on Languages, Compilers and Compilers for Parallel Computers, Ithaca, NY, August 1994.
Google Scholar
M. M. Fernandes, J. Llosa, and N. Topham. Distributed modulo scheduling. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, page 130, 1999.
Google Scholar
R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th International Symposium of Microarchitecture, pages 85–94, San Jose, CA, 1994.
Google Scholar
A. De Gloria, P. Faraboschi, and M. Olivieri. A non-deterministic scheduler for a software pipelining compiler. In Proceedings of the 25th International Symposium of Microarchitecture, pages 41–44, 1992.
Google Scholar
F. Gasperoni and U. Schwiegelshohn. Generating close to optimum loop schedules on parallel processors. Parallel Processing Letters, 4(4):391–403, 1994.
Article MathSciNet Google Scholar
P. S. Hsu. Highly Concurrent Scalar Processing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.
Google Scholar
T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, 9(6):841–848, 1961.
Article MathSciNet Google Scholar
R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the SIGPLAN ’93 Conference on Programming Language Design and Implementation, pages 258–267, Albuquerque, NM, June 1993.
Google Scholar
D. Jacobs, J. Prins, P. Siegel, and K. Wilson. Monte carlo techniques in code optimization. In Proceedings of the 15th Workshop on Microprogramming, pages 143–148, 1982.
Google Scholar
A. Kejariwal and A. Nicolau. Modulo scheduling and loop pipelining. In D. Padua, editor, Encyclopedia of Parallel Computing, pages 1158–1173. Springer, 2011.
Google Scholar
P. Kogge. The microprogramming of pipelined processors. In Proceedings of the Fourth International Symposium on Computer Architecture, pages 63–69, March 1977.
Google Scholar
P. M. Kogge. The Architecture of Pipelined Computers. Hemisphere Publishing Corporation, 1981.
MATH Google Scholar
M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.
Google Scholar
E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
MATH Google Scholar
J. Llosa and S. M. Freudenberger. Reduced code size modulo scheduling in the absence of hardware support. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 99–110, Istanbul, Turkey, 2002.
Google Scholar
D. M. Lavery and W. W. Hwu. Unrolling-based optimizations for modulo scheduling. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 327–337, Ann Arbor, MI, 1995.
Google Scholar
D. M. Lavery and W. W. Hwu. Modulo scheduling of loops in control-intensive non-numeric programs. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 126–137, Paris, France, 1996.
Google Scholar
S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, and W.-M. W. Hwu. Compiler code transformations for superscalar-based high performance systems. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 808–817, 1992.
Google Scholar
P. Mateti and N. Deo. On algorithms for enumerating all circuits of a graph. SIAM Journal of Computing, 5(1):90–99, 1976.
Article MathSciNet MATH Google Scholar
M. C. Merten and W. W. Hwu. Modulo schedule buffers. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 138–149, Austin, TX, 2001.
Google Scholar
D. Milicev and Z. Jovanovic. Predicated software pipelining technique for loops with conditions. In Proceedings of the 12th International Parallel Processing Symposium, Orlando, FL, March 1998.
Google Scholar
E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 103–114, Dallas, TX, 1998.
Google Scholar
J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. In Proceedings of the Third International Symposium on Computer Architecture, pages 159–164, 1976.
Google Scholar
J. Park and M. Schlansker. On predicated execution. Technical Report 58–91, Hewlett Packard Laboratories, 1991.
Google Scholar
B. R. Rau. Data flow and dependence analysis for instruction level parallelism. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, pages 236–250, 1992.
Google Scholar
B. R. Rau. Iterative modulo scheduling. Technical Report HPL-94-115, Hewlett Packard Laboratories, November 1995.
Google Scholar
C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez. Optimal scheduling strategies in a multiprocessor system. IEEE Transactions on Computers, C-21(2):137–146, February 1972.
Article MathSciNet MATH Google Scholar
C. R. Reeves. Modern Heuristic techniques for Combinatorial Problems. John Wiley and Sons, New York, NY, 1993.
MATH Google Scholar
B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183–198, Chatham, MA, December 1981.
Google Scholar
J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In Proceedings of the SIGPLAN ’96 Conference on Programming Language Design and Implementation, pages 1–11, Philadelphia, PA, 1996.
Google Scholar
B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schema for modulo scheduled loops. In Proceedings of the 25th International Symposium of Microarchitecture, pages 158–169, Portland, OR, 1992.
Google Scholar
B. R. Rau, W. L. Yen, W. Yen, and A. Towle. The Cydra 5 departmental supercomputer: design philosophies, decisions and trade-offs. IEEE Computer, 22(1):12–35, January 1989.
Article Google Scholar
J. Sánchez and A. González. Cache sensitive modulo scheduling. In Proceedings of the 30th International Symposium of Microarchitecture, pages 338–348, Research Triangle Park, NC, 1997.
Google Scholar
J. Sánchez and A. González. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Proceedings of the 2000 International Conference on Parallel Processing, page 555, 2000.
Google Scholar
J. Sánchez and A. González. Instruction scheduling for clustered VLIW architectures. In Proceedings of the 13th International Symposium on System synthesis, pages 41–46, Madrid, Spain, 2000.
Google Scholar
J. Sánchez and A. González. Modulo scheduling for a fully-distributed clustered VLIW architecture. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 124–133, Monterey, CA, 2000.
Google Scholar
J. Sánchez and A. González. Clustered modulo scheduling in a VLIW architecture with distributed cache. Journal of Instruction Level Parallelism, 3, 2001.
Google Scholar
J. C. Tiernan. An efficient search algorithm to find the elementary circuits of a graph. Communications of the ACM, 13(12):722–726, 1970.
Article MathSciNet MATH Google Scholar
P. Tirumalai, M. Lee, and M. Schlansker. Parallelization of loops with exits on pipelined architectures. In Proceedings of the 1990 Conference on Supercomputing, pages 200–212, 1990.
Google Scholar
R. F. Touzeau. A Fortran compiler for the FPS-164 scientific computer. In Proceedings of the 1984 SIGPLAN Symposium on Compiler construction, pages 48–57, 1984.
Google Scholar
M. Tokoro, T. Takizuka, E. Tamura, and I. Yamaura. A technique of global optimization of microprograms. In Proceedings of the 11th Annual Workshop on Microprogramming, pages 41–50, Pacific Grove, CA, 1978.
Google Scholar
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramaniam. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th International Symposium of Microarchitecture, Portland, OR, December 1992.
Google Scholar
J. Wang and C. Eisenbeis. Decomposed software pipelining. Technical Report RR-1838, INRIA-Rocquencourt, France, January 1993.
Google Scholar
G. Wood. Global optimization of microprograms through modular control constructs. In Proceedings of the 12th Workshop on Microprogramming, pages 1–6, 1979.
Google Scholar
N. J. Warter and N. Partamian. Modulo scheduling with multiple mutiple initiation intervals. In Proceedings of the 28th International Symposium of Microarchitecture, Ann Arbor, MI, November 1995.
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, Stanford University, Stanford, CA, USA
Alex Aiken
Department of Computer Science, University of California at Irvine, Irvine, CA, USA
Utpal Banerjee
MZ Inc., Palo Alto, USA
Arun Kejariwal
Center for Embedded Computer Systems, University of California at Irvine, Irvine, CA, USA
Alexandru Nicolau

Authors

Alex Aiken
View author publications
You can also search for this author in PubMed Google Scholar
Utpal Banerjee
View author publications
You can also search for this author in PubMed Google Scholar
Arun Kejariwal
View author publications
You can also search for this author in PubMed Google Scholar
Alexandru Nicolau
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Aiken, A., Banerjee, U., Kejariwal, A., Nicolau, A. (2016). Modulo Scheduling. In: Instruction Level Parallelism. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7797-7_6

Download citation

DOI: https://doi.org/10.1007/978-1-4899-7797-7_6
Published: 30 November 2016
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7795-3
Online ISBN: 978-1-4899-7797-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics