Abstract
Loop parallelization, and particularly the parallelization of innermost loops, is the most critical aspect of any parallelizing compiler. Trace scheduling can be applied to loops, but has the disadvantage that loops must be unrolled to exploit parallelism between loop iterations, which can lead to code explosion and in general still does not exploit all of the parallelism available. In this chapter we discuss modulo scheduling, which was the first technique to address the scheduling of loops (both within and across iterations) directly and is still very widely used in practice. The original modulo scheduling technique applies only to loops where the loop body is a single basic block; we also present extensions to modulo scheduling that allow the technique to be applied to more general loops.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
In practice this means that we set the unrolling factor such that it is an even divisor of the number of iterations of the loop, if known.
- 2.
- 3.
We follow the predicate notion proposed in [WBHS92].
- 4.
The longer unrolled loop body may result in different register allocation that increases register pressure.
Bibliography
T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685–690, 1974.
A. Aletá, J. M. Codina, J. Sánchez, and A. González. Graph-partitioning based instruction scheduling for clustered processors. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 150–159, Austin, TX, 2001.
E. R. Altman, R. Govindrajan, and G. R. Rao. Scheduling and mapping: software pipelining in the presence of hazards. In Proceedings of the SIGPLAN ’95 Conference on Programming Language Design and Implementation, 1995.
V. H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367–432, 1995.
J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the Tenth Annual ACM Symposium on the Principles of Programming Languages, Austin, TX, January 1983.
R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
A. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Computer, 14(9):18–27, September 1981.
J. M. Codina, J. Llosa, and A. Gonzalez. A comparative study of modulo scheduling techniques. In Proceedings of the 16th ACM International Conference on Supercomputing, UNC, Universitat Politechnica de Catalunya, June 2002.
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, MA, 1990.
C. Ding, S. Carr, and P. H. Sweany. Modulo scheduling with cache reuse information. In Proceedings of the Third International Euro-Par Conference on Parallel Processing, pages 1079–1083, 1997.
B. Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Montreal, Canada, August 1994.
J. C. Dehnert, P. Y. T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pages 26–38, Boston, MA, 1989.
A. E. Eichenberger and Edward S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. pages 338–349, 1995.
J. R. Ellis. Bulldog: A compiler for VLIW architectures (parallel computing, reduced-instruction-set, trace scheduling, scientific). PhD thesis, Dept. of Computer Science, Yale University, February 1985.
P. Feautrier. Array expansion. In Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, July 1988.
P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23–52, February 1991.
P. Feautrier. Fine-grain scheduling under resource constraints. In Proceedings of the Seventh Annual Workshop on Languages, Compilers and Compilers for Parallel Computers, Ithaca, NY, August 1994.
M. M. Fernandes, J. Llosa, and N. Topham. Distributed modulo scheduling. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, page 130, 1999.
R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proceedings of the 27th International Symposium of Microarchitecture, pages 85–94, San Jose, CA, 1994.
A. De Gloria, P. Faraboschi, and M. Olivieri. A non-deterministic scheduler for a software pipelining compiler. In Proceedings of the 25th International Symposium of Microarchitecture, pages 41–44, 1992.
F. Gasperoni and U. Schwiegelshohn. Generating close to optimum loop schedules on parallel processors. Parallel Processing Letters, 4(4):391–403, 1994.
P. S. Hsu. Highly Concurrent Scalar Processing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.
T. C. Hu. Parallel sequencing and assembly line problems. Operations Research, 9(6):841–848, 1961.
R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the SIGPLAN ’93 Conference on Programming Language Design and Implementation, pages 258–267, Albuquerque, NM, June 1993.
D. Jacobs, J. Prins, P. Siegel, and K. Wilson. Monte carlo techniques in code optimization. In Proceedings of the 15th Workshop on Microprogramming, pages 143–148, 1982.
A. Kejariwal and A. Nicolau. Modulo scheduling and loop pipelining. In D. Padua, editor, Encyclopedia of Parallel Computing, pages 1158–1173. Springer, 2011.
P. Kogge. The microprogramming of pipelined processors. In Proceedings of the Fourth International Symposium on Computer Architecture, pages 63–69, March 1977.
P. M. Kogge. The Architecture of Pipelined Computers. Hemisphere Publishing Corporation, 1981.
M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN ’88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.
E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
J. Llosa and S. M. Freudenberger. Reduced code size modulo scheduling in the absence of hardware support. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 99–110, Istanbul, Turkey, 2002.
D. M. Lavery and W. W. Hwu. Unrolling-based optimizations for modulo scheduling. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 327–337, Ann Arbor, MI, 1995.
D. M. Lavery and W. W. Hwu. Modulo scheduling of loops in control-intensive non-numeric programs. In Proceedings of the 29th International Symposium of Microarchitecture MICRO-29, pages 126–137, Paris, France, 1996.
S. A. Mahlke, W. Y. Chen, J. C. Gyllenhaal, and W.-M. W. Hwu. Compiler code transformations for superscalar-based high performance systems. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 808–817, 1992.
P. Mateti and N. Deo. On algorithms for enumerating all circuits of a graph. SIAM Journal of Computing, 5(1):90–99, 1976.
M. C. Merten and W. W. Hwu. Modulo schedule buffers. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, pages 138–149, Austin, TX, 2001.
D. Milicev and Z. Jovanovic. Predicated software pipelining technique for loops with conditions. In Proceedings of the 12th International Parallel Processing Symposium, Orlando, FL, March 1998.
E. Nystrom and A. E. Eichenberger. Effective cluster assignment for modulo scheduling. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 103–114, Dallas, TX, 1998.
J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. In Proceedings of the Third International Symposium on Computer Architecture, pages 159–164, 1976.
J. Park and M. Schlansker. On predicated execution. Technical Report 58–91, Hewlett Packard Laboratories, 1991.
B. R. Rau. Data flow and dependence analysis for instruction level parallelism. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, pages 236–250, 1992.
B. R. Rau. Iterative modulo scheduling. Technical Report HPL-94-115, Hewlett Packard Laboratories, November 1995.
C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez. Optimal scheduling strategies in a multiprocessor system. IEEE Transactions on Computers, C-21(2):137–146, February 1972.
C. R. Reeves. Modern Heuristic techniques for Combinatorial Problems. John Wiley and Sons, New York, NY, 1993.
B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming, pages 183–198, Chatham, MA, December 1981.
J. Ruttenberg, G. R. Gao, A. Stoutchinin, and W. Lichtenstein. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. In Proceedings of the SIGPLAN ’96 Conference on Programming Language Design and Implementation, pages 1–11, Philadelphia, PA, 1996.
B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schema for modulo scheduled loops. In Proceedings of the 25th International Symposium of Microarchitecture, pages 158–169, Portland, OR, 1992.
B. R. Rau, W. L. Yen, W. Yen, and A. Towle. The Cydra 5 departmental supercomputer: design philosophies, decisions and trade-offs. IEEE Computer, 22(1):12–35, January 1989.
J. Sánchez and A. González. Cache sensitive modulo scheduling. In Proceedings of the 30th International Symposium of Microarchitecture, pages 338–348, Research Triangle Park, NC, 1997.
J. Sánchez and A. González. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Proceedings of the 2000 International Conference on Parallel Processing, page 555, 2000.
J. Sánchez and A. González. Instruction scheduling for clustered VLIW architectures. In Proceedings of the 13th International Symposium on System synthesis, pages 41–46, Madrid, Spain, 2000.
J. Sánchez and A. González. Modulo scheduling for a fully-distributed clustered VLIW architecture. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 124–133, Monterey, CA, 2000.
J. Sánchez and A. González. Clustered modulo scheduling in a VLIW architecture with distributed cache. Journal of Instruction Level Parallelism, 3, 2001.
J. C. Tiernan. An efficient search algorithm to find the elementary circuits of a graph. Communications of the ACM, 13(12):722–726, 1970.
P. Tirumalai, M. Lee, and M. Schlansker. Parallelization of loops with exits on pipelined architectures. In Proceedings of the 1990 Conference on Supercomputing, pages 200–212, 1990.
R. F. Touzeau. A Fortran compiler for the FPS-164 scientific computer. In Proceedings of the 1984 SIGPLAN Symposium on Compiler construction, pages 48–57, 1984.
M. Tokoro, T. Takizuka, E. Tamura, and I. Yamaura. A technique of global optimization of microprograms. In Proceedings of the 11th Annual Workshop on Microprogramming, pages 41–50, Pacific Grove, CA, 1978.
N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramaniam. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th International Symposium of Microarchitecture, Portland, OR, December 1992.
J. Wang and C. Eisenbeis. Decomposed software pipelining. Technical Report RR-1838, INRIA-Rocquencourt, France, January 1993.
G. Wood. Global optimization of microprograms through modular control constructs. In Proceedings of the 12th Workshop on Microprogramming, pages 1–6, 1979.
N. J. Warter and N. Partamian. Modulo scheduling with multiple mutiple initiation intervals. In Proceedings of the 28th International Symposium of Microarchitecture, Ann Arbor, MI, November 1995.
Author information
Authors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer-Verlag US
About this chapter
Cite this chapter
Aiken, A., Banerjee, U., Kejariwal, A., Nicolau, A. (2016). Modulo Scheduling. In: Instruction Level Parallelism. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7797-7_6
Download citation
DOI: https://doi.org/10.1007/978-1-4899-7797-7_6
Published:
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7795-3
Online ISBN: 978-1-4899-7797-7
eBook Packages: Computer ScienceComputer Science (R0)