Minimal Unroll Factor for Code Generation of Software Pipelining

Mounira Bachir¹,
Sid-Ahmed-Ali Touati¹,
Frederic Brault¹,
David Gregg¹ &
Albert Cohen¹

Abstract

We address the problem of generating compact code from software pipelined loops. Although software pipelining is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a code generation challenge. We seek the minimal unrolling factor that enables the periodic register allocation of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by inserting register moves, which increase the number of operations in the loop, and may damage the schedule of the software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (maximal number of values simultaneously alive) are sufficient for periodic register allocation (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995; Hendren et al. in CC ’92: Proceedings of the 4th International Conference on Compiler Construction, pages 176–191, London, UK, 1992). However, the best existing heuristic for controlling this code growth, modulo variable expansion (Lam in SIGPLAN Not 23(7):318–328, 1988), may not apply the correct amount of loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills (Eisenbeis et al. in PACT ’95: Proceedings of the IFIP WG10.3 working conference on Parallel Architectures and Compilation Techniques, pages 264–267, Manchester, UK, 1995).
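To make the two quantities named above concrete, the following minimal Python sketch computes MAXLIVE for a steady-state kernel and the unroll degree that modulo variable expansion is commonly described as choosing (the largest lifetime divided by II, rounded up). The function names, the half-open interval encoding, and the sample lifetimes are assumptions made for illustration; they are not taken from the article.

```python
def maxlive(lifetimes, ii):
    """Maximal number of values simultaneously alive in the kernel.

    lifetimes: list of (start, end) cycles; a value is alive on [start, end).
    ii: initiation interval of the software pipeline.
    A lifetime longer than ii overlaps instances of itself that belong to
    other iterations, so each live cycle is folded modulo ii.
    """
    live = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            live[t % ii] += 1
    return max(live)


def mve_unroll_degree(lifetimes, ii):
    """Unroll degree in the usual textbook reading of modulo variable
    expansion: max over all values of ceil(lifetime / ii)."""
    # -(-a // b) is integer ceiling division.
    return max(-(-(end - start) // ii) for start, end in lifetimes)


# Hypothetical kernel with II = 2 and three values.
example = [(0, 3), (1, 5), (2, 3)]
print(maxlive(example, 2), mve_unroll_degree(example, 2))
```

In this toy kernel MAXLIVE is 5 while the expansion degree is 2. The two numbers answer different questions, which is why an unroll degree chosen from lifetimes alone gives no guarantee that MAXLIVE registers will suffice.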
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on register reuse. The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and study a dedicated algorithm for each problem. Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from STMicroelectronics, and validated on multiple benchmark suites.
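One way to picture the idea of spending leftover registers is sketched below. It assumes, in the spirit of the meeting graph model cited in the references, that the register reuse pattern decomposes into circuits with integer weights and that the unroll factor is the least common multiple of those weights. Under that assumption, adding a free register to a circuit raises its weight by one, and the goal is to distribute the free registers so that the lcm shrinks. The exhaustive search, the function name `min_unroll`, and the sample weights are illustrative assumptions, not the algorithms proposed in the article.

```python
from functools import reduce
from itertools import product
from math import gcd


def lcm_all(values):
    """Least common multiple of a sequence of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)


def min_unroll(weights, free_regs):
    """Brute-force search over all ways of adding at most free_regs
    registers to the circuits; returns (best lcm, registers per circuit).
    Exponential in the number of circuits, so only suitable as a toy."""
    best = (lcm_all(weights), tuple(0 for _ in weights))
    for extra in product(range(free_regs + 1), repeat=len(weights)):
        if sum(extra) > free_regs:
            continue  # would use more registers than are left
        candidate = lcm_all(w + r for w, r in zip(weights, extra))
        if candidate < best[0]:
            best = (candidate, extra)
    return best


# Hypothetical circuit weights 2, 3 and 5: the unroll factor is 30 with no
# free register, and drops once registers can be added to circuits.
for regs in (0, 1, 5):
    print(regs, min_unroll([2, 3, 5], regs))
```

With these weights a single extra register on the third circuit already brings the factor from 30 down to 6, and five extra registers bring it to 5, which cannot be improved because no circuit weight can decrease.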

References

Darte A., Schreiber R., Villard G.: Lattice-based memory allocation. IEEE Trans. Comput. 54(10), 1242–1257 (2005)
Bachir, M., Touati, S.-A.-A., Cohen, A.: Post-pass periodic register allocation to minimise loop unrolling degree. In: LCTES ’08: Proceedings of the 2008 ACM SIGPLAN-SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 141–150. ACM, New York, NY, USA (2008)
Bachir, M., Touati, S.-A.-A., Cohen, A.: Decomposing meeting graph circuits to minimise kernel loop unrolling. In: 9th Workshop on Optimizations for DSP and Embedded Systems (ODES’11, associated with CGO), Chamonix, France, April (2011)
Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: PACT’13 IEEE International Conference on Parallel Architecture and Compilation Techniques, pp. 7–16. Juan-les-Pins, France, September (2004)
de Dinechin, B.D.: A unified software pipeline construction scheme for modulo scheduled loops. In: PaCT ’97: Proceedings of the 4th International Conference on Parallel Computing Technologies, pp. 189–200. Springer, London, UK (1997)
de Werra, D., Eisenbeis, C., Lelait, S., Marmol, B.: On a graph-theoretical model for cyclic register allocation. Discret. Appl. Math. 93(2–3), 191–203 (1999)
Dehnert, J.C., Hsu, P.Y.-T., Bratt, J.P.: Overlapped loop support in the Cydra 5. In: ASPLOS-III: Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 26–38. ACM, New York, NY, USA (1989)
Dehnert J.C., Towle R.A.: Compiling for the Cydra 5. J. Supercomput. 7(1–2), 181–227 (1993)
Eisenbeis, C., Lelait, S.: LoRA a Package for Loop Optimal Register Allocation. Technical report, INRIA, France (1999)
Eisenbeis, C., Lelait S., Marmol B.: The meeting graph: a new model for loop cyclic register allocation. In: PACT ’95: Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques, pp. 264–267. IFIP Working Group on Algol, Manchester, UK (1995)
Faraboschi, P., Brown, G., Fisher, J.A., Desoli, G., Homewood, F.: Lx: a technology platform for customizable VLIW embedded processing. In: ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 203–213. ACM, New York, NY, USA (2000)
Fisher J.A., Faraboschi P., Young C.: Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann Publishers, San Francisco, CA (2005)
Hendren, L., Gao, G.R., Altman, E.R., Mukerji, C.: A register allocation framework based on hierarchical cyclic interval graphs. In: CC ’92: Proceedings of the 4th International Conference on Compiler Construction, pp. 176–191. Springer, London, UK (1992)
Howgrave-Graham, N.: Approximate integer common divisors. In: Cryptography and Lattices, International Conference (CaLC), volume 2146 of Lecture Notes in Computer Science, pp. 51–66 (2001)
Huff R.A.: Lifetime-sensitive modulo scheduling. SIGPLAN Not. 28(6), 258–267 (1993)
Lam, M.: Software pipelining: an effective scheduling technique for VLIW machines. SIGPLAN Not. 23(7), 318–328 (1988)
Nicolau, A., Potasman, R., Wang, H.: Register allocation, renaming and their impact on fine-grain parallelism. In: Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pp. 218–235. Springer, London, UK (1992)
Parhi K.K., Messerschmitt D.G.: Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. Trans. Comput. 40(2), 178–195 (1991)
Rau B.R., Glaeser C.D.: Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. SIGMICRO Newsl. 12, 183–198 (1981)
Rau, B.R., Schlansker, M.S., Tirumalai, P.P.: Code generation schema for modulo scheduled loops. In: Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 158–169. (1992)
Rong H., Douillet A., Gao G.R.: Register allocation for software pipelined multidimensional loops. ACM Trans. Program. Lang. Syst. 30(4), 1–68 (2008)
Schlansker, M., Rau, B., Mahlke, S.: Achieving High Levels of Instruction-Level Parallelism with Reduced Hardware Complexity. Technical Report HPL-96-120, Hewlett Packard (1994)
ST231 Core and Instruction Set Architecture Manual (2005)
Touati, S.-A.-A., Brault, F., Deschinkel, K., de Dinechin, B.D.: Efficient Spilling Reduction for Software Pipelined Loops in Presence of Multiple Register Types in Embedded VLIW Processors. ACM Trans. Embed. Comput. Syst. 10(4) (2011)
Touati S.-A.-A., Eisenbeis C.: Early periodic register allocation on ILP processors. Parallel Process. Lett. 14(2), 287–313 (2004)
Vasilache N., Bastoul, C., Cohen, A.: Polyhedral code generation in the real world. In: Proceedings of the International Conference on Compiler Construction (ETAPS CC’06), LNCS 3923, pp. 185–201. Springer, Vienna, Austria (2006)
Wang J., Eisenbeis C., Jourdan M., Su B.: Decomposed software pipelining: a new perspective and a new approach. Int. J. Parallel Program. 22(3), 351–373 (1994)

Author information

Authors and Affiliations

University of Nice Sophia-Antipolis, Nice, France
Mounira Bachir, Sid-Ahmed-Ali Touati, Frederic Brault, David Gregg & Albert Cohen

Corresponding author

Correspondence to Sid-Ahmed-Ali Touati.

About this article

Cite this article

Bachir, M., Touati, SAA., Brault, F. et al. Minimal Unroll Factor for Code Generation of Software Pipelining. Int J Parallel Prog 41, 1–58 (2013). https://doi.org/10.1007/s10766-012-0203-z

Received: 10 May 2011
Accepted: 20 June 2012
Published: 17 July 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s10766-012-0203-z
