Article

Free access

Optimized unrolling of nested loops

Author:

Vivek SarkarAuthors Info & Claims

ICS '00: Proceedings of the 14th international conference on Supercomputing

Pages 153 - 166

https://doi.org/10.1145/335231.335246

Published: 08 May 2000 Publication History

Abstract

In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and 1-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors.

Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler which generates high quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).

References

[1]

Michael J. Alexander, Mark W. Barley, Bruce R. Childers, Jack W. Davidson, and Sanjay Jinturkar. Memory bandwidth optimizations for wide-bus machines. Proceedings of the ~fith Hawaii International Conference on System Sciences, Wailea, Hawaii, pages 466-475, January 1993.

[2]

F. E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.

[3]

D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler Transformations for High-Performance Computing. A CM Computing Surveys, 26(4):345-420, December 1994.

Digital Library

[4]

Mauricio Breternitz, Michael Lai, Vivek Sarkar, and Barbara Simons. Compiler Solutions for the Stale-Data and False-Sharing Problems. Technical report, IBM Santa Teresa Laboratory, April 1993. TR 03.466.

[5]

David Callahan, Steve Can-, and Ken Kennedy. Improving Register Allocation for Subscripted Variables. Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, New York, pages 53-65, June 1990.

Digital Library

[6]

S. Can" and Y. Guan. Unroll-and-Jam Using Uniformly Generated Sets. Proceedings of MICRO-30, pages 349- 357, December 1997.

Digital Library

[7]

Steve Carr and Ken Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM TOPLAS, 16(4), November 1994.

Digital Library

[8]

Steve Cart and Ken Kennedy. Scalar Replacement in the Presence of Conditional Control Flow. Software-- Practice and Experience, (I):51-77, January 1994.

Digital Library

[9]

The Standard Performance Evaluation Corporation. SPEC CPU95 Benchmarks. http://open.specbench.org/osg/cpu95/, 1997.

[10]

Jack W. Davidson and Sanjay Jinturkar. Aggressive Loop Unrolling in a Retargetable, Optimizing Cornprier. In Compiler construction. Proceedings of the 6th international conference. Held Apr. $~-$6, 1996 in Linkoping, Sweden., volume 1060 of Lecture Notes in Computer Science. Springer-Verlag, New York, 1996.

Digital Library

[11]

J. J. Dongarra and A. R. Hinds. Unrolling Loops in Fortran. Software - Practice and Exper/ence, 9(3):219- 226, March 1979.

[12]

Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. On Estimating and Enhan~-xg Cache Effectiveness. Lecture Notes in Computer Science, (589):328-343, 1991. Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, Santa Clara, California, USA, August 1991. Edited by U. Banerjee, D. Gelernter, A. Nicolau, D. Padua.

Digital Library

[13]

J. A. Fisher, J. It. Ellis, J. C. Ruttenberg, and A. Nicolau. Parallel Processing: A Smart Compiler and a Dumb Machine. Proceedings of the A CM Symposium on Compiler Construction, pages 37 - 47, June 1984.

Digital Library

[14]

Iteese B. Jones and Vicki H. Allan. Software pipellning: an evaluation of enhanced pipelining. Proceedings of the ~t h annual international symposium on Microarchitecture, pages 82-92, December 1990.

Digital Library

[15]

Daniel M.Lavery and Wen-Mei W.Hwu. Unrollingbased optimizations for modulo scheduling. Proceedings of MICRO-S8, pages 327-337, December 1995.

Digital Library

[16]

T. C. Mowry. Tolerating Latency Through So, ware- Controlled Data Prefetching. Phi) thesis, Stanford University, March 1994.

Digital Library

[17]

Allan K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989. Rice COMP TR89-93.

Digital Library

[18]

B. Ramakrishna Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. Proceedings of the 27th annual international symposium on Microarchitecture, San Jose, CA USA, pages 63-74, November 1994.

Digital Library

[19]

Vivek Sarkar. Automatic Partitioning of a Program Dependence Graph into Parallel Tasks. IBM Journal o} Research and Development, 35(5/6), 1991.

Digital Library

[20]

Vivek Sarkar. Automatic Selection of High Order Transformations in the IBM XL Fortran Compilers. IBM Journal of Research and Development, 41(3), May 1997.

Digital Library

[21]

Vivek Sarkar and Barbara Simons. Don't Waste Those Cycles: An In-Depth Look at Scheduling Instructions in Basic Blocks and Loops. Video Lecture in University Video Communication's Distinguished Lecture Series IX, August 1994.

[22]

Vivek Sarkar and Radhika Thekkath. A General Framework for Iteration-iteordering Loop Transformations. Proceedings of the A CM SIGPLAN "9~ Conference on Programming Language Design and Implementation, pages 176-187, June 1992.

Digital Library

[23]

Bogong Su, Shiyuan Ding, Jian Wang, and Jinshi Xia. GURPR--a method for global software piplining. Proceedings of the ZOth annual international symposium on Microarch~tecture, pages 88-96, December 1986.

Digital Library

[24]

S. Weiss and J. E. Smith. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proceedings of the Second International Conference on Architectural SUpport for Programming Language and Operating Systems (ASPLOS), pages 105-109, October 1987.

[25]

Michael E. Wolf and Monica S. Lam. A Data Locality Optimization Algorithm. Proceedings of the A CM SIG- PLAN Symposium on Programming Language Design and Implementation, pages 30-44, June 1991.

Digital Library

[26]

Michael J. Wolfe. Optimizing Supercompilersfor Supercomputers. Pitman, London and The MIT Press, Cambridge, Massachusetts, 1989. In the series, Research Monographs in Parallel and Distributed Computing.

Digital Library

Cited By

Zhou AHu YXu XZhang C(2024) ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-guided EmulationACM Transactions on Software Engineering and Methodology10.1145/364033733:4(1-31)Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3640337
Shi ZGao JGuan X(2024)Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoostJournal of Circuits, Systems and Computers10.1142/S0218126625500355Online publication date: 23-Sep-2024
https://doi.org/10.1142/S0218126625500355
Mosaner RLeopoldseder DStadler LMössenböck HKuchen HSinger J(2021)Using machine learning to predict the code size impact of duplication heuristics in a dynamic compilerProceedings of the 18th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3475738.3480943(127-135)Online publication date: 29-Sep-2021
https://dl.acm.org/doi/10.1145/3475738.3480943
Show More Cited By

Recommendations

Optimized Unrolling of Nested Loops

Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact ...
A method for estimating optimal unrolling times for nested loops
ISPAN '97: Proceedings of the 1997 International Symposium on Parallel Architectures, Algorithms and Networks

Loop unrolling is one of the most promising parallelization techniques, because the nature of programs causes most of the processing time to be spent in their loops. Unrolling not only the innermost loop but also outer loops greatly expands the scope ...
Software Pipelining of Nested Loops
CC '01: Proceedings of the 10th International Conference on Compiler Construction

Software pipelining is a technique to improve the performance of a loop by overlapping the execution of several iterations. The execution of a software-pipelined loop goes through three phases: prolog, kernel, and epilog. Software pipelining works best ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '00: Proceedings of the 14th international conference on Supercomputing

May 2000

347 pages

ISBN:1581132700

DOI:10.1145/335231

Chairmen:
John Reynders
Los Alamos National Lab, Los Alamos, NM
,
Alex Veidenbaum
Univ. of California at Irvine, Irvine

Copyright © 2000 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 May 2000

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Article

Conference

ICS00

Sponsor:

SIGARCH

ICS00: International Conference on Supercomputing

May 8 - 11, 2000

New Mexico, Santa Fe, USA

Acceptance Rates

ICS '00 Paper Acceptance Rate 33 of 122 submissions, 27%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

28
Total Citations
View Citations
1,611
Total Downloads

Downloads (Last 12 months)245
Downloads (Last 6 weeks)27

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhou AHu YXu XZhang C(2024) ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-guided EmulationACM Transactions on Software Engineering and Methodology10.1145/364033733:4(1-31)Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3640337
Shi ZGao JGuan X(2024)Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoostJournal of Circuits, Systems and Computers10.1142/S0218126625500355Online publication date: 23-Sep-2024
https://doi.org/10.1142/S0218126625500355
Mosaner RLeopoldseder DStadler LMössenböck HKuchen HSinger J(2021)Using machine learning to predict the code size impact of duplication heuristics in a dynamic compilerProceedings of the 18th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3475738.3480943(127-135)Online publication date: 29-Sep-2021
https://dl.acm.org/doi/10.1145/3475738.3480943
Onodera AKomatsu KFujimoto SIsobe YSato MKobayashi H(2021)Optimization of the Himeno Benchmark for SX-Aurora TSUBASABenchmarking, Measuring, and Optimizing10.1007/978-3-030-71058-3_8(127-143)Online publication date: 2-Mar-2021
https://doi.org/10.1007/978-3-030-71058-3_8
Park SWu YLee JAupov AMahlke S(2019)Multi-objective Exploration for Practical Optimization Decisions in Binary TranslationACM Transactions on Embedded Computing Systems10.1145/335818518:5s(1-19)Online publication date: 7-Oct-2019
https://dl.acm.org/doi/10.1145/3358185
Koraei MFatemi OJahre M(2019)DCMIACM Transactions on Architecture and Code Optimization10.1145/335281316:4(1-24)Online publication date: 11-Oct-2019
https://dl.acm.org/doi/10.1145/3352813
Ganser SGrößlinger ASiegmund NApel SLengauer C(2018)Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance ModelsACM Transactions on Architecture and Code Optimization10.1145/329177315:4(1-27)Online publication date: 19-Dec-2018
https://dl.acm.org/doi/10.1145/3291773
Barua PShirako JSarkar VEvripidou SStenström PO'Boyle M(2018)Cost-driven thread coarsening for GPU kernelsProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243196(1-14)Online publication date: 1-Nov-2018
https://dl.acm.org/doi/10.1145/3243176.3243196
Leopoldseder DSchatz RStadler LRigger MWürthinger TMössenböck HTilevich EMössenböck H(2018)Fast-path loop unrolling of non-counted loops to enable subsequent compiler optimizationsProceedings of the 15th International Conference on Managed Languages & Runtimes10.1145/3237009.3237013(1-13)Online publication date: 12-Sep-2018
https://dl.acm.org/doi/10.1145/3237009.3237013
Ashouri AKillian WCavazos JPalermo GSilvano C(2018)A Survey on Compiler Autotuning using Machine LearningACM Computing Surveys10.1145/319797851:5(1-42)Online publication date: 18-Sep-2018
https://dl.acm.org/doi/10.1145/3197978
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents