Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/335231.335246acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
Article
Free access

Optimized unrolling of nested loops

Published: 08 May 2000 Publication History

Abstract

In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and 1-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors.
Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler which generates high quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).

References

[1]
Michael J. Alexander, Mark W. Barley, Bruce R. Childers, Jack W. Davidson, and Sanjay Jinturkar. Memory bandwidth optimizations for wide-bus machines. Proceedings of the ~fith Hawaii International Conference on System Sciences, Wailea, Hawaii, pages 466-475, January 1993.
[2]
F. E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.
[3]
D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler Transformations for High-Performance Computing. A CM Computing Surveys, 26(4):345-420, December 1994.
[4]
Mauricio Breternitz, Michael Lai, Vivek Sarkar, and Barbara Simons. Compiler Solutions for the Stale-Data and False-Sharing Problems. Technical report, IBM Santa Teresa Laboratory, April 1993. TR 03.466.
[5]
David Callahan, Steve Can-, and Ken Kennedy. Improving Register Allocation for Subscripted Variables. Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, New York, pages 53-65, June 1990.
[6]
S. Can" and Y. Guan. Unroll-and-Jam Using Uniformly Generated Sets. Proceedings of MICRO-30, pages 349- 357, December 1997.
[7]
Steve Carr and Ken Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM TOPLAS, 16(4), November 1994.
[8]
Steve Cart and Ken Kennedy. Scalar Replacement in the Presence of Conditional Control Flow. Software-- Practice and Experience, (I):51-77, January 1994.
[9]
The Standard Performance Evaluation Corporation. SPEC CPU95 Benchmarks. http://open.specbench.org/osg/cpu95/, 1997.
[10]
Jack W. Davidson and Sanjay Jinturkar. Aggressive Loop Unrolling in a Retargetable, Optimizing Cornprier. In Compiler construction. Proceedings of the 6th international conference. Held Apr. $~-$6, 1996 in Linkoping, Sweden., volume 1060 of Lecture Notes in Computer Science. Springer-Verlag, New York, 1996.
[11]
J. J. Dongarra and A. R. Hinds. Unrolling Loops in Fortran. Software - Practice and Exper/ence, 9(3):219- 226, March 1979.
[12]
Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. On Estimating and Enhan~-xg Cache Effectiveness. Lecture Notes in Computer Science, (589):328-343, 1991. Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, Santa Clara, California, USA, August 1991. Edited by U. Banerjee, D. Gelernter, A. Nicolau, D. Padua.
[13]
J. A. Fisher, J. It. Ellis, J. C. Ruttenberg, and A. Nicolau. Parallel Processing: A Smart Compiler and a Dumb Machine. Proceedings of the A CM Symposium on Compiler Construction, pages 37 - 47, June 1984.
[14]
Iteese B. Jones and Vicki H. Allan. Software pipellning: an evaluation of enhanced pipelining. Proceedings of the ~t h annual international symposium on Microarchitecture, pages 82-92, December 1990.
[15]
Daniel M.Lavery and Wen-Mei W.Hwu. Unrollingbased optimizations for modulo scheduling. Proceedings of MICRO-S8, pages 327-337, December 1995.
[16]
T. C. Mowry. Tolerating Latency Through So, ware- Controlled Data Prefetching. Phi) thesis, Stanford University, March 1994.
[17]
Allan K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989. Rice COMP TR89-93.
[18]
B. Ramakrishna Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. Proceedings of the 27th annual international symposium on Microarchitecture, San Jose, CA USA, pages 63-74, November 1994.
[19]
Vivek Sarkar. Automatic Partitioning of a Program Dependence Graph into Parallel Tasks. IBM Journal o} Research and Development, 35(5/6), 1991.
[20]
Vivek Sarkar. Automatic Selection of High Order Transformations in the IBM XL Fortran Compilers. IBM Journal of Research and Development, 41(3), May 1997.
[21]
Vivek Sarkar and Barbara Simons. Don't Waste Those Cycles: An In-Depth Look at Scheduling Instructions in Basic Blocks and Loops. Video Lecture in University Video Communication's Distinguished Lecture Series IX, August 1994.
[22]
Vivek Sarkar and Radhika Thekkath. A General Framework for Iteration-iteordering Loop Transformations. Proceedings of the A CM SIGPLAN "9~ Conference on Programming Language Design and Implementation, pages 176-187, June 1992.
[23]
Bogong Su, Shiyuan Ding, Jian Wang, and Jinshi Xia. GURPR--a method for global software piplining. Proceedings of the ZOth annual international symposium on Microarch~tecture, pages 88-96, December 1986.
[24]
S. Weiss and J. E. Smith. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proceedings of the Second International Conference on Architectural SUpport for Programming Language and Operating Systems (ASPLOS), pages 105-109, October 1987.
[25]
Michael E. Wolf and Monica S. Lam. A Data Locality Optimization Algorithm. Proceedings of the A CM SIG- PLAN Symposium on Programming Language Design and Implementation, pages 30-44, June 1991.
[26]
Michael J. Wolfe. Optimizing Supercompilersfor Supercomputers. Pitman, London and The MIT Press, Cambridge, Massachusetts, 1989. In the series, Research Monographs in Parallel and Distributed Computing.

Cited By

View all
  • (2024) ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-guided EmulationACM Transactions on Software Engineering and Methodology10.1145/364033733:4(1-31)Online publication date: 11-Jan-2024
  • (2024)Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoostJournal of Circuits, Systems and Computers10.1142/S0218126625500355Online publication date: 23-Sep-2024
  • (2021)Using machine learning to predict the code size impact of duplication heuristics in a dynamic compilerProceedings of the 18th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3475738.3480943(127-135)Online publication date: 29-Sep-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '00: Proceedings of the 14th international conference on Supercomputing
May 2000
347 pages
ISBN:1581132700
DOI:10.1145/335231
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 May 2000

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

ICS00
Sponsor:
ICS00: International Conference on Supercomputing
May 8 - 11, 2000
New Mexico, Santa Fe, USA

Acceptance Rates

ICS '00 Paper Acceptance Rate 33 of 122 submissions, 27%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)245
  • Downloads (Last 6 weeks)27
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024) ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-guided EmulationACM Transactions on Software Engineering and Methodology10.1145/364033733:4(1-31)Online publication date: 11-Jan-2024
  • (2024)Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoostJournal of Circuits, Systems and Computers10.1142/S0218126625500355Online publication date: 23-Sep-2024
  • (2021)Using machine learning to predict the code size impact of duplication heuristics in a dynamic compilerProceedings of the 18th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes10.1145/3475738.3480943(127-135)Online publication date: 29-Sep-2021
  • (2021)Optimization of the Himeno Benchmark for SX-Aurora TSUBASABenchmarking, Measuring, and Optimizing10.1007/978-3-030-71058-3_8(127-143)Online publication date: 2-Mar-2021
  • (2019)Multi-objective Exploration for Practical Optimization Decisions in Binary TranslationACM Transactions on Embedded Computing Systems10.1145/335818518:5s(1-19)Online publication date: 7-Oct-2019
  • (2019)DCMIACM Transactions on Architecture and Code Optimization10.1145/335281316:4(1-24)Online publication date: 11-Oct-2019
  • (2018)Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance ModelsACM Transactions on Architecture and Code Optimization10.1145/329177315:4(1-27)Online publication date: 19-Dec-2018
  • (2018)Cost-driven thread coarsening for GPU kernelsProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243196(1-14)Online publication date: 1-Nov-2018
  • (2018)Fast-path loop unrolling of non-counted loops to enable subsequent compiler optimizationsProceedings of the 15th International Conference on Managed Languages & Runtimes10.1145/3237009.3237013(1-13)Online publication date: 12-Sep-2018
  • (2018)A Survey on Compiler Autotuning using Machine LearningACM Computing Surveys10.1145/319797851:5(1-42)Online publication date: 18-Sep-2018
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media