article

Quantifying the Multi-Level Nature of Tiling Interactions

Authors:

Nicholas Mitchell,

Karin Högstedt,

Jeanne FerranteAuthors Info & Claims

International Journal of Parallel Programming, Volume 26, Issue 6

Pages 641 - 670

Published: 01 December 1998 Publication History

Abstract

Optimizations, including tiling, often target a single level of memory or parallelism, such as cache. These optimizations usually operate on a level-by-level basis, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by these one-level cost functions decreases as architectures tend towards a hierarchy of memory and of parallelism. We have identified three common architectural scenarios where a single tiling choice could be improved by using information from multiple levels in concert. For each scenario, we derive multi-level cost functions which guide the optimal choice of tile size and shape, and quantify the improvement gained. We give both analysis and simulation results to support our points.

References

[1]

1. Michael E. Wolf and Monica S. Lam, A data locality optimizing algorithm, Progr. Lang. Design Implementation (1991).

[2]

2. Steve Carr and Ken Kennedy, Compiler blockability of numerical algorithms, J. Supercomputing , pp. 114-124 (November 1992).

Digital Library

[3]

3. Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng, Compiler optimizations for improving data locality, Sixth Int'l. Conf. Archit. Support Progr. Lang. Oper. Syst., San Jose, California, Oct. 1994.

[4]

4. Steve Carr and Ken Kennedy, Improving the ratio of memory operations to floatingpoint operations in loops, Trans. Progr. Lang. Syst. 16(6):1768-1810 (November 1994).

Digital Library

[5]

5. Corinne Ancourt and François Irigoin, Scanning polyhedra with DO loops, Principles and Practice of Parallel Progr., pp. 39-50 (April 1991).

[6]

6. Michael E. Wolf and Monica S. Lam, A loop transformation theory and an algorithm to maximize parallelism, IEEE Trans. Parallel Distrib. Syst. 2(4):452-471 (1991).

Digital Library

[7]

7. Paul Feautrier, Some efficient solutions to the affine scheduling problem, Part I, one-dimensional time, IJPP 21(5):xx-xx (October 1992).

[8]

8. Wayne Kelly and William Pugh, A unifying framework for iteration reordering transformations, IEEE First Int'l. Conf. Algorithms and Architectures for Parallel Processing (April 1995).

[9]

9. Daniel Lavery and Wen-mei Hwu, Unrolling-based optimizations for modulo scheduling, 28th Int'l. Symp. Microarchit., pp. 126-141 (December 1995).

[10]

10. Stephanie Coleman and Kathryn S. McKinley, Tile size selection using cache organization and data layout, Progr. Lang. Design and Implementation (June 1995).

[11]

11. Vivek Sarkar, Guang R. Gao, and Shaohua Han, Locality analysis for distributed shared-memory multiprocessors, Lang. Compilers for Parallel Computing (1996).

[12]

12. Dennis Gannon and Ko-Yang Wang, Applying AI Techniques to Program Optimization for Parallel Computers, Chap. 12, McGraw Hill Co. (1989).

[13]

13. Michael E. Wolf, Dror Maydan, and Ding-Kai Chen, Combining loop transformations considering caches and scheduling, 29th Int'l. Symp. Microarchit. (December 1996).

[14]

14. Michael J. Wolfe, Iteration space tiling for memory hierarchies, Parallel Processing for Sci. Comput., pp. 357-361 (1987).

[15]

15. J. Ramanujam and P. Sadayappan, Tiling multidimensional iteration spaces for nonshared memory machines, Supercomputing (November 1991).

[16]

16. David A. Padua and Michael J. Wolfe, Advanced compiler optimizations for supercomputers, Commun. ACM 29(12):1184-1201 (December 1986).

Digital Library

[17]

17. Dennis Gannon, William Jalby, and Kyle Gallivan, Strategies for cache and local memory management by global program transformation, J. Parallel and Distrib. Comput., Vol. 5, No. 5 (October 1988).

Digital Library

[18]

18. François Irigoin and Rémi Triolet, Supernode partitioning, Principles of Progr. Lang., pp. 319-328 (January 1988).

[19]

19. Michael J. Wolfe, More iteration space tiling, Supercomputing, pp. 655-664 (1989).

[20]

20. Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf, The cache performance and optimizations of blocked algorithms, ASPLOS-IV, Palo Alto, California (April 1991).

[21]

21. Utpal Banerjee, Unimodular transformations of double loops, in Progr. Lang. Compilers for Parallel Computing, Irvine, California (August 1990).

[22]

22. Ken Kennedy and Kathryn S. McKinley, Optimizing for parallelism and data locality, Int'l. Conf. Supercomputing (July 1992).

[23]

23. Jeanne Ferrante, Vivek Sarkar, and Wedy Thrash, On estimating and enhancing cache effectiveness, Lang. Compilers for Parallel Computing (1991).

[24]

24. Anant Agarwal, David Kranz, and Venkat Natarajan, Automatic partitioning of parallel loops and data arrays for distributed shared memory multiprocessors, Int'l. Conf. Parallel Computing (1993).

[25]

25. Vivek Sarkar and Radhika Thekkath, A general framework for iteration-reordering loop transformations, Technical Summary, Progr. Lang. Design and Implementation (1992).

[26]

26. Steve Carr, Combining optimization for cache and instruction-level parallelism, PACT '96, pp. 238-247 (1996).

[27]

27. Ken Kennedy and Kathryn S. McKinley, Maximizing loop parallelism and improving data locality via loop fusion and distribution, Lang. Compilers for Parallel Computing (1993).

[28]

28. Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, Int'l. Conf. Supercomputing (1997).

Digital Library

[29]

29. Larry Carter, Jeanne Ferrante, Susan Flynn Hummel, Bowen Alpern, and Kang Su Gatlin, Hierarchical tiling: A methodology for high performance, Technical Report CS96- 508, UCSD, Department of Computer Science and Engineering (November 1996).

[30]

30. Doug Burger and Todd Austin, The SimpleScalar architectural research tool set, Version 2.0, http://www.cs.wisc.edu/~mscalar/simplescalar.html.

[31]

31. Karin Högstedt, Larry Carter, and Jeanne Ferrante, Determining the idle time of a tiling, Principles of Progr. Lang. (1997).

[32]

32. Larry Carter, Jeanne Ferrante, and S. Flynn Hummel, Hierarchical tiling for improved superscalar performance, Int'l. Parallel Processing Symp. (April 1995).

Digital Library

Cited By

Bao BDing C(2013)Defensive loop tiling for shared cacheProceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2013.6495008(1-11)Online publication date: 23-Feb-2013
https://dl.acm.org/doi/10.1109/CGO.2013.6495008
Ma WKrishnamoorthy SAgrawal G(2011)Practical loop transformations for tensor contraction expressions on multi-level memory hierarchiesProceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software10.5555/1987237.1987258(266-285)Online publication date: 26-Mar-2011
https://dl.acm.org/doi/10.5555/1987237.1987258
Yuki TRenganarayanan LRajopadhye SAnderson CEichenberger AO'Brien KMoshovos ASteffan GHazelwood KKaeli D(2010)Automatic creation of tile size selection modelsProceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization10.1145/1772954.1772982(190-199)Online publication date: 24-Apr-2010
https://dl.acm.org/doi/10.1145/1772954.1772982
Show More Cited By

Index Terms

Quantifying the Multi-Level Nature of Tiling Interactions

Recommendations

Multi-level tiling: M for the price of one
SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality. Efficient generation ...
Increasing hardware data prefetching performance using the second-level cache

Techniques to reduce or tolerate large memory latencies are critical for achieving high processor performance. Hardware data prefetching is one of the most heavily studied solutions, but it is essentially applied to first-level caches where it can ...
Design and Optimization of Large Size and Low Overhead Off-Chip Caches

Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of Parallel Programming

International Journal of Parallel Programming Volume 26, Issue 6

December 1998

58 pages

ISSN:0885-7458

Issue’s Table of Contents

Copyright © Copyright © 1998 Plenum Publishing Corporation.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 December 1998

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

24
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 14 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Bao BDing C(2013)Defensive loop tiling for shared cacheProceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2013.6495008(1-11)Online publication date: 23-Feb-2013
https://dl.acm.org/doi/10.1109/CGO.2013.6495008
Ma WKrishnamoorthy SAgrawal G(2011)Practical loop transformations for tensor contraction expressions on multi-level memory hierarchiesProceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software10.5555/1987237.1987258(266-285)Online publication date: 26-Mar-2011
https://dl.acm.org/doi/10.5555/1987237.1987258
Yuki TRenganarayanan LRajopadhye SAnderson CEichenberger AO'Brien KMoshovos ASteffan GHazelwood KKaeli D(2010)Automatic creation of tile size selection modelsProceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization10.1145/1772954.1772982(190-199)Online publication date: 24-Apr-2010
https://dl.acm.org/doi/10.1145/1772954.1772982
Renganarayana LBondhugula UDerisavi SEichenberger AO'Brien KPinfold W(2009)Compact multi-dimensional kernel extraction for register tilingProceedings of the Conference on High Performance Computing Networking, Storage and Analysis10.1145/1654059.1654105(1-12)Online publication date: 14-Nov-2009
https://dl.acm.org/doi/10.1145/1654059.1654105
Renganarayana LRajopadhye STeller P(2008)Positivity, posynomials and tile size selectionProceedings of the 2008 ACM/IEEE conference on Supercomputing10.5555/1413370.1413426(1-12)Online publication date: 15-Nov-2008
https://dl.acm.org/doi/10.5555/1413370.1413426
Zhang YChen YTang YKahaner D(2007)Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputersProceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing10.1145/1375783.1375809(115-125)Online publication date: 11-Nov-2007
https://dl.acm.org/doi/10.1145/1375783.1375809
Qasem AKennedy KEgan GMuraoka Y(2006)Profitable loop fusion and tiling using model-driven empirical searchProceedings of the 20th annual international conference on Supercomputing10.1145/1183401.1183437(249-258)Online publication date: 28-Jun-2006
https://dl.acm.org/doi/10.1145/1183401.1183437
Krishnan SKrishnamoorthy SBaumgartner GLam CRamanujam JSadayappan PChoppella V(2006)Efficient synthesis of out-of-core algorithms using a nonlinear optimization solverJournal of Parallel and Distributed Computing10.1016/j.jpdc.2005.06.01766:5(659-673)Online publication date: 1-May-2006
https://dl.acm.org/doi/10.1016/j.jpdc.2005.06.017
Sahoo SKrishnamoorthy SPanuganti RSadayappan PKramer W(2005)Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction ExpressionsProceedings of the 2005 ACM/IEEE conference on Supercomputing10.1109/SC.2005.35Online publication date: 12-Nov-2005
https://dl.acm.org/doi/10.1109/SC.2005.35
Li XGarzarán M(2005)Optimizing matrix multiplication with a classifier learning systemProceedings of the 18th international conference on Languages and Compilers for Parallel Computing10.1007/978-3-540-69330-7_9(121-135)Online publication date: 20-Oct-2005
https://dl.acm.org/doi/10.1007/978-3-540-69330-7_9
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents