article

Compiler orchestrated prefetching via speculation and predication

Authors:

Rodric M. Rabbah,

Hariharan Sandanagobalane,

Mongkol Ekpanyapong,

Weng-Fai WongAuthors Info & Claims

ACM SIGOPS Operating Systems Review, Volume 38, Issue 5

Pages 189 - 198

https://doi.org/10.1145/1037949.1024416

Published: 07 October 2004 Publication History

Abstract

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of it sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.

References

[1]

S. Abraham and T. Johnson. Load sensitive scheduling. Personal Communication, HP Labs.

[2]

S. Abraham and B. R. Rau. Predicting load latencies using cache profiling. Technical Report HPL-94-110, HP Labs, Dec. 1994.

[3]

S. Abraham, R. Sugumar, B. R. Rau, and R. Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 139--152, Dec. 1993.

Digital Library

[4]

M. Annavaram, J. Patel, and E. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.

Digital Library

[5]

H. Argawal and J. Horgan. Dynamic program slicing. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 246--256, June 1990.

Digital Library

[6]

V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2000.

Digital Library

[7]

D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40--52, 1991.

Digital Library

[8]

M. Charney and A. Reeves. Generalized correlation-based hardware prefetching. Technical Report EE-CEG-95-1, Cornell University, Feb. 1995.

[9]

J. Collins, D. Tullsen, D. Wang, and J. Shen. Dynamic speculative precomputation. In Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001.

Digital Library

[10]

J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Long-range prefetching of delinquent loads. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.

Digital Library

[11]

G. Desoli, N. Mateev, E. Duesterwald, P. Faraboschi, and J. Fisher. Deli: A new run-time control point. In Proceedings of the 35th Annual International Symposium on Microarchitecture, Nov. 2002.

Digital Library

[12]

Z. Du, C. Lim, X. Li, C. Yang, Q. Zhao, and T. Ngai. A cost-driven compilation framework for speculative parallelization of sequential programs. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation, pages 71--81, 2004.

Digital Library

[13]

J. Edler and M. Hill. Dinero IV trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/textasciitilde markhill/DineroIV/.

[14]

J. Fu and J. Patel. Stride directed prefecthing in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 102--110, Dec. 1992.

Digital Library

[15]

D. Gallagher, W. Chen, S. Mahlke, J. Gyllenhaal, and W. Hwu. Dynamic memory disambiguation using the memory conflict buffer. In Proceedings of the 6th International Conference on Architecture Support for Programming Languages and Operating Systems, pages 183--195, 1994.

Digital Library

[16]

M. Horowitz, M. Martonosi, T. Mowry, and M. Smith. Informing loads: Enabling software to observe and react to memory behavior. Technical Report Technical Report No. CSL-TR-95-673, Stanford University, 1995.

Digital Library

[17]

M. Horowitz, M. Martonosi, T. Mowry, and M. Smith. Informing memory operations: providing memory performance feedback in modern processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.

Digital Library

[18]

W. Hwu, S. Mahlke, W. Chen, P. Chang, N. Warter, R. Bringmann, R. Ouellette, R. Hank, T. Kiyohara, G. Haab, J. Holm, and D. Lavery. The superblock: An effective technique for VLIW and superscalar compilation. Journal of Supercomputing, Jan. 1993.

Digital Library

[19]

D. Joseph and D. Grunwald. Prefetching using M arkov predictors. IEEE Transactions on Computers, 48(2):121--133, Feb. 1999.

Digital Library

[20]

V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical Report HPL-9380 (R.1), HP Labs, Feb. 2000.

[21]

D. Kim, S. Liao, P. Wang, J. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Gikar, and J. Shen. Physical experimentation with prefetching helper threads on I ntel's H yper-T hreaded processors. In Proceedings of the Second Annual IEEE/ACM International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization, Mar. 2004.

Digital Library

[22]

A. Klaiber and H. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture, pages 43--53, 1991.

Digital Library

[23]

A. KleinOsowski and D. Lilja. MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research. Computer Architecture Letters, 1, 2002.

Digital Library

[24]

S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. Shen. Post-pass binary adaptation for software-based speculative precomputation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 117--128, June 2002.

Digital Library

[25]

C. Luk and T. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222--233, Oct. 1996.

Digital Library

[26]

S. Mahlke, W. Chen, R. Bringmann, R.Hank, W. Hwu, B. R. Rau, and M. Schlansker. Sentinel scheduling: a model for compiler-controlled speculative execution. ACM Transactions on Computer Systems, 11(4), Nov. 1993.

Digital Library

[27]

S. Mahlke, D. Lin, W. Chen, R. Hank, and R. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 45--54, Dec. 1992.

Digital Library

[28]

T. Mowry, M. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62--73, Oct. 1992.

Digital Library

[29]

Open Research Compiler for the I ntel I tanium. http://ipf-orc.sourceforge.net.

[30]

T. Ozawa, Y. Kimura, and S. Nishizaki. Cache miss heuristics and preloading techniques for general-purpose programs. In Proceedings of the 28th Annual International Symposium on Microarchitecture, Nov. 1995.

Digital Library

[31]

V. Panait, A. Sasturkar, and W. Wong. Static identification of delinquent loads. In Proceedings of the Second Annual IEEE/ACM International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization, Mar. 2004.

Digital Library

[32]

Performance application programming interface. http://icl.cs.utk.edu/papi/.

[33]

B. R. Rau. Iterative modulo scheduling. Technical Report Technical Report HPL-94-115, HP Labs, Nov. 1995.

[34]

A. Rogers and K. Li. Software support for speculative loads. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, 1992.

Digital Library

[35]

T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 42--53, 2000.

Digital Library

[36]

M. Smith, M. Lam, and M. Horowitz. Boosting beyond static scheduling in a superscalar processor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 344--354, 1990.

Digital Library

[37]

R. Tomasulo. An efficient hardware algorithm for exploiting multiple arithmetic units. IBM Journal, 44-5:25--33, Jan. 1967.

Digital Library

[38]

Trimaran: An infrastructure for research in instruction level parallelism. http://www.trimaran.org.

[39]

S. Vanderwiel and D. Lilja. A compiler-assisted data prefetch controller. In Proceedings of the International Conference on Computer Design, pages 372--377, 1999.

Digital Library

[40]

Y. Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 210--221, 2002.

Digital Library

[41]

C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.

Digital Library

Cited By

Mittal S(2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys (CSUR)10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
https://dl.acm.org/doi/10.1145/2907071
Li HLiu KLiang TLi ZLu TYuan HXia YBao YChen MShan Y(2023)HoPP: Hardware-Software Co-Designed Page Prefetching for Disaggregated Memory2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10070986(1168-1181)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10070986
Al Maruf HChowdhury MGavrilovska AZadok E(2020)Effectively prefetching remote memory with leapProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489204(843-857)Online publication date: 15-Jul-2020
https://dl.acm.org/doi/10.5555/3489146.3489204
Show More Cited By

Index Terms

Compiler orchestrated prefetching via speculation and predication
1. Hardware
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Compiler orchestrated prefetching via speculation and predication
ASPLOS 2004

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program ...
Compiler orchestrated prefetching via speculation and predication
ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program ...
Compiler orchestrated prefetching via speculation and predication
ASPLOS '04

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGOPS Operating Systems Review

ACM SIGOPS Operating Systems Review Volume 38, Issue 5

ASPLOS '04

December 2004

283 pages

ISSN:0163-5980

DOI:10.1145/1037949

Issue’s Table of Contents

ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004
296 pages
ISBN:1581138040
DOI:10.1145/1024393
General Chair:
Shubu Mukherjee
Intel Corporation
,
Program Chair:
Kathryn S. McKinley
University of Texas at Austin

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 October 2004

Published in SIGOPS Volume 38, Issue 5

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

25
Total Citations
View Citations
941
Total Downloads

Downloads (Last 12 months)14
Downloads (Last 6 weeks)0

Reflects downloads up to 23 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Mittal S(2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys (CSUR)10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
https://dl.acm.org/doi/10.1145/2907071
Li HLiu KLiang TLi ZLu TYuan HXia YBao YChen MShan Y(2023)HoPP: Hardware-Software Co-Designed Page Prefetching for Disaggregated Memory2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10070986(1168-1181)Online publication date: Feb-2023
https://doi.org/10.1109/HPCA56546.2023.10070986
Al Maruf HChowdhury MGavrilovska AZadok E(2020)Effectively prefetching remote memory with leapProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489204(843-857)Online publication date: 15-Jul-2020
https://dl.acm.org/doi/10.5555/3489146.3489204
Mittal S(2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
https://dl.acm.org/doi/10.1145/2907071
Atta ITong XSrinivasan VBaldini IMoshovos APrvulovic M(2015)Self-contained, accurate precomputation prefetchingProceedings of the 48th International Symposium on Microarchitecture10.1145/2830772.2830816(153-165)Online publication date: 5-Dec-2015
https://dl.acm.org/doi/10.1145/2830772.2830816
Khan MLaurenzanoy MMarsy JHagersten EBlack-Schaffer D(2015)AREPProceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT)10.1109/PACT.2015.35(367-378)Online publication date: 18-Oct-2015
https://dl.acm.org/doi/10.1109/PACT.2015.35
Jiang CZhang YZheng W(2015)Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity ProcessorsProceedings of the 2015 44th International Conference on Parallel Processing (ICPP)10.1109/ICPP.2015.71(619-628)Online publication date: 1-Sep-2015
https://dl.acm.org/doi/10.1109/ICPP.2015.71
Khan MHagersten E(2014)Resource conscious prefetching for irregular applications in multicores2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)10.1109/SAMOS.2014.6893192(34-43)Online publication date: Jul-2014
https://doi.org/10.1109/SAMOS.2014.6893192
Khan MSandberg AHagersten E(2014)A Case for Resource Efficient Prefetching in MulticoresProceedings of the 2014 Brazilian Conference on Intelligent Systems10.1109/ICPP.2014.19(101-110)Online publication date: 18-Oct-2014
https://dl.acm.org/doi/10.1109/ICPP.2014.19
Cong JHuang HLiu CZou YStok LDutt NHassoun S(2011)A reuse-aware prefetching scheme for scratchpad memoryProceedings of the 48th Design Automation Conference10.1145/2024724.2024937(960-965)Online publication date: 5-Jun-2011
https://dl.acm.org/doi/10.1145/2024724.2024937
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents