Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Compiler orchestrated prefetching via speculation and predication

Published: 07 October 2004 Publication History

Abstract

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of it sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.

References

[1]
S. Abraham and T. Johnson. Load sensitive scheduling. Personal Communication, HP Labs.
[2]
S. Abraham and B. R. Rau. Predicting load latencies using cache profiling. Technical Report HPL-94-110, HP Labs, Dec. 1994.
[3]
S. Abraham, R. Sugumar, B. R. Rau, and R. Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 139--152, Dec. 1993.
[4]
M. Annavaram, J. Patel, and E. Davidson. Data prefetching by dependence graph precomputation. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.
[5]
H. Argawal and J. Horgan. Dynamic program slicing. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 246--256, June 1990.
[6]
V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2000.
[7]
D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 40--52, 1991.
[8]
M. Charney and A. Reeves. Generalized correlation-based hardware prefetching. Technical Report EE-CEG-95-1, Cornell University, Feb. 1995.
[9]
J. Collins, D. Tullsen, D. Wang, and J. Shen. Dynamic speculative precomputation. In Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001.
[10]
J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Long-range prefetching of delinquent loads. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.
[11]
G. Desoli, N. Mateev, E. Duesterwald, P. Faraboschi, and J. Fisher. Deli: A new run-time control point. In Proceedings of the 35th Annual International Symposium on Microarchitecture, Nov. 2002.
[12]
Z. Du, C. Lim, X. Li, C. Yang, Q. Zhao, and T. Ngai. A cost-driven compilation framework for speculative parallelization of sequential programs. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation, pages 71--81, 2004.
[13]
J. Edler and M. Hill. Dinero IV trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/textasciitilde markhill/DineroIV/.
[14]
J. Fu and J. Patel. Stride directed prefecthing in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 102--110, Dec. 1992.
[15]
D. Gallagher, W. Chen, S. Mahlke, J. Gyllenhaal, and W. Hwu. Dynamic memory disambiguation using the memory conflict buffer. In Proceedings of the 6th International Conference on Architecture Support for Programming Languages and Operating Systems, pages 183--195, 1994.
[16]
M. Horowitz, M. Martonosi, T. Mowry, and M. Smith. Informing loads: Enabling software to observe and react to memory behavior. Technical Report Technical Report No. CSL-TR-95-673, Stanford University, 1995.
[17]
M. Horowitz, M. Martonosi, T. Mowry, and M. Smith. Informing memory operations: providing memory performance feedback in modern processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.
[18]
W. Hwu, S. Mahlke, W. Chen, P. Chang, N. Warter, R. Bringmann, R. Ouellette, R. Hank, T. Kiyohara, G. Haab, J. Holm, and D. Lavery. The superblock: An effective technique for VLIW and superscalar compilation. Journal of Supercomputing, Jan. 1993.
[19]
D. Joseph and D. Grunwald. Prefetching using M arkov predictors. IEEE Transactions on Computers, 48(2):121--133, Feb. 1999.
[20]
V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical Report HPL-9380 (R.1), HP Labs, Feb. 2000.
[21]
D. Kim, S. Liao, P. Wang, J. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Gikar, and J. Shen. Physical experimentation with prefetching helper threads on I ntel's H yper-T hreaded processors. In Proceedings of the Second Annual IEEE/ACM International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization, Mar. 2004.
[22]
A. Klaiber and H. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th International Symposium on Computer Architecture, pages 43--53, 1991.
[23]
A. KleinOsowski and D. Lilja. MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research. Computer Architecture Letters, 1, 2002.
[24]
S. Liao, P. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. Shen. Post-pass binary adaptation for software-based speculative precomputation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 117--128, June 2002.
[25]
C. Luk and T. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 222--233, Oct. 1996.
[26]
S. Mahlke, W. Chen, R. Bringmann, R.Hank, W. Hwu, B. R. Rau, and M. Schlansker. Sentinel scheduling: a model for compiler-controlled speculative execution. ACM Transactions on Computer Systems, 11(4), Nov. 1993.
[27]
S. Mahlke, D. Lin, W. Chen, R. Hank, and R. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 45--54, Dec. 1992.
[28]
T. Mowry, M. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62--73, Oct. 1992.
[29]
Open Research Compiler for the I ntel I tanium. http://ipf-orc.sourceforge.net.
[30]
T. Ozawa, Y. Kimura, and S. Nishizaki. Cache miss heuristics and preloading techniques for general-purpose programs. In Proceedings of the 28th Annual International Symposium on Microarchitecture, Nov. 1995.
[31]
V. Panait, A. Sasturkar, and W. Wong. Static identification of delinquent loads. In Proceedings of the Second Annual IEEE/ACM International Symposium on Code Generation and Optimization with Special Emphasis on Feedback-Directed and Runtime Optimization, Mar. 2004.
[32]
Performance application programming interface. http://icl.cs.utk.edu/papi/.
[33]
B. R. Rau. Iterative modulo scheduling. Technical Report Technical Report HPL-94-115, HP Labs, Nov. 1995.
[34]
A. Rogers and K. Li. Software support for speculative loads. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, 1992.
[35]
T. Sherwood, S. Sair, and B. Calder. Predictor-directed stream buffers. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 42--53, 2000.
[36]
M. Smith, M. Lam, and M. Horowitz. Boosting beyond static scheduling in a superscalar processor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 344--354, 1990.
[37]
R. Tomasulo. An efficient hardware algorithm for exploiting multiple arithmetic units. IBM Journal, 44-5:25--33, Jan. 1967.
[38]
Trimaran: An infrastructure for research in instruction level parallelism. http://www.trimaran.org.
[39]
S. Vanderwiel and D. Lilja. A compiler-assisted data prefetch controller. In Proceedings of the International Conference on Computer Design, pages 372--377, 1999.
[40]
Y. Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 210--221, 2002.
[41]
C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In Proceedings of 28th International Symposium on Computer Architecture, July 2001.

Cited By

View all
  • (2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys (CSUR)10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
  • (2023)HoPP: Hardware-Software Co-Designed Page Prefetching for Disaggregated Memory2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10070986(1168-1181)Online publication date: Feb-2023
  • (2020)Effectively prefetching remote memory with leapProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489204(843-857)Online publication date: 15-Jul-2020
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGOPS Operating Systems Review
ACM SIGOPS Operating Systems Review  Volume 38, Issue 5
ASPLOS '04
December 2004
283 pages
ISSN:0163-5980
DOI:10.1145/1037949
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
    October 2004
    296 pages
    ISBN:1581138040
    DOI:10.1145/1024393
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 October 2004
Published in SIGOPS Volume 38, Issue 5

Check for updates

Author Tags

  1. precomputation
  2. predicated execution
  3. prefetching
  4. speculation

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)0
Reflects downloads up to 23 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys (CSUR)10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
  • (2023)HoPP: Hardware-Software Co-Designed Page Prefetching for Disaggregated Memory2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10070986(1168-1181)Online publication date: Feb-2023
  • (2020)Effectively prefetching remote memory with leapProceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference10.5555/3489146.3489204(843-857)Online publication date: 15-Jul-2020
  • (2016)A Survey of Recent Prefetching Techniques for Processor CachesACM Computing Surveys10.1145/290707149:2(1-35)Online publication date: 2-Aug-2016
  • (2015)Self-contained, accurate precomputation prefetchingProceedings of the 48th International Symposium on Microarchitecture10.1145/2830772.2830816(153-165)Online publication date: 5-Dec-2015
  • (2015)AREPProceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT)10.1109/PACT.2015.35(367-378)Online publication date: 18-Oct-2015
  • (2015)Software-Based Lightweight Multithreading to Overlap Memory-Access Latencies of Commodity ProcessorsProceedings of the 2015 44th International Conference on Parallel Processing (ICPP)10.1109/ICPP.2015.71(619-628)Online publication date: 1-Sep-2015
  • (2014)Resource conscious prefetching for irregular applications in multicores2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)10.1109/SAMOS.2014.6893192(34-43)Online publication date: Jul-2014
  • (2014)A Case for Resource Efficient Prefetching in MulticoresProceedings of the 2014 Brazilian Conference on Intelligent Systems10.1109/ICPP.2014.19(101-110)Online publication date: 18-Oct-2014
  • (2011)A reuse-aware prefetching scheme for scratchpad memoryProceedings of the 48th Design Automation Conference10.1145/2024724.2024937(960-965)Online publication date: 5-Jun-2011
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media