research-article

Energy-efficient hardware data prefetching

Published: 01 February 2011

Abstract

Extensive research has been done on prefetching techniques that hide memory latency in microprocessors and thereby improve performance. However, the energy impact of prefetching remains relatively unexplored. While aggressive prefetching techniques often improve performance, they increase memory-system energy consumption by as much as 30%. This paper provides a detailed evaluation of the energy impact of hardware data prefetching and then presents a set of new energy-aware techniques to overcome the energy overhead of such schemes. These include compiler-assisted and hardware-based energy-aware techniques, and a new power-aware prefetch engine that can reduce the energy consumption related to hardware prefetching by 7-11×. Combined with the leakage energy reduction that results from the performance improvement, these techniques can bring the total memory-system energy consumption up to 12% below that of a baseline with no prefetching.
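To show concretely what a hardware data prefetcher does and where filtering can save energy, here is a minimal Python sketch. It assumes a PC-indexed stride prefetcher in the style of a classic reference prediction table. It also assumes a small LRU buffer of recently prefetched line addresses that drops redundant prefetches, each of which would otherwise probe the cache tags. The class names, the 32-byte line size and the 8-entry filter are hypothetical choices for illustration. This is not the paper's prefetch engine or its filtering design.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class RptEntry:
    """Reference-prediction-table entry for one load instruction (indexed by PC)."""
    last_addr: int
    stride: int = 0
    confident: bool = False


class FilteredStridePrefetcher:
    """Hypothetical PC-indexed stride prefetcher with a redundant-prefetch filter.

    line_size and filter_size are illustrative values, not parameters from the paper.
    """

    def __init__(self, line_size: int = 32, filter_size: int = 8) -> None:
        self.line_size = line_size
        self.filter_size = filter_size
        self.table = {}              # load PC -> RptEntry
        self.recent = OrderedDict()  # recently prefetched line addresses, LRU order
        self.issued = 0              # prefetches sent to the cache
        self.filtered = 0            # prefetches dropped as redundant

    def access(self, pc: int, addr: int) -> list:
        """Observe one load and return the line addresses prefetched in response."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: just remember its address.
            self.table[pc] = RptEntry(last_addr=addr)
            return []

        stride = addr - entry.last_addr
        # Trust the stride only after the same non-zero stride repeats.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not entry.confident:
            return []

        # Align the predicted next address down to the start of its cache line.
        line = (addr + stride) // self.line_size * self.line_size
        if line in self.recent:
            # Line was prefetched recently: drop it and skip the tag lookup it would cost.
            self.recent.move_to_end(line)
            self.filtered += 1
            return []

        self.recent[line] = None
        if len(self.recent) > self.filter_size:
            self.recent.popitem(last=False)  # evict the least recently used line
        self.issued += 1
        return [line]
```

Consider a loop in which a single load PC reads 16 consecutive 4-byte words starting at 0x1000. Once the stride is confirmed, this trace yields 14 prefetch candidates, but they map to only three distinct 32-byte lines. The filter therefore issues 3 prefetches (0x1000, 0x1020, 0x1040) and drops 11. These counts describe this toy trace only, not the paper's measurements.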

Information

Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 19, Issue 2
February 2011
174 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 February 2011
Revised: 02 April 2009
Received: 16 July 2008

Author Tags

  1. compiler analysis
  2. data prefetching
  3. energy efficiency
  4. prefetch filtering
  5. prefetch hardware

Cited By

  • (2021) Compiler-Assisted Data Streaming for Regular Code Structures. IEEE Transactions on Computers 70:3, pp. 483-494. DOI: 10.1109/TC.2020.2990302. Online publication date: 1-Mar-2021.
  • (2021) Unlimited vector extension with data streaming support. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 209-222. DOI: 10.1109/ISCA52012.2021.00025. Online publication date: 14-Jun-2021.
  • (2019) An Empirical Study on the Energy Efficiency of Matrix Transposition Algorithms. Composability, Comprehensibility and Correctness of Working Software, pp. 375-391. DOI: 10.1007/978-3-031-42833-3_11. Online publication date: 17-Jun-2019.
  • (2018) Utility Aware Snoozy Caches for Energy Efficient Chip Multi-Processors. Proceedings of the 2018 Great Lakes Symposium on VLSI, pp. 249-254. DOI: 10.1145/3194554.3194581. Online publication date: 30-May-2018.
  • (2017) Adaptive In-Cache Streaming for Efficient Data Management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25:7, pp. 2130-2143. DOI: 10.1109/TVLSI.2017.2671405. Online publication date: 1-Jul-2017.
  • (2017) Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors. International Journal of Parallel Programming 45:3, pp. 530-550. DOI: 10.1007/s10766-016-0431-8. Online publication date: 1-Jun-2017.
  • (2016) A Survey of Recent Prefetching Techniques for Processor Caches. ACM Computing Surveys 49:2, pp. 1-35. DOI: 10.1145/2907071. Online publication date: 2-Aug-2016.
  • (2013) An adaptive energy-conserving strategy for parallel disk systems. Future Generation Computer Systems 29:1, pp. 196-207. DOI: 10.1016/j.future.2012.05.003. Online publication date: 1-Jan-2013.
  • (2012) S/DC. Proceedings of the Conference on Design, Automation and Test in Europe, pp. 461-466. DOI: 10.5555/2492708.2492825. Online publication date: 12-Mar-2012.
  • (2011) DRAM energy reduction by prefetching-based memory traffic clustering. Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI, pp. 103-108. DOI: 10.1145/1973009.1973031. Online publication date: 2-May-2011.
