Research Article | Open Access

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks

Published: 09 January 2015

Abstract

Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution.
First, we observe that over 95% of useful prefetches in a wide variety of applications are not reused after the first demand hit (in secondary caches). Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor to predict if a prefetch is accurate or inaccurate. Only predicted-accurate prefetches are inserted into the cache with a high priority.
Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.
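As a companion to the abstract, the following is a minimal sketch of the insertion and promotion policy it describes, modeled on a single LRU-managed cache set where "priority" is position in the LRU stack. The class and method names, the externally supplied accuracy estimate, and the 0.5 accuracy threshold are illustrative assumptions, not details taken from the paper.

```cpp
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Single-set LRU stack: front = MRU (highest priority), back = LRU (lowest).
class CacheSet {
    struct Line { std::uint64_t tag; bool prefetched; };

    std::size_t ways_;
    std::list<Line> stack_;                       // MRU at front, LRU at back
    std::unordered_map<std::uint64_t, std::list<Line>::iterator> index_;
    double prefetch_accuracy_ = 0.0;              // hypothetical externally supplied estimate

public:
    explicit CacheSet(std::size_t ways) : ways_(ways) {}

    // Fed by a (hypothetical) predictor that tracks what fraction of recent
    // prefetches received a demand hit before being evicted.
    void set_prefetch_accuracy(double a) { prefetch_accuracy_ = a; }

    // Demand access. Mechanism 1: on the first demand hit to a prefetched line,
    // demote it to the lowest priority instead of promoting it to MRU, since
    // such lines are rarely reused again. Demand-fetched lines get normal LRU
    // promotion. Returns true on a hit.
    bool demand_access(std::uint64_t tag) {
        auto it = index_.find(tag);
        if (it != index_.end()) {
            Line line = *it->second;
            stack_.erase(it->second);
            index_.erase(it);
            if (line.prefetched) {
                line.prefetched = false;                      // consumed by its demand hit
                insert_at(line, /*high_priority=*/false);     // demote to LRU
            } else {
                insert_at(line, /*high_priority=*/true);      // promote to MRU
            }
            return true;
        }
        evict_if_full();
        insert_at({tag, /*prefetched=*/false}, /*high_priority=*/true);
        return false;
    }

    // Prefetch fill. Mechanism 2: only predicted-accurate prefetches are
    // inserted with high priority; predicted-inaccurate ones go straight to the
    // LRU position so they are evicted quickly if they turn out to be useless.
    void prefetch_fill(std::uint64_t tag) {
        if (index_.count(tag)) return;            // already cached
        evict_if_full();
        insert_at({tag, /*prefetched=*/true},
                  /*high_priority=*/prefetch_accuracy_ >= 0.5);
    }

private:
    void evict_if_full() {
        if (stack_.size() == ways_) {
            index_.erase(stack_.back().tag);
            stack_.pop_back();
        }
    }
    void insert_at(Line line, bool high_priority) {
        if (high_priority) {
            stack_.push_front(line);
            index_[line.tag] = stack_.begin();
        } else {
            stack_.push_back(line);
            index_[line.tag] = std::prev(stack_.end());
        }
    }
};
```

In the paper's mechanism the accuracy prediction comes from a self-tuning hardware predictor; the sketch collapses that into a single per-set fraction purely to keep the example short.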

Published In

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4 (January 2015), 797 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2695583

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 09 January 2015
Accepted: 01 October 2014
Revised: 01 October 2014
Received: 01 February 2014
Published in TACO Volume 11, Issue 4


Author Tags

1. Prefetching
2. cache insertion/promotion policy
3. cache pollution
4. caches
