research-article

Open access

Global Dead-Block Management for Task-Parallel Programs

Authors:

Madhavan Manivannan,

Miquel Pericás,

Vassilis Papaefstathiou,

Per StenströmAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 3

Article No.: 33, Pages 1 - 25

https://doi.org/10.1145/3234337

Published: 04 September 2018 Publication History

All formats PDF

Abstract

Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. Existing dead-block prediction schemes take decisions locally for each cache level and do not efficiently manage the entire cache hierarchy. This article introduces runtime-orchestrated global dead-block management, in which static and dynamic information about tasks available to the runtime system is used to effectively detect and manage dead blocks across the cache hierarchy. In the proposed global management schemes, static information (e.g., when tasks start/finish, and what data regions tasks produce/consume) is combined with dynamic information to detect when/where blocks become dead. When memory regions are deemed dead at some cache level(s), all the associated cache blocks are evicted from the corresponding level(s). We extend the cache controllers at both private and shared cache levels to use the aforementioned information to evict dead blocks. The article does an extensive evaluation of both inclusive and non-inclusive cache hierarchies and shows that the proposed global schemes outperform existing local dead-block management schemes.

References

[1]

Jaume Abella, Antonio González, Xavier Vera, and Michael F. P. O’Boyle. 2005. IATAC: A smart predictor to turn-off L2 cache lines. ACM Trans. Archit. Code Optim. 2, 1 (Mar. 2005), 55--77.

Digital Library

[2]

Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, Xavier Martorell, Jesús Labarta, Eduard Ayguadé, and Mateo Valero. 2015a. Runtime-guided management of hybrid memory hierarchies in multicore architectures. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation.

Digital Library

[3]

Lluc Alvarez, Lluís Vilanova, Miquel Moretó, Marc Casas, Marc González, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, and Mateo Valero. 2015b. Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 720--732.

Digital Library

[4]

Kristof Beyls and Erik H. D’Hollander. 2002. Reuse distance-based cache hint selection. In Proceedings of the European Conference on Parallel Processing (Euro-Par’02). Springer, 265--275.

Digital Library

[5]

Kristof Beyls and Erik H. D’Hollander. 2005. Generating cache hints for improved program efficiency. J. Syst. Architect. 51, 4 (2005), 223--250.

Digital Library

[6]

Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. 1999. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46, 2 (Mar. 1999), 281--321.

Digital Library

[7]

Ioana Burcea, Livio Soares, and Andreas Moshovos. 2012. Pointy: A hybrid pointer prefetcher for managed runtime systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT’12). ACM, New York, NY, 97--106.

Digital Library

[8]

Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 52.

Digital Library

[9]

Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas Subramoney, and Joseph Nuzman. 2012. Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’12). 293--304.

Digital Library

[10]

Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. 2007. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the Symposium on Parallel Algorithms and Architectures (SPAA’07).

Digital Library

[11]

Vladimir Dimić, Miquel Moretó, Marc Casas, and Mateo Valero. 2017. Runtime-assisted shared cache insertion policies based on re-reference intervals. In Proceedings of the European Conference on Parallel Processing (Euro-Par’17). 247--259.

[12]

Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. Ompss: A proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21, 02 (2011), 173--193.

[13]

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language, In Proceedings of the ACM Special Interest Group on Programming Languages (SIGPLAN’98).

Digital Library

[14]

Akansha Jain and Calvin Lin. 2016. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 78--89.

Digital Library

[15]

Prabhat Jain, Srinivas Devadas, Daniel Engels, and Larry Rudolph. 2001. Software-assisted cache replacement mechanisms for embedded systems. In Proceedings of the 2001 IEEE/ACM International Conference on Computer-aided Design (ICCAD’01). IEEE Press, Piscataway, NJ, 119--126.

Digital Library

[16]

Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel S. Emer. 2010a. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’10). 151--162.

Digital Library

[17]

Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. 2010b. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). ACM, New York, NY, 60--71.

Digital Library

[18]

Jonas Jalminger and Per Stenström. 2003. A novel approach to cache block reuse predictions. In Proceedings of the International Conference on Parallel Processing. IEEE, 294--302.

[19]

Teresa L. Johnson, Daniel A. Connors, Matthew C. Merten, and Wen-mei W. Hwu. 1999. Run-time cache bypassing. IEEE Trans. Comput. 48, 12 (Dec. 1999), 1338--1354.

Digital Library

[20]

Martin Kampe, Per Stenstrom, and Michel Dubois. 2004. Self-correcting LRU replacement policies. In Proceedings of the 1st Conference on Computing Frontiers (CF’04). 181--191.

Digital Library

[21]

Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA’01). ACM, New York, NY, 240--251.

Digital Library

[22]

Georgios Keramidas, Pavlos Petoumenos, and Stefanos Kaxiras. 2007. Cache replacement based on reuse-distance prediction. In Proceedings of the 25th International Conference on Computer Design (ICCD’07). IEEE, 245--250.

[23]

Samira Manabi Khan, Yingying Tian, and Daniel A. Jimenez. 2010. Sampling dead block prediction for last-level caches. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’43). IEEE Computer Society, Washington, DC, 175--186.

Digital Library

[24]

Mazen Kharbutli and Yan Solihin. 2008. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 57, 4 (Apr. 2008), 433--447.

Digital Library

[25]

Costas Kyriacou, Paraskevas Evripidou, and Pedro Trancoso. 2004. Cacheflow: A short-term optimal cache management policy for data driven multithreading. In Proceedings of the European Conference on Parallel Processing (Euro-Par’17). Springer, 561--570.

[26]

An-Chow Lai, Cem Fide, and Babak Falsafi. 2001. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA’01). ACM, New York, NY, 144--154.

Digital Library

[27]

Haiming Liu, Michael Ferdman, Jaehyuk Huh, and Doug Burger. 2008. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’08). IEEE Computer Society, Washington, DC, 222--233.

Digital Library

[28]

Madhavan Manivannan, Anurag Negi, and Per Stenstrom. 2013. Efficient forwarding of producer-consumer data in task-based programs. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP’13). 517--522.

Digital Library

[29]

Madhavan Manivannan, Vassilis Papaefstathiou, Miquel Pericas, and Per Stenstrom. 2016. RADAR: Runtime-assisted dead region management for last-level caches. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 644--656.

[30]

Madhavan Manivannan and Per Stenstrom. 2014. Runtime-guided cache coherence optimizations in multi-core architectures. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 625--636.

Digital Library

[31]

Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07). IEEE Computer Society, Washington, DC, 3--14.

Digital Library

[32]

Ragavendra Natarajan and Mainak Chaudhuri. 2013. Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’13). 1--10.

[33]

OpenMP Architecture Review Board. 2005. OpenMP API Version 2.5.

[34]

OpenMP Architecture Review Board. 2013. OpenMP API Version 4.0.

[35]

Abhisek Pan and Vijay S. Pai. 2015. Runtime-driven shared last-level cache management for task-parallel programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). ACM, New York, NY, 11:1--11:12.

Digital Library

[36]

Vassilis Papaefstathiou, Manolis G. H. Katevenis, Dimitrios S. Nikolopoulos, and Dionisios Pnevmatikatos. 2013. Prefetching and cache management using task lifetimes. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM, 325--334.

Digital Library

[37]

J. M. Perez, R. M. Badia, and J. Labarta. 2008. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster’08). 142--151.

[38]

Miquel Pericàs, Kenjiro Taura, and Satoshi Matsuoka. 2014. Scalable analysis of multicore data reuse and sharing. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS’14). ACM, New York, NY, 353--362.

Digital Library

[39]

Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA’07). ACM, New York, NY, 381--391.

Digital Library

[40]

Jude A. Rivers, Edward S. Tam, Gary S. Tyson, Edward S. Davidson, and Matt Farrens. 1998. Utilizing reuse information in data cache management. In Proceedings of the 12th International Conference on Supercomputing (ICS’98). ACM, New York, NY, 449--456.

Digital Library

[41]

Jennifer B. Sartor, Wim Heirman, Stephen M. Blackburn, Lieven Eeckhout, and Kathryn S. McKinley. 2014. Cooperative cache scrubbing. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 15--26.

Digital Library

[42]

Jennifer B. Sartor, Subramaniam Venkiteswaran, Kathryn S. McKinley, and Zhenlin Wang. 2005. Cooperative caching with keep-me and evict-me. In Proceedings of the 9th Annual Workshop on Interaction Between Compilers and Computer Architectures (INTERACT’05). IEEE, 46--57.

Digital Library

[43]

Shyamkumar Thoziyoor, Jung Ho Ahn, Matteo Monchiero, Jay B. Brockman, and Norman P. Jouppi. 2008. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA’08). IEEE Computer Society, Washington, DC, 51--62.

Digital Library

[44]

Yingying Tian, Samira M. Khan, and Daniel A. Jiménez. 2013. Temporal-based multilevel correlating inclusive cache replacement. ACM Trans. Archit. Code Optim. 10, 4 (Dec. 2013).

Digital Library

[45]

Mateo Valero, Miquel Moreto, Marc Casas, Eduard Ayguade, and Jesus Labarta. 2014. Runtime-aware architectures: A first approach. Supercomput. Frontiers Innovat. 1, 1 (2014).

Digital Library

[46]

H. Vandierendonck, G. Tzenakis, and D. S. Nikolopoulos. 2011. A unified scheduler for recursive and task dataflow parallelism. In Proceedinsg of the International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 1--11.

Digital Library

[47]

Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems. 2002. Using the compiler to improve cache replacement decisions. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’02). IEEE Computer Society, Washington, DC, 199.

Digital Library

[48]

David A. Wood, Mark D. Hill, and R. E. Kessler. 1991. A model for estimating trace-sample miss ratios. In Proceedings of the ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’91). ACM, New York, NY, 79--89.

Digital Library

[49]

Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi, Simon C. Steely, Jr., and Joel Emer. 2011. SHiP: Signature-based hit predictor for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’11). ACM, New York, NY, 430--441.

Digital Library

[50]

Vinson Young, Chien Chen, Aamer Jaleel, and Moin Qureshi. 2017. SHiP++: Enhancing signature-based hit predictor for improved cache performance. In Proceedings of the Cache Replacement Championship (CRC’17) held in Conjunction with the International Symposium on Computer Architecture (ISCA’17).

[51]

Mohamed Zahran. 2007. Cache replacement policy revisited. In Proceedings of the Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD’07) held in conjunction with the International Symposium on Computer Architecture (ISCA’07).

[52]

Mohamed Zahran and Sally A. McKee. 2010. Global management of cache hierarchies. In Proceedings of the 7th Conference on Computing Frontiers.

Digital Library

Cited By

Das PRoy B(2020)A Categorical Study on Cache Replacement Policies for Hierarchical Cache MemoryApplications of Internet of Things10.1007/978-981-15-6198-6_19(201-211)Online publication date: 4-Aug-2020
https://doi.org/10.1007/978-981-15-6198-6_19

Index Terms

Global Dead-Block Management for Task-Parallel Programs
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems

Recommendations

The impact of cache inclusion policies on cache management techniques
MEMSYS '19: Proceedings of the International Symposium on Memory Systems

Caches were designed to mitigate the large disparity between processor and memory speeds. Many last-level cache (LLC) management techniques have been designed to further improve performance. A first observation is that most techniques in the literature ...
Dead-block prediction & dead-block correlating prefetchers
ISCA '01: Proceedings of the 28th annual international symposium on Computer architecture

Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an Ll data ...
Dead-block prediction & dead-block correlating prefetchers
Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)

Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an Ll data ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 15, Issue 3

September 2018

322 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3274266

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Copyright © 2018 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2018

Accepted: 01 May 2018

Revised: 01 May 2018

Received: 01 August 2017

Published in TACO Volume 15, Issue 3

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

European Research Council (ERC)
MECCA project
Swedish National Infrastructure for Computing (SNIC)

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
623
Total Downloads

Downloads (Last 12 months)108
Downloads (Last 6 weeks)20

Reflects downloads up to 10 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Das PRoy B(2020)A Categorical Study on Cache Replacement Policies for Hierarchical Cache MemoryApplications of Internet of Things10.1007/978-981-15-6198-6_19(201-211)Online publication date: 4-Aug-2020
https://doi.org/10.1007/978-981-15-6198-6_19

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents