Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Global Dead-Block Management for Task-Parallel Programs

Published: 04 September 2018 Publication History

Abstract

Task-parallel programs inefficiently utilize the cache hierarchy due to the presence of dead blocks in caches. Dead blocks may occupy cache space in multiple cache levels for a long time without providing any utility until they are finally evicted. Existing dead-block prediction schemes take decisions locally for each cache level and do not efficiently manage the entire cache hierarchy. This article introduces runtime-orchestrated global dead-block management, in which static and dynamic information about tasks available to the runtime system is used to effectively detect and manage dead blocks across the cache hierarchy. In the proposed global management schemes, static information (e.g., when tasks start/finish, and what data regions tasks produce/consume) is combined with dynamic information to detect when/where blocks become dead. When memory regions are deemed dead at some cache level(s), all the associated cache blocks are evicted from the corresponding level(s). We extend the cache controllers at both private and shared cache levels to use the aforementioned information to evict dead blocks. The article does an extensive evaluation of both inclusive and non-inclusive cache hierarchies and shows that the proposed global schemes outperform existing local dead-block management schemes.

References

[1]
Jaume Abella, Antonio González, Xavier Vera, and Michael F. P. O’Boyle. 2005. IATAC: A smart predictor to turn-off L2 cache lines. ACM Trans. Archit. Code Optim. 2, 1 (Mar. 2005), 55--77.
[2]
Lluc Alvarez, Miquel Moretó, Marc Casas, Emilio Castillo, Xavier Martorell, Jesús Labarta, Eduard Ayguadé, and Mateo Valero. 2015a. Runtime-guided management of hybrid memory hierarchies in multicore architectures. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation.
[3]
Lluc Alvarez, Lluís Vilanova, Miquel Moretó, Marc Casas, Marc González, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, and Mateo Valero. 2015b. Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 720--732.
[4]
Kristof Beyls and Erik H. D’Hollander. 2002. Reuse distance-based cache hint selection. In Proceedings of the European Conference on Parallel Processing (Euro-Par’02). Springer, 265--275.
[5]
Kristof Beyls and Erik H. D’Hollander. 2005. Generating cache hints for improved program efficiency. J. Syst. Architect. 51, 4 (2005), 223--250.
[6]
Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. 1999. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46, 2 (Mar. 1999), 281--321.
[7]
Ioana Burcea, Livio Soares, and Andreas Moshovos. 2012. Pointy: A hybrid pointer prefetcher for managed runtime systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT’12). ACM, New York, NY, 97--106.
[8]
Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 52.
[9]
Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas Subramoney, and Joseph Nuzman. 2012. Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’12). 293--304.
[10]
Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. 2007. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the Symposium on Parallel Algorithms and Architectures (SPAA’07).
[11]
Vladimir Dimić, Miquel Moretó, Marc Casas, and Mateo Valero. 2017. Runtime-assisted shared cache insertion policies based on re-reference intervals. In Proceedings of the European Conference on Parallel Processing (Euro-Par’17). 247--259.
[12]
Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. Ompss: A proposal for programming heterogeneous multi-core architectures. Parallel Process. Lett. 21, 02 (2011), 173--193.
[13]
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language, In Proceedings of the ACM Special Interest Group on Programming Languages (SIGPLAN’98).
[14]
Akansha Jain and Calvin Lin. 2016. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 78--89.
[15]
Prabhat Jain, Srinivas Devadas, Daniel Engels, and Larry Rudolph. 2001. Software-assisted cache replacement mechanisms for embedded systems. In Proceedings of the 2001 IEEE/ACM International Conference on Computer-aided Design (ICCAD’01). IEEE Press, Piscataway, NJ, 119--126.
[16]
Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel S. Emer. 2010a. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’10). 151--162.
[17]
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. 2010b. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). ACM, New York, NY, 60--71.
[18]
Jonas Jalminger and Per Stenström. 2003. A novel approach to cache block reuse predictions. In Proceedings of the International Conference on Parallel Processing. IEEE, 294--302.
[19]
Teresa L. Johnson, Daniel A. Connors, Matthew C. Merten, and Wen-mei W. Hwu. 1999. Run-time cache bypassing. IEEE Trans. Comput. 48, 12 (Dec. 1999), 1338--1354.
[20]
Martin Kampe, Per Stenstrom, and Michel Dubois. 2004. Self-correcting LRU replacement policies. In Proceedings of the 1st Conference on Computing Frontiers (CF’04). 181--191.
[21]
Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA’01). ACM, New York, NY, 240--251.
[22]
Georgios Keramidas, Pavlos Petoumenos, and Stefanos Kaxiras. 2007. Cache replacement based on reuse-distance prediction. In Proceedings of the 25th International Conference on Computer Design (ICCD’07). IEEE, 245--250.
[23]
Samira Manabi Khan, Yingying Tian, and Daniel A. Jimenez. 2010. Sampling dead block prediction for last-level caches. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’43). IEEE Computer Society, Washington, DC, 175--186.
[24]
Mazen Kharbutli and Yan Solihin. 2008. Counter-based cache replacement and bypassing algorithms. IEEE Trans. Comput. 57, 4 (Apr. 2008), 433--447.
[25]
Costas Kyriacou, Paraskevas Evripidou, and Pedro Trancoso. 2004. Cacheflow: A short-term optimal cache management policy for data driven multithreading. In Proceedings of the European Conference on Parallel Processing (Euro-Par’17). Springer, 561--570.
[26]
An-Chow Lai, Cem Fide, and Babak Falsafi. 2001. Dead-block prediction and dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA’01). ACM, New York, NY, 144--154.
[27]
Haiming Liu, Michael Ferdman, Jaehyuk Huh, and Doug Burger. 2008. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’08). IEEE Computer Society, Washington, DC, 222--233.
[28]
Madhavan Manivannan, Anurag Negi, and Per Stenstrom. 2013. Efficient forwarding of producer-consumer data in task-based programs. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP’13). 517--522.
[29]
Madhavan Manivannan, Vassilis Papaefstathiou, Miquel Pericas, and Per Stenstrom. 2016. RADAR: Runtime-assisted dead region management for last-level caches. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 644--656.
[30]
Madhavan Manivannan and Per Stenstrom. 2014. Runtime-guided cache coherence optimizations in multi-core architectures. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 625--636.
[31]
Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07). IEEE Computer Society, Washington, DC, 3--14.
[32]
Ragavendra Natarajan and Mainak Chaudhuri. 2013. Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’13). 1--10.
[33]
OpenMP Architecture Review Board. 2005. OpenMP API Version 2.5.
[34]
OpenMP Architecture Review Board. 2013. OpenMP API Version 4.0.
[35]
Abhisek Pan and Vijay S. Pai. 2015. Runtime-driven shared last-level cache management for task-parallel programs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). ACM, New York, NY, 11:1--11:12.
[36]
Vassilis Papaefstathiou, Manolis G. H. Katevenis, Dimitrios S. Nikolopoulos, and Dionisios Pnevmatikatos. 2013. Prefetching and cache management using task lifetimes. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM, 325--334.
[37]
J. M. Perez, R. M. Badia, and J. Labarta. 2008. A dependency-aware task-based programming environment for multi-core architectures. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster’08). 142--151.
[38]
Miquel Pericàs, Kenjiro Taura, and Satoshi Matsuoka. 2014. Scalable analysis of multicore data reuse and sharing. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS’14). ACM, New York, NY, 353--362.
[39]
Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA’07). ACM, New York, NY, 381--391.
[40]
Jude A. Rivers, Edward S. Tam, Gary S. Tyson, Edward S. Davidson, and Matt Farrens. 1998. Utilizing reuse information in data cache management. In Proceedings of the 12th International Conference on Supercomputing (ICS’98). ACM, New York, NY, 449--456.
[41]
Jennifer B. Sartor, Wim Heirman, Stephen M. Blackburn, Lieven Eeckhout, and Kathryn S. McKinley. 2014. Cooperative cache scrubbing. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 15--26.
[42]
Jennifer B. Sartor, Subramaniam Venkiteswaran, Kathryn S. McKinley, and Zhenlin Wang. 2005. Cooperative caching with keep-me and evict-me. In Proceedings of the 9th Annual Workshop on Interaction Between Compilers and Computer Architectures (INTERACT’05). IEEE, 46--57.
[43]
Shyamkumar Thoziyoor, Jung Ho Ahn, Matteo Monchiero, Jay B. Brockman, and Norman P. Jouppi. 2008. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies. In Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA’08). IEEE Computer Society, Washington, DC, 51--62.
[44]
Yingying Tian, Samira M. Khan, and Daniel A. Jiménez. 2013. Temporal-based multilevel correlating inclusive cache replacement. ACM Trans. Archit. Code Optim. 10, 4 (Dec. 2013).
[45]
Mateo Valero, Miquel Moreto, Marc Casas, Eduard Ayguade, and Jesus Labarta. 2014. Runtime-aware architectures: A first approach. Supercomput. Frontiers Innovat. 1, 1 (2014).
[46]
H. Vandierendonck, G. Tzenakis, and D. S. Nikolopoulos. 2011. A unified scheduler for recursive and task dataflow parallelism. In Proceedinsg of the International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 1--11.
[47]
Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems. 2002. Using the compiler to improve cache replacement decisions. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’02). IEEE Computer Society, Washington, DC, 199.
[48]
David A. Wood, Mark D. Hill, and R. E. Kessler. 1991. A model for estimating trace-sample miss ratios. In Proceedings of the ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’91). ACM, New York, NY, 79--89.
[49]
Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi, Simon C. Steely, Jr., and Joel Emer. 2011. SHiP: Signature-based hit predictor for high performance caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’11). ACM, New York, NY, 430--441.
[50]
Vinson Young, Chien Chen, Aamer Jaleel, and Moin Qureshi. 2017. SHiP++: Enhancing signature-based hit predictor for improved cache performance. In Proceedings of the Cache Replacement Championship (CRC’17) held in Conjunction with the International Symposium on Computer Architecture (ISCA’17).
[51]
Mohamed Zahran. 2007. Cache replacement policy revisited. In Proceedings of the Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD’07) held in conjunction with the International Symposium on Computer Architecture (ISCA’07).
[52]
Mohamed Zahran and Sally A. McKee. 2010. Global management of cache hierarchies. In Proceedings of the 7th Conference on Computing Frontiers.

Cited By

View all
  • (2020)A Categorical Study on Cache Replacement Policies for Hierarchical Cache MemoryApplications of Internet of Things10.1007/978-981-15-6198-6_19(201-211)Online publication date: 4-Aug-2020

Index Terms

  1. Global Dead-Block Management for Task-Parallel Programs

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 3
    September 2018
    322 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3274266
    Issue’s Table of Contents
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 September 2018
    Accepted: 01 May 2018
    Revised: 01 May 2018
    Received: 01 August 2017
    Published in TACO Volume 15, Issue 3

    Check for updates

    Author Tags

    1. Multi-core architecture
    2. cache hierarchy
    3. dead blocks
    4. runtime system
    5. task parallelism

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • European Research Council (ERC)
    • MECCA project
    • Swedish National Infrastructure for Computing (SNIC)

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)108
    • Downloads (Last 6 weeks)20
    Reflects downloads up to 10 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2020)A Categorical Study on Cache Replacement Policies for Hierarchical Cache MemoryApplications of Internet of Things10.1007/978-981-15-6198-6_19(201-211)Online publication date: 4-Aug-2020

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media