research-article
Open access

LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies

Published: 15 November 2016

Abstract

The trend of increasing the number of cores to achieve higher performance has made efficient management of on-chip data a challenge. Moreover, many emerging applications process massive amounts of data with varying degrees of locality. Therefore, exploiting locality to reduce on-chip traffic and improve resource utilization is of fundamental importance. Conventional multicore cache management schemes manage either the private cache (L1) or the Last-Level Cache (LLC), while ignoring the other. We propose a holistic locality-aware cache hierarchy management protocol for large-scale multicores. The proposed scheme improves on-chip data access latency and energy consumption by intelligently bypassing cache line replication in the L1 caches, and/or intelligently replicating cache lines in the LLC. The approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at both the L1 cache and the LLC. The decision to bypass the L1 and/or replicate in the LLC is then based on the measured reuse at the fine granularity of cache lines. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. Moreover, the complexity of the protocol is low since no additional coherence states are created. However, the proposed classifier incurs a 5.6 KB per-core storage overhead. On a set of parallel benchmarks, the locality-aware protocol reduces average energy consumption by 26% and completion time by 16% when compared to the state-of-the-art Reactive-NUCA multicore cache management scheme.
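To make the idea of reuse-based decisions at cache-line granularity concrete, the following is a minimal software sketch, not the hardware classifier described in the article. It assumes a simple per-line access counter compared against a threshold. The names `ReuseClassifier` and `plan_access`, the two threshold values, and the 64-byte line size are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict

# Hypothetical values: the abstract does not state the thresholds or line size.
L1_REUSE_THRESHOLD = 4   # accesses needed before a line is allocated in the L1
LLC_REUSE_THRESHOLD = 3  # accesses needed before a line is replicated in the LLC
LINE_SIZE = 64           # bytes per cache line (assumed)


class ReuseClassifier:
    """Counts accesses per cache line and flags lines whose reuse meets a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.reuse = defaultdict(int)  # line number -> accesses observed so far

    def record_access(self, byte_addr):
        # All bytes within the same line share one counter.
        self.reuse[byte_addr // LINE_SIZE] += 1

    def has_high_reuse(self, byte_addr):
        return self.reuse.get(byte_addr // LINE_SIZE, 0) >= self.threshold


def plan_access(l1_classifier, llc_classifier, byte_addr):
    """Record one access and return (allocate_in_l1, replicate_in_llc).

    Simplification: both classifiers observe every access, which real
    hardware would not do (L1 hits are normally invisible to the LLC).
    """
    l1_classifier.record_access(byte_addr)
    llc_classifier.record_access(byte_addr)
    return (l1_classifier.has_high_reuse(byte_addr),
            llc_classifier.has_high_reuse(byte_addr))


if __name__ == "__main__":
    l1 = ReuseClassifier(L1_REUSE_THRESHOLD)
    llc = ReuseClassifier(LLC_REUSE_THRESHOLD)
    for i in range(4):
        print(i + 1, plan_access(l1, llc, 0x1000))
```

With these assumed thresholds, a line is first served remotely with no replica, then becomes eligible for LLC replication, and only after further reuse is it allocated in the L1. A low-reuse line never crosses either threshold, so in this sketch it would bypass the L1 and stay unreplicated.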

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 4
    December 2016
    648 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3012405
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 November 2016
    Accepted: 01 August 2016
    Revised: 01 June 2016
    Received: 01 December 2015
    Published in TACO Volume 13, Issue 4

    Author Tags

    1. Multicore
    2. cache
    3. locality

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 73-85. DOI: 10.1145/3588195.3592987. Online publication date: 7-Aug-2023
    • (2021) A perceptron-based replication scheme for managing the shared last level cache. Microprocessors & Microsystems 85:C. DOI: 10.1016/j.micpro.2021.104310. Online publication date: 1-Sep-2021
    • (2020) DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 578-589. DOI: 10.1109/IPDPS47924.2020.00066. Online publication date: May-2020
    • (2019) VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 911-920. DOI: 10.1109/IPDPS.2019.00099. Online publication date: May-2019
    • (2019) An exposition on the applications of Locality Aware Scheduling algorithms. 2019 International Conference on Innovative Computing (ICIC), 1-6. DOI: 10.1109/ICIC48496.2019.8966718. Online publication date: Nov-2019
    • (2019) A Reuse-Degree Based Locality Classifier for Locality-Aware Data Replication. IEEE Access 7, 182207-182216. DOI: 10.1109/ACCESS.2019.2959840. Online publication date: 2019
