The locality-aware adaptive cache coherence protocol

Authors:

Srinivas Devadas

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3

Pages 523 - 534

https://doi.org/10.1145/2508148.2485967

Published: 23 June 2013

Abstract

Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on in-hardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our low-overhead locality-aware mechanisms reduce the overall energy by 25% and completion time by 15% in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol.
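The abstract describes per-cache-line locality profiling that permits private caching only for lines with high locality, but it does not give the mechanism's details. The following is a minimal Python sketch of one way such a classifier could behave, not the authors' implementation. The names `LocalityClassifier`, `LineProfile`, the `"private"`/`"remote"` mode labels, the threshold value of 4, and the choice to reclassify on eviction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical value: minimum number of accesses for a line to earn private caching.
PRIVATE_CACHING_THRESHOLD = 4


@dataclass
class LineProfile:
    """Per-cache-line locality state (hypothetical layout)."""
    mode: str = "private"   # assumed initial mode; either "private" or "remote"
    utilization: int = 0    # accesses counted since the last reset


@dataclass
class LocalityClassifier:
    """Toy model: one core's view of which lines it may cache privately."""
    threshold: int = PRIVATE_CACHING_THRESHOLD
    lines: dict = field(default_factory=dict)

    def access(self, addr: int) -> str:
        """Count one access and return the mode that served it."""
        profile = self.lines.setdefault(addr, LineProfile())
        served_as = profile.mode
        profile.utilization += 1
        # A remotely accessed line that shows enough reuse is promoted,
        # so later accesses are served from the private cache.
        if profile.mode == "remote" and profile.utilization >= self.threshold:
            profile.mode = "private"
            profile.utilization = 0
        return served_as

    def evict(self, addr: int) -> None:
        """On eviction of a private copy, demote the line if it saw little reuse."""
        profile = self.lines[addr]
        if profile.mode == "private" and profile.utilization < self.threshold:
            profile.mode = "remote"
        profile.utilization = 0


if __name__ == "__main__":
    classifier = LocalityClassifier()
    classifier.access(0x40)
    classifier.evict(0x40)  # one access only: demoted to remote access
    print(classifier.lines[0x40].mode)
```

In this sketch a line touched only once before eviction is demoted, so later accesses would go to the shared cache instead of occupying private cache space, while a line reused at least `threshold` times keeps its private status. A hardware design would hold the counter and mode bits alongside the cache tags or directory entries; the dictionary here only stands in for that state.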

Cited By

Holtryd N, Manivannan M, Stenstrom P, Pericas M (2020). DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 578-589. Online publication date: May 2020.
https://doi.org/10.1109/IPDPS47924.2020.00066
Liu Y, Sun X (2019). LPM: A Systematic Methodology for Concurrent Data Access Pattern Optimization from a Matching Perspective. IEEE Transactions on Parallel and Distributed Systems 30(11), pp. 2478-2493. Online publication date: 1 Nov 2019.
https://dl.acm.org/doi/10.1109/TPDS.2019.2912573
Cabrera A, Chamberlain R, Beard J (2019). Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data. 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-8. Online publication date: Sep 2019.
https://doi.org/10.1109/HPEC.2019.8916398

Index Terms

The locality-aware adaptive cache coherence protocol
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 41, Issue 3

ISCA '13

June 2013

666 pages

ISSN:0163-5964

DOI:10.1145/2508148

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013
686 pages
ISBN:9781450320795
DOI:10.1145/2485922
General Chair:
Avi Mendelson
Technion

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 31
Total Downloads: 939
Downloads (Last 12 months): 33
Downloads (Last 6 weeks): 4

Reflects downloads up to 03 Mar 2025

Citations

Sembrant A, Hagersten E, Black-Schaffer D (2016). Data placement across the cache hierarchy: Minimizing data movement with reuse-aware placement. 2016 IEEE 34th International Conference on Computer Design (ICCD), pp. 117-124. Online publication date: Oct 2016.
https://doi.org/10.1109/ICCD.2016.7753269
Liu Y, Sun X (2015). LPM. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 879-888. Online publication date: 1 Sep 2015.
https://dl.acm.org/doi/10.1109/ICPP.2015.97
Joshi A, Ramasubramanian N (2015). Comparison of significant issues in multicore cache coherence. Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), pp. 108-112. Online publication date: 8 Oct 2015.
https://dl.acm.org/doi/10.1109/ICGCIoT.2015.7380439
Zhang A, Goens A, Oswald N, Grosser T, Sorin D, Nagarajan V (2024). PipeGen: Automated Transformation of a Single-Core Pipeline into a Multicore Pipeline for a Given Memory Consistency Model. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 1-13. Online publication date: 14 Oct 2024.
https://dl.acm.org/doi/10.1145/3656019.3676889
Alsop J, Na W, Sinclair M, Grayson S, Adve S (2022). A Case for Fine-grain Coherence Specialization in Heterogeneous Systems. ACM Transactions on Architecture and Code Optimization 19(3), pp. 1-26. Online publication date: 22 Aug 2022.
https://dl.acm.org/doi/10.1145/3530819
Li C, Jiang F, Chen S, Zhang J, Liu Y, Fu Y, Xu J, Mitra T, Young E, Xiong J (2022). Accelerating Cache Coherence in Manycore Processor through Silicon Photonic Chiplet. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9. Online publication date: 30 Oct 2022.
https://dl.acm.org/doi/10.1145/3508352.3549338
Ibrahim M, Kayiran O, Eckert Y, Loh G, Jog A, Sarkar V, Kim H (2020). Analyzing and Leveraging Shared L1 Caches in GPUs. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 161-173. Online publication date: 30 Sep 2020.
https://dl.acm.org/doi/10.1145/3410463.3414623
