Cluster Cache Monitor: Leveraging the Proximity Data in CMP

Authors:

Dongsheng Wang

International Journal of Parallel Programming, Volume 43, Issue 6

Pages 1054 - 1077

https://doi.org/10.1007/s10766-014-0339-0

Published: 01 December 2015

Abstract

As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of the long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can potentially be avoided. We organize the multi-core into clusters of $2\times 2$ nodes, and in order to leverage the aforementioned property, we introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches, and two cluster-related states are added to the coherence protocol in order to avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce the execution time by 15% and the energy by 14%, while saving 28% of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC and RNUCA.
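To make the distance argument concrete, the following is a minimal sketch of how a 64-node chip could be partitioned into 2×2 clusters and how far a request travels inside a cluster compared with a trip to an arbitrary home node. The abstract specifies only the node count and the cluster size; the 8×8 mesh layout, row-major node numbering, Manhattan hop metric, uniformly distributed home nodes, and every function name below are assumptions made for illustration, not details taken from the paper.

```python
from itertools import product

MESH_DIM = 8      # assumption: the 64 nodes form an 8x8 mesh
CLUSTER_DIM = 2   # from the abstract: clusters of 2x2 nodes


def coords(node):
    """Return (row, col) of a node under assumed row-major numbering."""
    return divmod(node, MESH_DIM)


def cluster_id(node):
    """Index of the 2x2 cluster that contains the node."""
    row, col = coords(node)
    clusters_per_row = MESH_DIM // CLUSTER_DIM
    return (row // CLUSTER_DIM) * clusters_per_row + (col // CLUSTER_DIM)


def cluster_members(node):
    """All nodes (including the node itself) in the node's cluster."""
    row, col = coords(node)
    base_row = row - row % CLUSTER_DIM
    base_col = col - col % CLUSTER_DIM
    return [
        (base_row + dr) * MESH_DIM + (base_col + dc)
        for dr, dc in product(range(CLUSTER_DIM), repeat=2)
    ]


def hops(a, b):
    """Manhattan distance between two nodes (assumed routing metric)."""
    (ra, ca), (rb, cb) = coords(a), coords(b)
    return abs(ra - rb) + abs(ca - cb)


def max_hops_in_cluster(node):
    """Worst-case distance to another L1 cache in the same cluster."""
    return max(hops(node, m) for m in cluster_members(node))


def mean_hops_to_home(node):
    """Average distance to a home node, assuming homes are spread
    uniformly over all nodes (hypothetical address interleaving)."""
    total = MESH_DIM * MESH_DIM
    return sum(hops(node, n) for n in range(total)) / total
```

Under these assumptions, a hit in a cluster L1 cache is never more than two hops away, whereas the average trip to a home node is several hops even for a centrally placed node. This is only a geometric illustration of the proximity property; it does not model the CCM structure or the two added coherence states.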



Published In

International Journal of Parallel Programming Volume 43, Issue 6

December 2015

283 pages

ISSN: 0885-7458

Copyright © 2015 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States

Qualifiers

Article
