Cluster Cache Monitor: Leveraging the Proximity Data in CMP

Published: 01 December 2015

Abstract

As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity. However, they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and potential congestion at certain nodes. One of the main causes of long L1 miss latencies is the access to the home node of the directory. We have observed, however, that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighboring node, in which case the long-distance access to the home node can potentially be avoided. To leverage this property, we organize the multi-core into clusters of $2\times 2$ nodes and introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches; two cluster-related states are added to the coherence protocol in order to avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce execution time by 15% and energy by 14%, while saving 28% of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC, and RNUCA.
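
The distance argument in the abstract can be illustrated with a small hop-count sketch. It is not the paper's model or evaluation: it assumes the 64 nodes form an 8x8 mesh with Manhattan (XY) routing and that home nodes are spread uniformly over all nodes, neither of which the abstract states. All function and constant names are hypothetical.

```python
from itertools import product

MESH_DIM = 8      # assumption: 64 nodes laid out as an 8x8 mesh
CLUSTER_DIM = 2   # 2x2 clusters, as described in the abstract

NODES = list(product(range(MESH_DIM), repeat=2))


def hops(a, b):
    """Manhattan distance between two (x, y) nodes (assumes XY routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cluster_peers(node):
    """Return the other nodes in the same 2x2 cluster as `node`."""
    base_x = (node[0] // CLUSTER_DIM) * CLUSTER_DIM
    base_y = (node[1] // CLUSTER_DIM) * CLUSTER_DIM
    members = [(base_x + dx, base_y + dy)
               for dx, dy in product(range(CLUSTER_DIM), repeat=2)]
    return [m for m in members if m != node]


def mean_hops_to_home():
    """Average hops to a home node chosen uniformly among all nodes
    (assumption: uniform address interleaving across the chip)."""
    total = sum(hops(a, b) for a in NODES for b in NODES)
    return total / (len(NODES) ** 2)


def mean_hops_to_cluster_peer():
    """Average hops from a node to another L1 in its own cluster."""
    distances = [hops(n, p) for n in NODES for p in cluster_peers(n)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    print(f"mean hops to home node:    {mean_hops_to_home():.2f}")
    print(f"mean hops to cluster peer: {mean_hops_to_cluster_peer():.2f}")
```

Under these assumptions the average trip to a home node is 5.25 hops, while a cluster peer is about 1.33 hops away and never more than 2. The sketch counts hops only; it says nothing about how often a miss actually hits in a cluster L1, about congestion, or about the cost of the CCM lookup itself.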

Published In

International Journal of Parallel Programming, Volume 43, Issue 6
December 2015
283 pages

Publisher

Kluwer Academic Publishers

United States
