Abstract
The large speed gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. In NUCA, the cache is partitioned into sub-banks, each an independently accessible entity, connected by a fast network-on-chip (NoC). This paper presents two NoC-assisted mechanisms to improve the performance and reduce the power consumption of NUCA coherence. The first mechanism provides priority-based communication, built on the wormhole routing architecture, to support NUCA coherence: high-priority coherence packets are transmitted first to save time. The second mechanism offers multicast communication on top of the proposed priority-based NoC to provide efficient cache coherence for NUCA. Coherence packets are dispatched and collected at collecting nodes (CN) to further decrease the number of coherence messages flowing in the NoC. Experimental results show that priority-based transmission improves performance by approximately 10%. The proposed multicasting mechanism further improves performance and decreases the power consumption of the NoC in NUCA by approximately 15%. Together, the two proposed mechanisms enhance performance by 25% on average.
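The idea behind the first mechanism, serving coherence packets ahead of other traffic, can be illustrated with a minimal sketch. This is not the router design evaluated in the paper: the two priority classes, the `PriorityArbiter` class, and the packet names are hypothetical, and arbitration is modeled per packet at a single output port, whereas a wormhole router would arbitrate per flit.

```python
import heapq
from itertools import count

# Hypothetical priority classes; the abstract does not specify the actual levels.
COHERENCE, DATA = 0, 1  # lower value = higher priority


class PriorityArbiter:
    """Toy output-port arbiter: grants the highest-priority waiting packet first."""

    def __init__(self):
        self._queue = []
        self._order = count()  # tie-breaker keeps FIFO order within one priority class

    def enqueue(self, priority, packet_id):
        heapq.heappush(self._queue, (priority, next(self._order), packet_id))

    def grant(self):
        """Return the next packet to transmit, or None if nothing is waiting."""
        if not self._queue:
            return None
        _, _, packet_id = heapq.heappop(self._queue)
        return packet_id


def drain(arbiter):
    """Grant packets until the arbiter is empty and return the service order."""
    served = []
    while (packet := arbiter.grant()) is not None:
        served.append(packet)
    return served


arbiter = PriorityArbiter()
arbiter.enqueue(DATA, "data-0")
arbiter.enqueue(COHERENCE, "inv-0")
arbiter.enqueue(DATA, "data-1")
arbiter.enqueue(COHERENCE, "ack-0")

service_order = drain(arbiter)
print(service_order)  # ['inv-0', 'ack-0', 'data-0', 'data-1']
```

Both coherence packets leave before either data packet even though they arrived later, while packets of the same class keep their arrival order.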
Cite this article
Chang, KC., Liao, IM. & Liao, CH. Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms. J Supercomput 62, 1318–1337 (2012). https://doi.org/10.1007/s11227-012-0793-7
DOI: https://doi.org/10.1007/s11227-012-0793-7