Abstract
A shared last-level cache is crucial to performance. However, the multithreaded programming model incurs serious contention in the shared cache. In this paper, we propose two schemes to reduce average cache access latency. The first is an implicitly dynamic cache partitioning scheme, called block agglutinating, whose purpose is to isolate conflicting data blocks. The second is a novel hardware buffer, called the thread owned block cache (TOB Cache), whose purpose is to store conflicting data blocks. We perform an extensive analysis of the proposed schemes with SPLASH-2 benchmarks and bioinformatics workloads using a cycle-accurate many-core simulator. Experimental results show that the proposed schemes reduce the conflict miss rate of the shared cache by 40% compared with a traditional shared cache. Compared with a victim cache, the average load latency of the shared cache and the primary data cache is reduced by about 26% and 12%, respectively; primary data cache miss penalties are reduced by about 14%, and IPC is improved by 17%.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, F., Liu, Z., Fan, D., Zhang, H., Yu, L., Tang, S. (2010). Thread Owned Block Cache: Managing Latency in Many-Core Architecture. In: D’Ambra, P., Guarracino, M., Talia, D. (eds) Euro-Par 2010 - Parallel Processing. Euro-Par 2010. Lecture Notes in Computer Science, vol 6271. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15277-1_28
DOI: https://doi.org/10.1007/978-3-642-15277-1_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15276-4
Online ISBN: 978-3-642-15277-1
eBook Packages: Computer Science, Computer Science (R0)