Thread Owned Block Cache: Managing Latency in Many-Core Architecture

Fenglong Song, Zhiyong Liu, Dongrui Fan, Hao Zhang, Lei Yu & Shibin Tang

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6271)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

The shared last-level cache is crucial to performance. However, the multi-threaded programming model incurs serious contention in the shared cache. To reduce average cache access latency, this paper proposes two schemes. The first is an implicitly dynamic cache partitioning scheme, called block agglutinating, whose purpose is to isolate conflicting data blocks. The second is a novel hardware buffer, called the thread owned block cache (TOB Cache), whose purpose is to store conflicting data blocks. The proposed schemes are analyzed extensively with SPLASH-2 benchmarks and bioinformatics workloads on a cycle-accurate many-core simulator. Experimental results show that the proposed schemes reduce the conflict miss rate of the shared cache by 40% compared with a traditional shared cache. Compared with a victim cache, the average load latency of the shared cache and of the primary data cache is reduced by about 26% and 12%, respectively; primary data cache miss penalties are reduced by about 14%; and IPC is improved by 17%.
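The abstract uses a victim cache as its comparison baseline. For readers unfamiliar with that baseline, the following is a minimal toy sketch of the general idea: a direct-mapped cache backed by a small fully associative buffer that holds recently evicted blocks. It is not the TOB Cache or the block agglutinating scheme, whose details are not given in the abstract. The class name, the parameters `num_sets` and `victim_entries`, and the LRU buffer policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped cache with a small fully associative victim buffer.

    Hypothetical model for illustration only; not the paper's design.
    """

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets   # one block address per set
        self.victim = OrderedDict()      # insertion order: oldest entry first
        self.victim_entries = victim_entries
        self.hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, block):
        idx = block % self.num_sets
        if self.lines[idx] == block:
            self.hits += 1
            return "hit"
        evicted = self.lines[idx]
        if block in self.victim:
            # The requested block moves back into the main array.
            del self.victim[block]
            self.victim_hits += 1
            outcome = "victim_hit"
        else:
            self.misses += 1
            outcome = "miss"
        self.lines[idx] = block
        # The displaced block (if any) is kept in the victim buffer.
        if evicted is not None and self.victim_entries > 0:
            self.victim[evicted] = True
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # drop the oldest entry
        return outcome


# Blocks 0 and 8 map to the same set in an 8-set cache, so they conflict.
trace = [0, 8, 0, 8, 0, 8]
plain = DirectMappedWithVictim(num_sets=8, victim_entries=0)
buffered = DirectMappedWithVictim(num_sets=8, victim_entries=2)
for b in trace:
    plain.access(b)
    buffered.access(b)
print(plain.misses, buffered.misses, buffered.victim_hits)  # 6 2 4
```

In this toy trace every access to the unbuffered cache is a conflict miss, while the buffered variant pays only the two cold misses and serves the rest from the side buffer. This is the kind of conflict traffic that a buffer for conflicting blocks is meant to absorb.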

Author information

Authors and Affiliations

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Fenglong Song, Zhiyong Liu, Dongrui Fan, Hao Zhang, Lei Yu & Shibin Tang

Editor information

Editors and Affiliations

ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy
Pasqua D’Ambra
ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy
Mario Guarracino
ICAR-CNR, Via P. Bucci 41c, 87036, Rende, Italy
Domenico Talia

About this paper

Cite this paper

Song, F., Liu, Z., Fan, D., Zhang, H., Yu, L., Tang, S. (2010). Thread Owned Block Cache: Managing Latency in Many-Core Architecture. In: D’Ambra, P., Guarracino, M., Talia, D. (eds) Euro-Par 2010 - Parallel Processing. Euro-Par 2010. Lecture Notes in Computer Science, vol 6271. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15277-1_28

DOI: https://doi.org/10.1007/978-3-642-15277-1_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15276-4
Online ISBN: 978-3-642-15277-1
eBook Packages: Computer Science, Computer Science (R0)
