Article

Owner prediction for accelerating cache-to-cache transfer misses in a cc-NUMA architecture

Authors:

Manuel E. Acacio,

José González,

José M. García,

José Duato

SC '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing

Pages 1 - 12

Published: 16 November 2002

Abstract

Cache misses for which data must be obtained from a remote cache (cache-to-cache transfer misses) account for a significant fraction of the total miss rate. Unfortunately, cc-NUMA designs place the directory access on the critical path of these 3-hop misses, which significantly penalizes them compared to SMP designs. This work studies owner prediction as a means of giving cc-NUMA multiprocessors more efficient support for cache-to-cache transfer misses. Our proposal comprises an effective prediction scheme and a coherence protocol designed to support the use of prediction. Results indicate that owner prediction can significantly reduce the latency of cache-to-cache transfer misses, which translates into application speed-ups of up to 12%. To also accelerate most of the 3-hop misses that are either not predicted or mispredicted, we evaluate the inclusion of a small, fast directory cache in every node, which improves final performance by up to 16%.
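
To make the idea concrete, here is a minimal sketch; it is not the prediction scheme or protocol evaluated in the paper, whose details the abstract does not give. It assumes a hypothetical per-node table (`LastOwnerPredictor`) that remembers the node last seen owning each line. It also assumes that a correct prediction lets the requester contact the owner directly (2 hops) rather than going through the home directory (3 hops). The 4-hop misprediction cost is purely an illustrative assumption.

```python
# Illustrative sketch only: a hypothetical "last owner" predictor,
# not the design proposed in the paper.
from typing import Dict, Optional

HOPS_VIA_DIRECTORY = 3  # requester -> home directory -> owner -> requester
HOPS_PREDICTED_HIT = 2  # requester -> owner -> requester (assumed)
HOPS_MISPREDICTED = 4   # assumed: wrongly predicted node forwards to the home


class LastOwnerPredictor:
    """Remembers, per cache line, the node last observed as owner (assumed policy)."""

    def __init__(self) -> None:
        self._last_owner: Dict[int, int] = {}

    def predict(self, line_addr: int) -> Optional[int]:
        # None means "no prediction": fall back to the directory.
        return self._last_owner.get(line_addr)

    def update(self, line_addr: int, owner: int) -> None:
        self._last_owner[line_addr] = owner


def miss_hops(predictor: LastOwnerPredictor, line_addr: int, actual_owner: int) -> int:
    """Return the hop count for one cache-to-cache transfer miss, then train."""
    guess = predictor.predict(line_addr)
    if guess is None:
        hops = HOPS_VIA_DIRECTORY
    elif guess == actual_owner:
        hops = HOPS_PREDICTED_HIT
    else:
        hops = HOPS_MISPREDICTED
    predictor.update(line_addr, actual_owner)
    return hops
```

Under these assumptions, the first miss to a line pays the full directory path. Later misses are shorter only while the owner stays the same, which is why the accuracy of the predictor matters.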

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information & Contributors

Information

Published In

SC '02: Proceedings of the 2002 ACM/IEEE conference on Supercomputing

November 2002

952 pages

ISBN: 076951524X

Conference Chair: Roscoe Giles (Boston University)
Program Chair: Daniel Reed (NCSA)
Publications Chair: Kathryn Kelley (Ohio Supercomputer Center)

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

IEEE Computer Society Press

Washington, DC, United States

Qualifiers

Article

Conference

SC '02

Sponsor:

SIGARCH
IEEE-CS

SC '02: International Conference for High Performance Computing, Networking, Storage and Analysis

16 November 2002

Baltimore, Maryland

Acceptance Rates

SC '02 Paper Acceptance Rate 67 of 230 submissions, 29%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
