
Lec15 Snoop Coherence


Lecture 15: Snoopy Coherence Protocols

Vassilis Papaefstathiou
Iakovos Mavroidis

Computer Science Department
University of Crete
Where are We Now?
[Figure: two computers side by side, each with a Processor (Control and Datapath), Memory, Input and Output.]
❑ Multiprocessor – multiple processors with a single shared address space
❑ Cluster – multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system
Multiprocessor Basics
❑ Q1 – How do they share data?
❑ Q2 – How do they coordinate?
❑ Q3 – How scalable is the architecture? How many processors?

                                              # of Proc
Communication model   Message passing         8 to 2048
                      Shared address   NUMA   8 to 256
                                       UMA    2 to 64
Physical connection   Network                 8 to 256
                      Bus                     2 to 36
Single Bus (Shared Address UMA) Multi’s
[Figure: Proc1 to Proc4, each with its own Caches, attached to a Single Bus shared with Memory and I/O.]
❑ Caches are used to reduce latency and to lower bus traffic
● Write-back caches are used to keep bus traffic at a minimum
❑ Must provide hardware to ensure that caches and memory are consistent (cache coherency)
❑ Must provide a hardware mechanism to support process synchronization
Multiprocessor Cache Coherency
❑ Cache coherency protocols
● Bus snooping – cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don’t interfere with the processor’s access to the cache)
[Figure: Proc1, Proc2, ..., ProcN, each with a Snoop unit and a DCache, attached to a Single Bus shared with Memory and I/O.]
Bus Snooping Protocols
❑ Multiple copies are not a problem when reading
❑ A processor must have exclusive access to write a word
● What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first, and that processor is the first to get exclusive access. The second processor then gets exclusive access. Thus, bus arbitration forces sequential behavior.
● This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in program order and the accesses of different processors were arbitrarily interleaved.
❑ All other processors sharing that data must be informed of writes
Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled in two ways:
1. Write-update (write-broadcast) – the writing processor broadcasts the new data over the bus and all copies are updated
● All writes go to the bus → higher bus traffic
● Since new values appear in caches sooner, can reduce latency
2. Write-invalidate – the writing processor issues an invalidation signal on the bus; the cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
● Uses the bus only on the first write → lower bus traffic, so better use of bus bandwidth
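The traffic trade-off between the two policies can be sketched by counting bus transactions for a trace of accesses to a single block. This is a minimal Python sketch under simplifying assumptions that are not part of the lecture: one block, every cache starts with a valid copy, and each broadcast, invalidate or refetch costs exactly one bus transaction. The function name `count_bus_transactions` is illustrative.

```python
def count_bus_transactions(trace, policy, num_procs=2):
    """Count bus transactions for one block under a simplified model.

    trace  : list of (proc, op) pairs, where op is "R" or "W"
    policy : "update" or "invalidate"
    Assumes every cache starts with a valid shared copy of the block.
    """
    valid = [True] * num_procs   # does each cache hold a valid copy?
    exclusive = None             # processor holding the only valid copy, if any
    bus = 0
    for proc, op in trace:
        if policy == "update":
            if op == "W":
                bus += 1         # every write broadcasts the new data
        else:                    # write-invalidate
            if op == "W":
                if exclusive != proc:
                    bus += 1     # invalidate goes on the bus
                    valid = [p == proc for p in range(num_procs)]
                    exclusive = proc
            elif not valid[proc]:
                bus += 1         # read miss: the block must be fetched again
                valid[proc] = True
                exclusive = None
    return bus


burst = [(0, "W")] * 10                  # one processor writes repeatedly
ping_pong = [(0, "W"), (1, "R")] * 5     # a reader follows every write
for name, trace in (("burst", burst), ("ping_pong", ping_pong)):
    print(name,
          count_bus_transactions(trace, "update"),
          count_bus_transactions(trace, "invalidate"))
```

In this model a burst of writes by one processor favors write-invalidate, while a reader that follows every write makes write-invalidate pay for both the invalidate and the refetch.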
A Write-Invalidate CC Protocol
[State diagram: states Invalid, Shared (clean) and Modified (dirty). Transition labels: read (miss); read (hit or miss); write (miss); read (hit) or write (hit or miss). The write-back caching protocol is drawn in black.]
A Write-Invalidate CC Protocol
[State diagram: states Invalid, Shared (clean) and Modified (dirty). Processor-side labels: read (miss); read (hit or miss); write (miss); read (hit) or write (hit); send invalidate. Bus-side labels: receives invalidate (write by another processor to this block); write (miss) by another processor to this block. Legend: write-back caching protocol in black; signals from the processor (coherence additions) in red; signals from the bus (coherence additions) in blue.]
Write-Invalidate CC Examples
● I = invalid (many), S = shared (many), M = modified (only one)
[Figure: four two-processor panels, each with Main Mem holding A. Top row: Proc 1 holds A in state S, Proc 2 holds A in state I. Bottom row: Proc 1 holds A in state M, Proc 2 holds A in state I.]
Write-Invalidate CC Examples
● I = invalid (many), S = shared (many), M = modified (only one)

Proc 1 holds A in S, Proc 2 holds A in I; Proc 2 reads A:
1. Proc 2: read miss for A
2. Bus: read request for A
3. Proc 1: snoop sees read request for A and lets MM supply A
4. Proc 2: gets A from MM and changes its state to S

Proc 1 holds A in S, Proc 2 holds A in I; Proc 2 writes A:
1. Proc 2: write miss for A
2. Bus: P2 sends invalidate for A
3. Proc 1: changes A state to I
4. Proc 2: writes A and changes its state to M

Proc 1 holds A in M, Proc 2 holds A in I; Proc 2 reads A:
1. Proc 2: read miss for A
2. Bus: read request for A
3. Proc 1: snoop sees read request for A, writes back A to MM and changes its state to S
4. Proc 2: gets A from MM and changes its state to S

Proc 1 holds A in M, Proc 2 holds A in I; Proc 2 writes A:
1. Proc 2: write miss for A
2. Bus: P2 sends invalidate for A
3. Proc 1: changes A state to I
4. Proc 2: writes A and changes its state to M
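The four example scenarios can be replayed with a toy model of the three-state protocol. This is a minimal Python sketch, not the lecture's implementation: the class name `SnoopyBus`, its methods and the log strings are assumptions made for illustration, and it tracks only the state of a single block A per processor. It follows the slide in simply invalidating a remote M copy on a write; a real protocol would also have to preserve the dirty data.

```python
class SnoopyBus:
    """Toy system tracking the state of one block A in each processor's cache.

    States: "I" (invalid), "S" (shared, clean), "M" (modified, dirty).
    """

    def __init__(self, states):
        self.states = list(states)   # per-processor state of block A
        self.log = []                # bus events, in order

    def read(self, p):
        if self.states[p] != "I":
            return                   # read hit: no bus traffic, no state change
        self.log.append(f"P{p + 1} read request")
        for q, s in enumerate(self.states):
            if q != p and s == "M":
                self.log.append(f"P{q + 1} write-back")
                self.states[q] = "S"  # owner writes back, keeps a clean copy
        self.states[p] = "S"          # requester gets A from main memory

    def write(self, p):
        if self.states[p] == "M":
            return                   # write hit on the only (dirty) copy
        self.log.append(f"P{p + 1} invalidate")
        for q in range(len(self.states)):
            if q != p:
                self.states[q] = "I"  # every other copy is invalidated
        self.states[p] = "M"


# Replay the four panels: (initial states, operation performed by Proc 2).
for initial, op in (("SI", "read"), ("SI", "write"), ("MI", "read"), ("MI", "write")):
    bus = SnoopyBus(initial)
    getattr(bus, op)(1)              # index 1 is Proc 2
    print(initial, op, "->", "".join(bus.states), bus.log)
```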
SMP Data Miss Rates
❑ Shared data has lower spatial and temporal locality
● Shared data misses often dominate cache behavior even though they may only be 10% to 40% of the data accesses
[Figure: capacity miss rate and coherence miss rate for FFT and Ocean, plotted at 1, 2, 4, 8 and 16 on the horizontal axis, for a 64KB 2-way set associative data cache with 32B blocks. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach.]
Block Size Effects
❑ Writes to one word in a multi-word block mean
● either the full block is invalidated (write-invalidate)
● or the full block is exchanged between processors (write-update)
- Alternatively, could broadcast only the written word
❑ Multi-word blocks can also result in false sharing: when two processors are writing to two different variables in the same cache block
● With write-invalidate, false sharing increases cache miss rates
[Figure: a 4 word cache block holding variables A and B; Proc1 accesses A and Proc2 accesses B.]
❑ Compilers can help reduce false sharing by allocating highly correlated data to the same cache block
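The effect of false sharing under write-invalidate can be sketched by counting how often a write misses only because the other processor wrote a different word of the same block. This is a minimal Python sketch under assumed simplifications: two processors alternate writes, addresses are word addresses, and caches are large enough that no capacity misses occur. The function name `invalidation_misses` is illustrative.

```python
def invalidation_misses(addr_a, addr_b, block_words, rounds):
    """Proc 1 repeatedly writes word addr_a, Proc 2 repeatedly writes word addr_b.

    Returns the number of writes that miss because the other processor's
    write invalidated the block (first-touch misses are not counted).
    """
    owner = {}        # block number -> processor holding it in state M
    touched = set()   # (proc, block) pairs that were cached at some point
    misses = 0
    for _ in range(rounds):
        for proc, addr in ((1, addr_a), (2, addr_b)):
            block = addr // block_words
            if owner.get(block) != proc:
                if (proc, block) in touched:
                    misses += 1          # copy was invalidated by the other writer
                touched.add((proc, block))
                owner[block] = proc      # this write invalidates the other copy
    return misses


# A and B in the same 4 word block versus in different blocks.
print(invalidation_misses(0, 1, block_words=4, rounds=100))
print(invalidation_misses(0, 4, block_words=4, rounds=100))
```

With both variables in one block every write after the first round misses, whereas placing them in different blocks removes these misses entirely in this model.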
MESI Protocol (1)
❑ There are many variations on cache coherence protocols
❑ Another write-invalidate protocol, used in the Pentium 4 (and many other micro’s), is MESI, with four states:
● Modified – (same) only the modified cache copy is up-to-date; the memory copy and all other cache copies are out-of-date
● Exclusive – only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
- Since there is only one copy of the block, write hits don’t need to send an invalidate signal
● Shared – multiple copies of the shared data may be cached (i.e., data permitted to be cached with more than one processor); memory has an up-to-date copy
● Invalid – same
MESI Protocol (2)
❑ A cache line changes state as a function of memory access events.
❑ An event may be either
● Due to local processor activity (i.e. cache access)
● Due to bus activity, as a result of snooping
❑ A cache line has its own state affected only if the address matches
MESI Protocol (3)
❑ Operation can be described informally by looking at the action in the local processor
● Read Hit
● Read Miss
● Write Hit
● Write Miss
❑ More formally by a state transition diagram
MESI Local Read Hit
❑ Line must be in one of the states M, E or S
❑ This must be the correct local value (if M, it must have been modified locally)
❑ Simply return value
❑ No state change
MESI Local Read Miss (1)
❑ No other copy in caches
● Processor makes bus request to memory
● Value read to local cache, marked E
❑ One cache has E copy
● Processor makes bus request to memory
● Snooping cache puts copy value on the bus
● Memory access is abandoned
● Local processor caches value
● Both lines set to S
MESI Local Read Miss (2)
❑ Several caches have S copy
● Processor makes bus request to memory
● One cache puts copy value on the bus (arbitrated)
● Memory access is abandoned
● Local processor caches value
● Local copy set to S
● Other copies remain S
MESI Local Read Miss (3)
❑ One cache has M copy
● Processor makes bus request to memory
● Snooping cache puts copy value on the bus
● Memory access is abandoned
● Local processor caches value
● Local copy tagged S
● Source (M) value copied back to memory
● Source value M -> S
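The read-miss cases above can be condensed into one small function that takes the states held by the other caches and returns the outcome. This is a minimal Python sketch; the function name `mesi_read_miss` and the shape of its return value are assumptions made for illustration, not part of the lecture.

```python
def mesi_read_miss(others):
    """Resolve a local read miss for one line.

    others: list of "M", "E", "S" or "I", the state of the same line
            in each of the other caches.
    Returns (local_state, new_others, supplier, copied_back).
    """
    if all(s == "I" for s in others):
        # No other copy: memory supplies the value, local line becomes E.
        return "E", list(others), "memory", False
    # A snooping cache supplies the value and the memory access is abandoned.
    copied_back = "M" in others          # a dirty source also updates memory
    new_others = ["I" if s == "I" else "S" for s in others]
    return "S", new_others, "cache", copied_back


for others in (["I", "I"], ["E", "I"], ["S", "S"], ["M", "I"]):
    print(others, "->", mesi_read_miss(others))
```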
MESI Local Write Hit (1)
Line must be in one of the states M, E or S
❑ M
● Line is exclusive and already ‘dirty’
● Update local cache value
● No state change
❑ E
● Update local cache value
● State E -> M
MESI Local Write Hit (2)
❑ S
● Processor broadcasts an invalidate on bus
● Snooping processors with S copy change S -> I
● Local cache value is updated
● Local state change S -> M
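The three write-hit cases above differ only in whether an invalidate must go on the bus. This is a minimal Python sketch of that decision; the function name `mesi_write_hit` and its return format are hypothetical and chosen for illustration.

```python
def mesi_write_hit(local, others):
    """Apply a local write hit; returns (new_local, new_others, bus_event)."""
    if local not in ("M", "E", "S"):
        raise ValueError("a write hit requires the line to be in M, E or S")
    if local == "S":
        # Other S copies are invalidated before the local copy is modified.
        return "M", ["I" if s == "S" else s for s in others], "Invalidate"
    # M: already dirty and exclusive. E: silent change to M. No bus traffic.
    return "M", list(others), None


print(mesi_write_hit("E", ["I", "I"]))
print(mesi_write_hit("S", ["S", "I"]))
```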
MESI Local Write Miss (1)
Detailed action depends on copies in other processors
❑ No other copies
● Value read from memory to local cache (?)
● Value updated
● Local copy state set to M
MESI Local Write Miss (2)
❑ Other copies, either one in state E or more in state S
● Value read from memory to local cache; bus transaction marked RWITM (read with intent to modify)
● Snooping processors see this and set their copy state to I
● Local copy updated and state set to M
MESI Local Write Miss (3)
Another copy in state M
❑ Processor issues bus transaction marked RWITM
❑ Snooping processor sees this
● Blocks RWITM request
● Takes control of bus
● Writes back its copy to memory
● Sets its copy state to I
MESI Local Write Miss (4)
Another copy in state M (continued)
❑ Original local processor re-issues RWITM request
❑ Is now simple no-copy case
● Value read from memory to local cache
● Local copy value updated
● Local copy state set to M
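The write-miss cases above can be sketched as a function that returns the final states together with the sequence of bus events. This is a minimal Python sketch under the assumption that every write miss is issued as an RWITM transaction; the function name `mesi_write_miss` and the event strings are illustrative, not part of the lecture.

```python
def mesi_write_miss(others):
    """Resolve a local write miss; returns (local_state, new_others, bus_log).

    others: list of "M", "E", "S" or "I", the state of the same line
            in each of the other caches.
    """
    bus_log = ["RWITM"]
    if "M" in others:
        # The M holder blocks the request, copies its line back to memory and
        # invalidates it; the requester then re-issues RWITM (no-copy case).
        bus_log += ["copy back", "RWITM"]
    new_others = ["I"] * len(others)     # E, S and M copies all end up Invalid
    return "M", new_others, bus_log


for others in (["I"], ["S", "S"], ["M", "I"]):
    print(others, "->", mesi_write_miss(others))
```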
Putting it all together
❑ All of this information can be described compactly using a state transition diagram
❑ The diagram shows what happens to a cache line in a processor as a result of
● memory accesses made by that processor (read hit/miss, write hit/miss)
● memory accesses made by other processors that result in bus transactions observed by this snoopy cache (Mem Read, RWITM, Invalidate)
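MESI – locally initiated accesses
[State diagram, transitions for accesses by the local processor; where a bus transaction is issued, it follows the slash:]
● Invalid -> Shared: Read Miss (SH) / Mem Read
● Invalid -> Exclusive: Read Miss (EX) / Mem Read
● Invalid -> Modified: Write Miss / RWITM
● Shared -> Shared: Read Hit
● Shared -> Modified: Write Hit / Invalidate
● Exclusive -> Exclusive: Read Hit
● Exclusive -> Modified: Write Hit
● Modified -> Modified: Read Hit, Write Hit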
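MESI – remotely initiated accesses
[State diagram, transitions caused by snooped bus transactions:]
● Shared -> Shared: Mem Read
● Shared -> Invalid: Invalidate
● Exclusive -> Shared: Mem Read
● Exclusive -> Invalid: RWITM
● Modified -> Shared: Mem Read (copy back)
● Modified -> Invalid: RWITM (copy back)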
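The two diagrams can be written down as lookup tables, which makes it easy to check a sequence of events by hand. This is a minimal Python sketch; the table names `LOCAL` and `REMOTE`, the event strings and the helper `snoop` are assumptions made for illustration. The Shared entry for RWITM follows the write-miss description (snooping processors set their copy to I) rather than a label in the diagram.

```python
# (state, local event) -> (next state, bus transaction issued or None)
LOCAL = {
    ("I", "read miss, shared"):    ("S", "Mem Read"),
    ("I", "read miss, exclusive"): ("E", "Mem Read"),
    ("I", "write miss"):           ("M", "RWITM"),
    ("S", "read hit"):             ("S", None),
    ("S", "write hit"):            ("M", "Invalidate"),
    ("E", "read hit"):             ("E", None),
    ("E", "write hit"):            ("M", None),
    ("M", "read hit"):             ("M", None),
    ("M", "write hit"):            ("M", None),
}

# (state, snooped bus transaction) -> (next state, copy back to memory?)
REMOTE = {
    ("S", "Mem Read"):   ("S", False),
    ("S", "Invalidate"): ("I", False),
    ("S", "RWITM"):      ("I", False),   # from the write-miss text, see lead-in
    ("E", "Mem Read"):   ("S", False),
    ("E", "RWITM"):      ("I", False),
    ("M", "Mem Read"):   ("S", True),
    ("M", "RWITM"):      ("I", True),
}


def snoop(state, bus_event):
    """Next state of a line after a snooped transaction; unlisted pairs are no-ops."""
    return REMOTE.get((state, bus_event), (state, False))


# Cache 0 writes a line it does not hold; cache 1 holds it in M and snoops.
state0, bus_event = LOCAL[("I", "write miss")]
state1, copied_back = snoop("M", bus_event)
print(state0, bus_event, state1, copied_back)
```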
MESI notes
❑ There are minor variations (particularly to do with write miss)
❑ The normal ‘write back’ when a cache line is evicted is done if the line state is M
❑ Multi-level caches
● If caches are inclusive, only the lowest-level cache (the one closest to the bus) needs to snoop on the bus
