Lecture-7 SMP NUMA Cache Coherence
Characteristics of Symmetric Multiprocessor (SMP)
1. There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are interconnected by a
bus or other internal connection scheme, such that memory access time is approximately the
same for each processor.
3. All processors share access to I/O devices, either through the same channels or through
different channels that provide paths to the same device.
4. All processors can perform the same functions (hence the term symmetric).
Advantages of Symmetric Multiprocessor (SMP)
1. Performance: If the work to be done by a computer can be organized so that some portions of the
work can be done in parallel, then a system with multiple processors will yield greater performance
than one with a single processor of the same type.
2. Availability: In a symmetric multiprocessor, because all processors can perform the same functions,
the failure of a single processor does not halt the machine. Instead, the system can continue to
function at reduced performance.
3. Incremental growth: A user can enhance the performance of a system by adding an additional
processor.
4. Scaling: Vendors can offer a range of products with different price and performance characteristics
based on the number of processors configured in the system.
Disadvantages of Symmetric Multiprocessor (SMP)
1. Complexity: SMP systems are more complex than uniprocessor systems, and require additional
hardware and software to manage multiple processors and memory modules.
2. Higher cost: SMP systems are more expensive than uniprocessor systems, as they require additional
hardware and memory to support multiple processors.
3. Increased power consumption: SMP systems consume more power than uniprocessor systems, due
to the additional processors and memory modules.
4. Limited scalability: While SMP systems can be scaled up to a certain point, they may not be able to
handle extremely large workloads as efficiently as a distributed system.
5. Memory contention: In an SMP system, all processors share the same memory space, which can
lead to contention for memory resources and reduce system performance.
Tightly Coupled Multiprocessor
Figure: Tightly coupled multiprocessor — processors and I/O modules connected to main memory through an interconnection network.
❑ To facilitate DMA transfers from I/O subsystems to processors, the following features are
provided:
▪ Arbitration: Any I/O module can temporarily function as “master.” A mechanism is provided to
arbitrate competing requests for bus control, using some sort of priority scheme.
▪ Time-sharing: When one module is controlling the bus, other modules are locked out and must, if
necessary, suspend operation until bus access is achieved.
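As a small illustration of the priority scheme mentioned above, the C sketch below models a fixed-priority bus arbiter: each I/O module raises a request bit, and the lowest-numbered requester (an assumed priority order, not something specified in the lecture) is granted bus mastership while the others must wait.

```c
#include <stdint.h>

/* req is a bitmask of pending bus requests (bit i = module i).
 * Returns the index of the module granted the bus, or -1 if idle. */
int arbitrate(uint32_t req) {
    for (int i = 0; i < 32; i++)      /* scan in assumed priority order   */
        if (req & (1u << i))
            return i;                 /* grant to highest-priority module */
    return -1;                        /* no module is requesting the bus  */
}
```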
SMP Organization
Figure: SMP organization — processors, main memory, and I/O adapters (I/O subsystem) all attached to a shared bus.
None of the processors writes its updated value back to shared memory, so shared memory finally holds X = 10 while the local caches hold X = 9, 11, 12, and 13, giving an inconsistent view of memory. This is called the cache coherence problem.
Cache Coherence Problem
❑Multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed
to update their own copies freely, an inconsistent view of memory can result.
❑Cache coherence refers to the problem of keeping the data in these caches consistent.
❑The main problem is dealing with writes by a processor. There are two general strategies for dealing with writes:
1. Write-through - all data written to the cache is also written to memory at the same time.
2. Write-back - write operations are usually made only to the cache. The modified block is
written to memory only when the block is replaced. It is clear that a write-back policy can result in
inconsistency.
Cache Coherence Problem
• Write-through caches are simpler, and they automatically deal with the cache coherence problem, but
they increase bus traffic significantly.
• Write-back caches are more common where higher performance is desired. The MESI cache
coherence protocol is one of the simpler write-back protocols.
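A minimal C sketch contrasting the two write policies described above: write-through updates main memory on every write, while write-back defers the memory update until the dirty line is evicted. The names (cache_line_t, evict, the direct-mapped layout) are illustrative assumptions, not part of the lecture.

```c
#include <stdbool.h>
#include <stdio.h>

#define LINES 4

typedef struct {
    int  data;
    bool valid;
    bool dirty;   /* meaningful only for write-back */
} cache_line_t;

static int          memory[LINES];   /* stand-in for main memory */
static cache_line_t cache[LINES];    /* one cache line per word   */

/* Write-through: every write updates the cache AND main memory. */
void write_through(int idx, int value) {
    cache[idx].data  = value;
    cache[idx].valid = true;
    memory[idx]      = value;         /* extra bus traffic on every write */
}

/* Write-back: writes go only to the cache; memory is updated later. */
void write_back(int idx, int value) {
    cache[idx].data  = value;
    cache[idx].valid = true;
    cache[idx].dirty = true;          /* main memory is now stale */
}

/* Eviction is the only point at which a write-back line reaches memory. */
void evict(int idx) {
    if (cache[idx].valid && cache[idx].dirty)
        memory[idx] = cache[idx].data;
    cache[idx].valid = cache[idx].dirty = false;
}

int main(void) {
    write_through(0, 42);             /* memory[0] is 42 immediately       */
    write_back(1, 99);                /* memory[1] still holds the old 0   */
    printf("before evict: mem[1]=%d\n", memory[1]);
    evict(1);
    printf("after  evict: mem[1]=%d\n", memory[1]);
    return 0;
}
```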
The MESI Protocol
• To provide cache consistency on an SMP, the data cache often
supports a protocol known as MESI. For MESI, the data cache
includes two status bits per tag, so that each line can be in one
of four states:
• ■■ Modified: The line in the cache has been modified (different from main
memory) and is available only in this cache.
• ■■ Exclusive: The line in the cache is the same as that in main memory
and is not present in any other cache.
• ■■ Shared: The line in the cache is the same as that in main memory and
may be present in another cache.
• ■■ Invalid: The line in the cache does not contain valid data.
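As a rough illustration (not part of the lecture figure), the C sketch below encodes the four MESI states and one representative transition: a write hit by the local processor. The function name and the broadcast flag are assumptions; the bus-side (snooping) transitions are omitted.

```c
#include <stdbool.h>

/* The four MESI line states, encodable in two status bits per tag. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

/* New state of a line after the local processor writes to it (write hit). */
mesi_state_t on_local_write_hit(mesi_state_t s, bool *must_broadcast_invalidate) {
    *must_broadcast_invalidate = false;
    switch (s) {
    case SHARED:                            /* other caches may hold the line  */
        *must_broadcast_invalidate = true;  /* tell them to invalidate it      */
        return MODIFIED;
    case EXCLUSIVE:                         /* only copy: silently becomes dirty */
    case MODIFIED:                          /* already dirty and exclusive       */
        return MODIFIED;
    case INVALID:                           /* not a hit; handled as a write miss */
    default:
        return INVALID;
    }
}
```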
Figure: MESI state transition diagram — (a) line in cache at initiating processor; (b) line in snooping cache.
Solution to Cache Coherence Problem
❖Software Solution
▪ Software solutions rely on the compiler and operating system to deal with the problem at compile time rather than detecting it at run time.
▪ Prevention:
▪ The simplest approach is to prevent any shared data variables from being cached. This is too
conservative, because a shared data structure may be exclusively used during some periods and may
be effectively read-only during other periods.
▪ More efficient approaches analyze the code to determine safe periods for shared variables. The
compiler then inserts instructions into the generated code to enforce cache coherence during the
critical periods.
Solution to Cache Coherence Problem
❖Hardware Solution
▪ Hardware solutions provide dynamic recognition at run time of potential
inconsistency conditions. Because the problem is only dealt with when it
actually arises, there is more effective use of caches, leading to improved
performance over a software approach.
▪ Hardware schemes fall into two categories:
1. Directory protocols
2. Snoopy protocols
H/W Solution to Cache Coherence Problem
❖Directory Based Protocol
▪ Directory protocols collect and maintain information about where copies of lines
reside.
▪ Typically, there is a centralized controller that is part of the main memory controller, and a
directory that is stored in main memory.
Solution to Cache Coherence Problem
❖Directory Based Protocol
▪ Directory protocols collect and maintain information about copies of data in the caches.
▪ They are effective in large-scale systems with complex interconnection schemes.
▪ When an individual cache controller makes a request, the centralized controller checks and
issues the necessary commands for data transfer between memory and caches or between the caches
themselves.
▪ It is also responsible for keeping the state information up to date; therefore, every local action
that can affect the global state of a line must be reported to the central controller.
▪ The controller maintains information about which processors have a copy of which lines.
▪ Before a processor can write to a local copy of a line, it must request exclusive access to the
line from the controller.
▪ Before granting this exclusive access, the controller sends a message to all processors
with a cached copy of this line, forcing each processor to invalidate its copy.
▪ After receiving an acknowledgement back from each such processor, the controller grants
exclusive access to the requesting processor.
▪ When another processor tries to read a line that is exclusively granted to some other
processor, it will send a miss notification to the controller.
▪ The controller then issues a command to the processor holding that line, requiring that
processor to write the line back to main memory.
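A hypothetical C sketch of the write-request handling just described: the directory keeps a presence bit vector of sharers per line, and before granting exclusive access the controller invalidates every other cached copy and waits for acknowledgements. All names (dir_entry_t, send_invalidate, wait_for_ack) are illustrative stubs, not a real controller interface.

```c
#include <stdint.h>

#define MAX_PROCS 32

/* One directory entry per memory line. */
typedef struct {
    uint32_t sharers;   /* bit i set => processor i holds a copy          */
    int      owner;     /* processor with exclusive access, or -1 if none */
} dir_entry_t;

/* Stubs standing in for messages over the interconnect. */
static void send_invalidate(int p) { (void)p; /* controller -> processor p */ }
static void wait_for_ack(int p)    { (void)p; /* acknowledgement from p    */ }

/* Called by the centralized controller when processor p asks to write. */
void request_exclusive(dir_entry_t *e, int p) {
    /* Invalidate every other cached copy of the line and wait for acks. */
    for (int i = 0; i < MAX_PROCS; i++) {
        if (i != p && (e->sharers & (1u << i))) {
            send_invalidate(i);
            wait_for_ack(i);
            e->sharers &= ~(1u << i);
        }
    }
    /* Only now is exclusive access granted to the requester. */
    e->sharers = 1u << p;
    e->owner   = p;
}
```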
Solution to Cache Coherence Problem
❖Applications
▪ Directory protocols are used in scalable systems, such as the CC-NUMA (Cache-Coherent
Non-Uniform Memory Access) architecture.
Solution to Cache Coherence Problem
❖ Snoopy Bus Protocol
▪ Snoopy protocols distribute the responsibility for maintaining cache coherence among all of the cache controllers in a
multiprocessor system.
▪ A cache must recognize when a line that it holds is shared with other caches.
▪ When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast
mechanism.
▪ Each cache controller is able to “snoop” on the network to observe these broadcast notifications and react
accordingly.
▪ Snoopy protocols are ideally suited to a bus-based multiprocessor, because the shared bus provides a simple means
for broadcasting and snooping.
Solution to Cache Coherence Problem
❖ Snoopy Bus Protocol
1. Write invalidate
➢ When a local cache copy is modified, it invalidates all other remote copies in the other
caches (invalidated items are also called ‘dirty’).
2. Write update (write broadcast)
➢ When a local cache copy is modified, the modified value of the data object is broadcast
to all other caches at the time of modification.
Solution to Cache Coherence Problem
Figure: Write invalidate — P1, P2, and P3 each cache X; after P1 writes the new value X1, the copies in P2 and P3 are marked invalid (I). Write update — after P1 writes X1, the new value is broadcast so that P2 and P3 also hold X1.
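The following C sketch (illustrative names, three processors as in the example above) contrasts the two policies: after P1 writes a new value, write-invalidate leaves the other copies invalid, while write-update refreshes them with the new value.

```c
#include <stdio.h>

#define NPROCS 3
enum { VALID, NOT_VALID };

static int cache_val[NPROCS];    /* cached copy of X in each processor */
static int cache_state[NPROCS];  /* VALID or NOT_VALID                 */

/* Write-invalidate: the writer keeps the only valid copy. */
void write_invalidate(int writer, int new_val) {
    cache_val[writer] = new_val;
    for (int p = 0; p < NPROCS; p++)
        if (p != writer) cache_state[p] = NOT_VALID;  /* broadcast invalidate */
}

/* Write-update: the new value is broadcast and every copy is refreshed. */
void write_update(int writer, int new_val) {
    (void)writer;
    for (int p = 0; p < NPROCS; p++) {
        cache_val[p]   = new_val;                     /* broadcast the update */
        cache_state[p] = VALID;
    }
}

int main(void) {
    for (int p = 0; p < NPROCS; p++) { cache_val[p] = 10; cache_state[p] = VALID; }
    write_invalidate(0, 11);   /* P1 writes X1 = 11; P2 and P3 copies become invalid */
    /* call write_update(0, 11) instead to see the write-update behaviour */
    for (int p = 0; p < NPROCS; p++)
        printf("P%d: val=%d state=%s\n", p + 1, cache_val[p],
               cache_state[p] == VALID ? "VALID" : "INVALID");
    return 0;
}
```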
▪ Uniform memory access (UMA): All processors have access to all parts of main memory using loads
and stores. The memory access time of a processor to all regions of memory is the same, and the access
times experienced by different processors are the same. Example: SMP
▪ Non-uniform memory access (NUMA): All processors have access to all parts of main memory using
loads and stores, but the memory access time of a processor differs depending on which region of main
memory is accessed. Example: Cluster
▪ Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache coherence is maintained among
the caches of the various processors.
Non-Uniform Memory Access (NUMA)
Figure: CC-NUMA organization — N nodes, each containing processors (1-1 … 1-m, 2-1 … 2-m, …, N-1 … N-m) with private L1 caches, shared L2 caches, a directory, local main memory, and I/O, all connected through an interconnect network.
NUMA Pros and Cons
• The main advantage of a CC-NUMA system is that it can deliver effective
performance at higher levels of parallelism than SMP, without requiring
major software changes.
• The bus traffic on any individual node is limited to a demand that the bus can
handle.
• If many of the memory accesses are to remote nodes, performance begins to
break down.
• Even if the performance breakdown due to remote access is addressed, there
are two other disadvantages for the CC-NUMA approach.
• First, a CC-NUMA does not transparently look like an SMP; software
changes will be required to move an operating system and applications
from an SMP to a CC-NUMA system.
• A second concern is that of availability.
Suppose that processor 3 on node 2 (P2-3) requests memory location 798, which is in the main memory of node 1.
❑ The following sequence occurs:
1. P2-3 issues a read request on the snoopy bus of node 2 for location 798.
2. The directory on node 2 sees the request and recognizes that the location is in node 1.
3. Node 2’s directory sends a request to node 1, which is picked up by node 1’s directory.
4. Node 1’s directory, acting as a surrogate of P2-3, requests the contents of 798 as if it were a processor.
5. Node 1’s main memory responds by putting the requested data on the bus.
6. Node 1’s directory picks up the data from the bus.
7. The value is transferred back to node 2’s directory.
8. Node 2’s directory places the data back on node 2’s bus, acting as a surrogate for the memory that originally held it.
9. The value is picked up and placed in P2-3’s cache and delivered to P2-3.
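A simplified C sketch of this remote read, assuming a two-node system in which addresses below 1024 reside in node 1’s memory. The home_node mapping, the function names, and the collapsing of the bus/directory message traffic into ordinary function calls are all illustrative assumptions.

```c
#include <stdio.h>

#define WORDS_PER_NODE 1024

static int mem1[WORDS_PER_NODE];   /* node 1's local main memory */
static int mem2[WORDS_PER_NODE];   /* node 2's local main memory */

/* Assumed address mapping: low addresses are homed on node 1. */
static int home_node(int addr) { return addr < WORDS_PER_NODE ? 1 : 2; }

/* Node 1's directory, acting as a surrogate processor, fetches the
 * data from node 1's local memory (steps 4-6 above). */
static int node1_directory_fetch(int addr) {
    return mem1[addr];
}

/* Read issued on node 2's bus (step 1). The directory recognizes a
 * remote address (step 2), forwards the request across the interconnect
 * (step 3), and the returned value is finally delivered to the
 * requesting processor's cache (steps 7-9). */
int node2_read(int addr) {
    if (home_node(addr) == 2)
        return mem2[addr - WORDS_PER_NODE];   /* local access, no remote traffic */
    return node1_directory_fetch(addr);       /* remote access via the directories */
}

int main(void) {
    mem1[798] = 1234;
    printf("P2-3 reads location 798 -> %d\n", node2_read(798));
    return 0;
}
```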
Advantages of CC-NUMA