Lecture-7 SMP NUMA Cache Coherence
Characteristics of Symmetric Multiprocessor (SMP)
1. There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are interconnected by a
bus or other internal connection scheme, such that memory access time is approximately the
same for each processor.
3. All processors share access to I/O devices, either through the same channels or through
different channels that provide paths to the same device.
4. All processors can perform the same functions (hence the term symmetric).
Advantages of Symmetric Multiprocessor (SMP)
1. Performance: If the work to be done by a computer can be organized so that some portions of the
work can be done in parallel, then a system with multiple processors will yield greater performance
than one with a single processor of the same type.
2. Availability: In a symmetric multiprocessor, because all processors can perform the same functions,
the failure of a single processor does not halt the machine. Instead, the system can continue to
function at reduced performance.
3. Incremental growth: A user can enhance the performance of a system by adding an additional
processor.
4. Scaling: Vendors can offer a range of products with different price and performance characteristics
based on the number of processors configured in the system.
Disadvantages of Symmetric Multiprocessor (SMP)
1. Complexity: SMP systems are more complex than uniprocessor systems, and require additional
hardware and software to manage multiple processors and memory modules.
2. Higher cost: SMP systems are more expensive than uniprocessor systems, as they require additional
hardware and memory to support multiple processors.
3. Increased power consumption: SMP systems consume more power than uniprocessor systems, due
to the additional processors and memory modules.
4. Limited scalability: While SMP systems can be scaled up to a certain point, they may not be able to
handle extremely large workloads as efficiently as a distributed system.
5. Memory contention: In an SMP system, all processors share the same memory space, which can
lead to contention for memory resources and reduce system performance.
Tightly Coupled Multiprocessor
Figure: Tightly coupled multiprocessor — processors and I/O modules connected to main memory through an interconnection network.
❑ To facilitate DMA transfers from I/O subsystems to processors, the following features are
provided:
▪ Arbitration: Any I/O module can temporarily function as “master.” A mechanism is provided to
arbitrate competing requests for bus control, using some sort of priority scheme.
▪ Time-sharing: When one module is controlling the bus, other modules are locked out and must, if
necessary, suspend operation until bus access is achieved.
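As a small illustration of the priority scheme mentioned above, the C sketch below models a fixed-priority bus arbiter: each I/O module raises a request bit, and the lowest-numbered requester (an assumed priority order, not something specified in the lecture) is granted bus mastership while the others must wait.

```c
#include <stdint.h>

/* req is a bitmask of pending bus requests (bit i = module i).
 * Returns the index of the module granted the bus, or -1 if idle. */
int arbitrate(uint32_t req) {
    for (int i = 0; i < 32; i++)      /* scan in assumed priority order   */
        if (req & (1u << i))
            return i;                 /* grant to highest-priority module */
    return -1;                        /* no module is requesting the bus  */
}
```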
SMP Organization
Figure: SMP organization — processors, main memory, and I/O adapters (I/O subsystem) all attached to a shared bus.
None of the processors writes its updated value back to shared memory, so shared memory finally holds X = 10 while the local caches hold X = 9, 11, 12, and 13, giving an inconsistent view of memory. This is called the cache coherence problem.
Cache Coherence Problem
❑Multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed
to update their own copies freely, an inconsistent view of memory can result.
❑Cache coherence refers to the problem of keeping the data in these caches consistent.
❑The main problem is dealing with writes by a processor. There are two general strategies for dealing with writes:
1. Write-through - all data written to the cache is also written to memory at the same time.
2. Write-back - write operations are usually made only to the cache. The modified block is
written to memory only when the block is replaced. It is clear that a write-back policy can result in
inconsistency.
Cache Coherence Problem
• Write-through caches are simpler, and they automatically deal with the cache coherence problem, but
they increase bus traffic significantly.
• Write-back caches are more common where higher performance is desired. The MESI cache
coherence protocol is one of the simpler write-back protocols.
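A minimal C sketch contrasting the two write policies described above: write-through updates main memory on every write, while write-back defers the memory update until the dirty line is evicted. The names (cache_line_t, evict, the direct-mapped layout) are illustrative assumptions, not part of the lecture.

```c
#include <stdbool.h>
#include <stdio.h>

#define LINES 4

typedef struct {
    int  data;
    bool valid;
    bool dirty;   /* meaningful only for write-back */
} cache_line_t;

static int          memory[LINES];   /* stand-in for main memory */
static cache_line_t cache[LINES];    /* one cache line per word   */

/* Write-through: every write updates the cache AND main memory. */
void write_through(int idx, int value) {
    cache[idx].data  = value;
    cache[idx].valid = true;
    memory[idx]      = value;         /* extra bus traffic on every write */
}

/* Write-back: writes go only to the cache; memory is updated later. */
void write_back(int idx, int value) {
    cache[idx].data  = value;
    cache[idx].valid = true;
    cache[idx].dirty = true;          /* main memory is now stale */
}

/* Eviction is the only point at which a write-back line reaches memory. */
void evict(int idx) {
    if (cache[idx].valid && cache[idx].dirty)
        memory[idx] = cache[idx].data;
    cache[idx].valid = cache[idx].dirty = false;
}

int main(void) {
    write_through(0, 42);             /* memory[0] is 42 immediately       */
    write_back(1, 99);                /* memory[1] still holds the old 0   */
    printf("before evict: mem[1]=%d\n", memory[1]);
    evict(1);
    printf("after  evict: mem[1]=%d\n", memory[1]);
    return 0;
}
```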
The MESI Protocol
• To provide cache consistency on an SMP, the data cache often
supports a protocol known as MESI. For MESI, the data cache
includes two status bits per tag, so that each line can be in one
of four states:
• ■■ Modified: The line in the cache has been modified (different from main
memory) and is available only in this cache.
• ■■ Exclusive: The line in the cache is the same as that in main memory
and is not present in any other cache.
• ■■ Shared: The line in the cache is the same as that in main memory and
may be present in another cache.
• ■■ Invalid: The line in the cache does not contain valid data.
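As a rough illustration (not part of the lecture figure), the C sketch below encodes the four MESI states and one representative transition: a write hit by the local processor. The function name and the broadcast flag are assumptions; the bus-side (snooping) transitions are omitted.

```c
#include <stdbool.h>

/* The four MESI line states, encodable in two status bits per tag. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

/* New state of a line after the local processor writes to it (write hit). */
mesi_state_t on_local_write_hit(mesi_state_t s, bool *must_broadcast_invalidate) {
    *must_broadcast_invalidate = false;
    switch (s) {
    case SHARED:                            /* other caches may hold the line  */
        *must_broadcast_invalidate = true;  /* tell them to invalidate it      */
        return MODIFIED;
    case EXCLUSIVE:                         /* only copy: silently becomes dirty */
    case MODIFIED:                          /* already dirty and exclusive       */
        return MODIFIED;
    case INVALID:                           /* not a hit; handled as a write miss */
    default:
        return INVALID;
    }
}
```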
Figure: MESI state transition diagram — (a) line in cache at initiating processor; (b) line in snooping cache.
Solution to Cache Coherence Problem
❖Software Solution
▪ Software solutions rely on the compiler and operating system to deal with the problem at compile time rather than detecting it at run time.
▪ Prevention:
▪ The simplest approach is to prevent any shared data variables from being cached. This is too
conservative, because a shared data structure may be exclusively used during some periods and may
be effectively read-only during other periods.
▪ More efficient approaches analyze the code to determine safe periods for shared variables. The
compiler then inserts instructions into the generated code to enforce cache coherence during the
critical periods.
Solution to Cache Coherence Problem
❖Hardware Solution
▪ Hardware solutions provide dynamic recognition at run time of potential
inconsistency conditions. Because the problem is only dealt with when it
actually arises, there is more effective use of caches, leading to improved
performance over a software approach.
▪ Hardware schemes fall into two categories:
1. Directory protocols
2. Snoopy protocols
H/W Solution to Cache Coherence Problem
❖Directory Based Protocol
▪ Directory protocols collect and maintain information about where copies of lines
reside.
▪ Typically, there is a centralized controller that is part of the main memory controller, and a
directory that is stored in main memory.
Solution to Cache Coherence Problem
❖Directory Based Protocol
▪ Directory protocols collect and maintain information about copies of data in the caches.
▪ They are effective in large-scale systems with complex interconnection schemes.
▪ When an individual cache controller makes a request, the centralized controller checks and
issues the necessary commands for data transfer between memory and caches or between the caches
themselves.
▪ It is also responsible for keeping the state information up to date; therefore, every local action
that can affect the global state of a line must be reported to the central controller.
▪ The controller maintains information about which processors have a copy of which lines.
▪ Before a processor can write to a local copy of a line, it must request exclusive access to the
line from the controller.
▪ Before granting this exclusive access, the controller sends a message to all processors
with a cached copy of this line, forcing each processor to invalidate its copy.
▪ After receiving an acknowledgement back from each such processor, the controller grants
exclusive access to the requesting processor.
▪ When another processor tries to read a line that is exclusively granted to some other
processor, it will send a miss notification to the controller.
▪ The controller then issues a command to the processor holding that line, requiring that
processor to write the line back to main memory.
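A hypothetical C sketch of the write-request handling just described: the directory keeps a presence bit vector of sharers per line, and before granting exclusive access the controller invalidates every other cached copy and waits for acknowledgements. All names (dir_entry_t, send_invalidate, wait_for_ack) are illustrative stubs, not a real controller interface.

```c
#include <stdint.h>

#define MAX_PROCS 32

/* One directory entry per memory line. */
typedef struct {
    uint32_t sharers;   /* bit i set => processor i holds a copy          */
    int      owner;     /* processor with exclusive access, or -1 if none */
} dir_entry_t;

/* Stubs standing in for messages over the interconnect. */
static void send_invalidate(int p) { (void)p; /* controller -> processor p */ }
static void wait_for_ack(int p)    { (void)p; /* acknowledgement from p    */ }

/* Called by the centralized controller when processor p asks to write. */
void request_exclusive(dir_entry_t *e, int p) {
    /* Invalidate every other cached copy of the line and wait for acks. */
    for (int i = 0; i < MAX_PROCS; i++) {
        if (i != p && (e->sharers & (1u << i))) {
            send_invalidate(i);
            wait_for_ack(i);
            e->sharers &= ~(1u << i);
        }
    }
    /* Only now is exclusive access granted to the requester. */
    e->sharers = 1u << p;
    e->owner   = p;
}
```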
Solution to Cache Coherence Problem
❖Applications
▪ Directory protocols are used in scalable systems, such as the CC-NUMA (Cache-Coherent
Non-Uniform Memory Access) architecture.
Solution to Cache Coherence Problem
❖ Snoopy Bus Protocol
▪ Snoopy protocols distribute the responsibility for maintaining cache coherence among all of the cache controllers in a
multiprocessor system.
▪ A cache must recognize when a line that it holds is shared with other caches.
▪ When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast
mechanism.
▪ Each cache controller is able to “snoop” on the network to observe these broadcast notifications and react
accordingly.
▪ Snoopy protocols are ideally suited to a bus-based multiprocessor, because the shared bus provides a simple means
for broadcasting and snooping.
Solution to Cache Coherence Problem
❖ Snoopy Bus Protocol
1. Write invalidate
➢ When a local cache copy is modified, it invalidates all other remote copies in the other
caches (invalidated items are also called ‘dirty’).
2. Write update (write broadcast)
➢ When a local cache copy is modified, the modified value of the data object is broadcast
to all other caches at the time of modification.
Solution to Cache Coherence Problem
Figure: Write invalidate — P1, P2, and P3 each cache X; after P1 writes the new value X1, the copies in P2 and P3 are marked invalid (I). Write update — after P1 writes X1, the new value is broadcast so that P2 and P3 also hold X1.
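The following C sketch (illustrative names, three processors as in the example above) contrasts the two policies: after P1 writes a new value, write-invalidate leaves the other copies invalid, while write-update refreshes them with the new value.

```c
#include <stdio.h>

#define NPROCS 3
enum { VALID, NOT_VALID };

static int cache_val[NPROCS];    /* cached copy of X in each processor */
static int cache_state[NPROCS];  /* VALID or NOT_VALID                 */

/* Write-invalidate: the writer keeps the only valid copy. */
void write_invalidate(int writer, int new_val) {
    cache_val[writer] = new_val;
    for (int p = 0; p < NPROCS; p++)
        if (p != writer) cache_state[p] = NOT_VALID;  /* broadcast invalidate */
}

/* Write-update: the new value is broadcast and every copy is refreshed. */
void write_update(int writer, int new_val) {
    (void)writer;
    for (int p = 0; p < NPROCS; p++) {
        cache_val[p]   = new_val;                     /* broadcast the update */
        cache_state[p] = VALID;
    }
}

int main(void) {
    for (int p = 0; p < NPROCS; p++) { cache_val[p] = 10; cache_state[p] = VALID; }
    write_invalidate(0, 11);   /* P1 writes X1 = 11; P2 and P3 copies become invalid */
    /* call write_update(0, 11) instead to see the write-update behaviour */
    for (int p = 0; p < NPROCS; p++)
        printf("P%d: val=%d state=%s\n", p + 1, cache_val[p],
               cache_state[p] == VALID ? "VALID" : "INVALID");
    return 0;
}
```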
▪ Uniform memory access (UMA): All processors have access to all parts of main memory using loads
and stores. The memory access time of a processor to all regions of memory is the same, and the access
times experienced by different processors are the same. Example: SMP
▪ Non-uniform memory access (NUMA): All processors have access to all parts of main memory using
loads and stores, but the memory access time of a processor differs depending on which region of main
memory is accessed. Example: Cluster
▪ Cache-coherent NUMA (CC-NUMA): A NUMA system in which cache coherence is maintained among
the caches of the various processors.
Non-Uniform Memory Access (NUMA)
Figure: CC-NUMA organization — N nodes, each containing processors (1-1 … 1-m, 2-1 … 2-m, …, N-1 … N-m) with private L1 caches, shared L2 caches, a directory, local main memory, and I/O, all connected through an interconnect network.
NUMA Pros and Cons
• The main advantage of a CC-NUMA system is that it can deliver effective
performance at higher levels of parallelism than SMP, without requiring
major software changes.
• The bus traffic on any individual node is limited to a demand that the bus can
handle.
• If many of the memory accesses are to remote nodes, performance begins to
break down.
• Even if the performance breakdown due to remote access is addressed, there
are two other disadvantages for the CC-NUMA approach.
• First, a CC-NUMA does not transparently look like an SMP; software
changes will be required to move an operating system and applications
from an SMP to a CC-NUMA system.
• A second concern is that of availability.
Suppose that processor 3 on node 2 (P2-3) requests memory location 798, which is in the main memory of node 1.
❑ The following sequence occurs:
1. P2-3 issues a read request on the snoopy bus of node 2 for location 798.
2. The directory on node 2 sees the request and recognizes that the location is in node 1.
3. Node 2’s directory sends a request to node 1, which is picked up by node 1’s directory.
4. Node 1’s directory, acting as a surrogate of P2-3, requests the contents of 798 as if it were a processor.
5. Node 1’s main memory responds by putting the requested data on the bus.
6. Node 1’s directory picks up the data from the bus.
7. The value is transferred back to node 2’s directory.
8. Node 2’s directory places the data back on node 2’s bus, acting as a surrogate for the memory that originally held it.
9. The value is picked up and placed in P2-3’s cache and delivered to P2-3.
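A simplified C sketch of this remote read, assuming a two-node system in which addresses below 1024 reside in node 1’s memory. The home_node mapping, the function names, and the collapsing of the bus/directory message traffic into ordinary function calls are all illustrative assumptions.

```c
#include <stdio.h>

#define WORDS_PER_NODE 1024

static int mem1[WORDS_PER_NODE];   /* node 1's local main memory */
static int mem2[WORDS_PER_NODE];   /* node 2's local main memory */

/* Assumed address mapping: low addresses are homed on node 1. */
static int home_node(int addr) { return addr < WORDS_PER_NODE ? 1 : 2; }

/* Node 1's directory, acting as a surrogate processor, fetches the
 * data from node 1's local memory (steps 4-6 above). */
static int node1_directory_fetch(int addr) {
    return mem1[addr];
}

/* Read issued on node 2's bus (step 1). The directory recognizes a
 * remote address (step 2), forwards the request across the interconnect
 * (step 3), and the returned value is finally delivered to the
 * requesting processor's cache (steps 7-9). */
int node2_read(int addr) {
    if (home_node(addr) == 2)
        return mem2[addr - WORDS_PER_NODE];   /* local access, no remote traffic */
    return node1_directory_fetch(addr);       /* remote access via the directories */
}

int main(void) {
    mem1[798] = 1234;
    printf("P2-3 reads location 798 -> %d\n", node2_read(798));
    return 0;
}
```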
Advantages of CC-NUMA