Computer Organization and Architecture

Designing for Performance


11th Edition, Global Edition

Chapter 20
Parallel Processing

Copyright © 2022 Pearson Education, Ltd. All Rights Reserved


Multiple Processor Organization

• Single instruction, single data (SISD) stream
  – Single processor executes a single instruction stream to operate on data stored in a single memory
  – Uniprocessors fall into this category

• Single instruction, multiple data (SIMD) stream
  – A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
  – Vector and array processors fall into this category

• Multiple instruction, single data (MISD) stream
  – A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence
  – Not commercially implemented

• Multiple instruction, multiple data (MIMD) stream
  – A set of processors simultaneously execute different instruction sequences on different data sets
  – SMPs, clusters, and NUMA systems fit this category


Figure 20.1
A Taxonomy of Parallel Processor
Architectures



Figure 20.2
Alternative Computer Organizations



Symmetric Multiprocessor (SMP)
A stand-alone computer with the following characteristics:
• Two or more similar processors of comparable capacity
• All processors share the same memory and I/O facilities
  – Processors are connected by a bus or other internal connection
  – Memory access time is approximately the same for each processor
• All processors share access to I/O devices
  – Either through the same channels or different channels giving paths to the same devices
• All processors can perform the same functions (hence “symmetric”)
• System controlled by an integrated operating system
  – Provides interaction between processors and their programs at the job, task, file and data element levels


Figure 20.3
Multiprogramming and Multiprocessing



Figure 20.4
Generic Block Diagram of a Tightly Coupled
Multiprocessor



Figure 20.5
Symmetric Multiprocessor Organization



The bus organization has several
attractive features:
• Simplicity
– Simplest approach to multiprocessor organization

• Flexibility
– Generally easy to expand the system by attaching more
processors to the bus

• Reliability
– The bus is essentially a passive medium and the failure of
any attached device should not cause failure of the whole
system



Disadvantages of the bus organization:
• Main drawback is performance
  – All memory references pass through the common bus
  – Performance is limited by bus cycle time
• Each processor should have cache memory
  – Reduces the number of bus accesses
• Use of caches leads to problems with cache coherence
  – If a word is altered in one cache it could conceivably invalidate a word in another cache
    ▪ To prevent this the other processors must be alerted that an update has taken place
  – Typically addressed in hardware rather than the operating system


Multiprocessor Operating System Design Considerations
• Simultaneous concurrent processes
  – OS routines need to be reentrant to allow several processors to execute the same OS code simultaneously
  – OS tables and management structures must be managed properly to avoid deadlock or invalid operations
• Scheduling
  – Any processor may perform scheduling so conflicts must be avoided
  – Scheduler must assign ready processes to available processors
• Synchronization
  – With multiple active processes having potential access to shared address spaces or I/O resources, care must be taken to provide effective synchronization
  – Synchronization is a facility that enforces mutual exclusion and event ordering (a minimal lock sketch follows this list)
• Memory management
  – In addition to dealing with all of the issues found on uniprocessor machines, the OS needs to exploit the available hardware parallelism to achieve the best performance
  – Paging mechanisms on different processors must be coordinated to enforce consistency when several processors share a page or segment and to decide on page replacement
• Reliability and fault tolerance
  – OS should provide graceful degradation in the face of processor failure
  – Scheduler and other portions of the operating system must recognize the loss of a processor and restructure accordingly
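
A minimal sketch of such a synchronization facility, assuming a C11 compiler (illustrative code, not from the text): a spinlock built on an atomic test-and-set enforces mutual exclusion among processors.

#include <stdatomic.h>

/* One lock flag shared by all processors. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* Atomically set the flag; spin while another processor holds it. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* busy-wait */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Usage: bracket any update to a shared OS table:
   acquire();  ... update shared structure ...  release();  */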



Cache Coherence (1 of 2)

Software Solutions
• Attempt to avoid the need for additional hardware circuitry
and logic by relying on the compiler and operating system to
deal with the problem
• Attractive because the overhead of detecting potential
problems is transferred from run time to compile time, and
the design complexity is transferred from hardware to
software
– However, compile-time software approaches generally must
make conservative decisions, leading to inefficient cache
utilization



Cache Coherence (2 of 2)

Hardware-Based Solutions
• Generally referred to as cache coherence protocols
• These solutions provide dynamic recognition at run time of potential inconsistency conditions
• Because the problem is only dealt with when it actually arises there is more effective use of caches, leading to improved performance over a software approach
• Approaches are transparent to the programmer and the compiler, reducing the software development burden
• Can be divided into two categories:
  – Directory protocols
  – Snoopy protocols

Directory Protocols
• Collect and maintain information about copies of data in caches
• Directory stored in main memory
  – Creates a central bottleneck
• Requests are checked against the directory and appropriate transfers are performed
• Effective in large-scale systems with complex interconnection schemes


Snoopy Protocols
• Distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor
  – A cache must recognize when a line that it holds is shared with other caches
  – When updates are performed on a shared cache line, it must be announced to other caches by a broadcast mechanism
  – Each cache controller is able to “snoop” on the network to observe these broadcast notifications and react accordingly
• Suited to bus-based multiprocessors because the shared bus provides a simple means for broadcasting and snooping
  – Care must be taken that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches
• Two basic approaches have been explored:
  – Write invalidate
  – Write update (or write broadcast)

Write Invalidate
• Multiple readers, but only one writer at a time
• When a write is required, the copies of the line held in all other caches are invalidated
• Writing processor then has exclusive (cheap) access until the line is required by another processor
• Most widely used in commercial multiprocessor systems such as the x86 architecture
• State of every line is marked as modified, exclusive, shared or invalid
  – For this reason the write-invalidate protocol is called MESI


Write Update

Can be multiple readers and writers

When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it

Some systems use an adaptive mixture of both write-invalidate and write-update mechanisms


MESI Protocol
To provide cache consistency on an SMP the data cache supports a protocol known as MESI:
• Modified
  – The line in the cache has been modified and is available only in this cache
• Exclusive
  – The line in the cache is the same as that in main memory and is not present in any other cache
• Shared
  – The line in the cache is the same as that in main memory and may be present in another cache
• Invalid
  – The line in the cache does not contain valid data

Table 20.1
MESI Cache Line States

                                 M             E             S                 I
                                 Modified      Exclusive     Shared            Invalid

This cache line valid?           Yes           Yes           Yes               No
The memory copy is ...           out of date   valid         valid             -
Copies exist in other caches?    No            No            Maybe             Maybe
A write to this line ...         does not go   does not go   goes to bus and   goes directly
                                 to bus        to bus        updates cache     to bus



Figure 20.6
MESI State Transition Diagram



Figure 20.7
Relationship Between Cache Lines in Cooperating
Caches



Read Miss
• When a read miss occurs in the local cache, the
processor initiates a memory read to read the line of
main memory containing the missing address
• The processor inserts a signal on the bus that alerts all
other processor/cache units to snoop the transaction
• There are a number of possible outcomes resulting from
this process



Read Hit

When a read hit occurs on a line currently in the local cache, the processor simply reads the required item

There is no state change: the state remains modified, shared, or exclusive


Write Miss
• When a write miss occurs in the local cache, the processor
initiates a memory read to read the line of main memory
containing the missing address
• For this purpose, the processor issues a signal on the bus
that means read-with-intent-to-modify (RWITM)
• When the line is loaded, it is immediately marked modified
• With respect to other caches, two possible scenarios precede
the loading of the line of data
▪ Some other cache may have a modified copy of this line
▪ No other cache has a modified copy of the requested line



Write Hit
• When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache:
• Shared
  – Before performing the update, the processor must gain exclusive ownership of the line
  – The processor signals its intent on the bus
  – Each processor that has a shared copy of the line in its cache transitions the sector from shared to invalid
  – The initiating processor then performs the update and transitions its copy of the line from shared to modified
• Exclusive
  – The processor already has exclusive control of this line, and so it simply performs the update and transitions its copy of the line from exclusive to modified
• Modified
  – The processor already has exclusive control of this line and has the line marked as modified, and so it simply performs the update
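
These write-hit rules, together with the write-miss rule from the previous slide, map directly onto a state-transition function. The sketch below is illustrative only; the stub helpers are hypothetical stand-ins for real bus signals, since actual coherence logic lives in the cache controller hardware.

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state_t;

/* Hypothetical stand-ins for bus signals a real cache controller asserts. */
static void bus_invalidate(int line)  { (void)line; /* broadcast: invalidate your copies */ }
static void bus_rwitm(int line)       { (void)line; /* read-with-intent-to-modify (RWITM) */ }

/* New state of a cache line after the local processor writes to it. */
static mesi_state_t on_local_write(mesi_state_t state, int line)
{
    switch (state) {
    case MODIFIED:      /* write hit, already dirty and exclusive: no bus traffic */
        return MODIFIED;
    case EXCLUSIVE:     /* write hit, clean and private: no bus traffic */
        return MODIFIED;
    case SHARED:        /* write hit on shared line: invalidate other copies first */
        bus_invalidate(line);
        return MODIFIED;
    case INVALID:       /* write miss: fetch the line with RWITM, then modify */
        bus_rwitm(line);
        return MODIFIED;
    }
    return INVALID;     /* unreachable */
}
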
Figure 20.8
Initiator Reads from Writeback Cache



Figure 20.9
Initiator Writes to Writeback Cache



Multithreading and Chip Multiprocessors
• Processor performance can be measured by the rate at which it executes instructions
• MIPS rate = f * IPC (a worked example follows this list)
  – f = processor clock frequency, in MHz
  – IPC = average instructions per cycle
• Increase performance by increasing clock frequency and increasing the number of instructions that complete during a cycle
• Multithreading
  – Allows for a high degree of instruction-level parallelism without increasing circuit complexity or power consumption
  – Instruction stream is divided into several smaller streams, known as threads, that can be executed in parallel
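
For concreteness, a worked example with assumed numbers (not from the text): a processor clocked at 2 GHz (f = 2,000 MHz) that completes an average of 1.5 instructions per cycle achieves

MIPS rate = f * IPC = 2,000 * 1.5 = 3,000 MIPS

that is, three billion instructions per second.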



Definitions of Threads and Processes

A thread in multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system. A thread is concerned with scheduling and execution, whereas a process is concerned with both scheduling/execution and resource ownership.

Thread:
• Dispatchable unit of work within a process
• Includes processor context (which includes the program counter and stack pointer) and data area for stack
• Executes sequentially and is interruptible so that the processor can turn to another thread

Process:
• An instance of a program running on a computer
• Two key characteristics:
  – Resource ownership
  – Scheduling/execution

Thread switch:
• The act of switching processor control between threads within the same process
• Typically less costly than a process switch

Process switch:
• Operation that switches the processor from one process to another by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second
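
To make the distinction concrete, here is a minimal sketch assuming a POSIX system (standard pthreads calls; the worker function and counter are illustrative, not from the text). Two threads are dispatched within one process and therefore share its global data:

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;   /* process-owned resource, visible to all its threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread has its own stack and program counter but shares
   the process's address space. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;         /* same variable seen by both threads */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);   /* 200000 */
    return 0;
}

(Compile with cc -pthread. A process switch would instead save and restore all of this state for a different address space.)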



Implicit and Explicit Multithreading
• All commercial processors and most experimental ones use explicit multithreading
  – Concurrently execute instructions from different explicit threads
  – Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines
• Implicit multithreading is concurrent execution of multiple threads extracted from a single sequential program
  – Implicit threads defined statically by compiler or dynamically by hardware


Approaches to Explicit Multithreading
• Interleaved
  – Fine-grained
  – Processor deals with two or more thread contexts at a time
  – Switching thread at each clock cycle
  – If thread is blocked it is skipped
• Blocked
  – Coarse-grained
  – Thread executed until event causes delay
  – Effective on in-order processor
  – Avoids pipeline stall
• Simultaneous (SMT)
  – Instructions are simultaneously issued from multiple threads to execution units of superscalar processor
• Chip multiprocessing
  – Processor is replicated on a single chip
  – Each processor handles separate threads
  – Advantage is that the available logic area on a chip is used effectively


Figure 20.10
Approaches to Executing Multiple Threads



Clusters
• Alternative to SMP as an approach to providing high performance and high availability

• Particularly attractive for server applications

• Defined as:
– A group of interconnected whole computers working together as a unified computing
resource that can create the illusion of being one machine
– (The term whole computer means a system that can run on its own, apart from the
cluster)

• Each computer in a cluster is called a node

• Benefits:
– Absolute scalability
– Incremental scalability
– High availability
– Superior price/performance

Figure 20.11
Cluster Configurations



Table 20.2
Clustering Methods: Benefits and Limitations

Passive Standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost because the secondary server is unavailable for other processing tasks.

Active Secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost because secondary servers can be used for processing.
  Limitations: Increased complexity.

Separate Servers
  Description: Separate servers have their own disks. Data is continuously copied from primary to secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

Servers Connected to Disks
  Description: Servers are cabled to the same disks, but each server owns its disks. If one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for risk of disk failure.

Servers Share Disks
  Description: Multiple servers simultaneously share access to disks.
  Benefits: Low network and server overhead. Reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software. Usually used with disk mirroring or RAID technology.


Nonuniform Memory Access (NUMA)

• Alternative to SMP and clustering

• Uniform memory access (UMA)
  – All processors have access to all parts of main memory using loads and stores
  – Access time to all regions of memory is the same
  – Access time to memory for different processors is the same

• Nonuniform memory access (NUMA)
  – All processors have access to all parts of main memory using loads and stores
  – Access time of a processor differs depending on which region of main memory is being accessed
  – Different processors access different regions of memory at different speeds

• Cache-coherent NUMA (CC-NUMA)
  – A NUMA system in which cache coherence is maintained among the caches of the various processors


Motivation

SMP has a practical limit to the number of processors that can be used
• Bus traffic limits the number to between 16 and 64 processors

In clusters each node has its own private main memory
• Applications do not see a large global memory
• Coherency is maintained by software rather than hardware

Objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or internal interconnect system

NUMA retains the SMP flavor while giving large-scale multiprocessing


Figure 20.12
CC-NUMA Organization



NUMA Pros and Cons

Pros:
• Main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP without requiring major software changes
• Bus traffic on any individual node is limited to a demand that the bus can handle

Cons:
• If many of the memory accesses are to remote nodes, performance begins to break down
• Does not transparently look like an SMP; software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
• Concern with availability

Summary
Chapter 20: Parallel Processing

• Multiple processor organizations
  – Types of parallel processor systems
  – Parallel organizations
• Symmetric multiprocessors
  – Organization
  – Multiprocessor operating system design considerations
• Cache coherence and the MESI protocol
  – Software solutions
  – Hardware solutions
  – The MESI protocol
• Multithreading and chip multiprocessors
  – Implicit and explicit multithreading
  – Approaches to explicit multithreading
• Clusters
  – Cluster configurations
• Nonuniform memory access
  – Motivation
  – Organization
  – NUMA pros and cons
