COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface, 5th Edition

Parallel Processors from Client to Cloud
Introduction
 Goal: connecting multiple computers to get higher performance
 Multiprocessors
 Scalability, availability, power efficiency
 Task-level (process-level) parallelism
 High throughput for independent jobs
 Parallel processing program
 Single program run on multiple processors
 Multicore microprocessors
 Chips with multiple processors (cores)
Parallelism Basics: Introduction
 Parallelism and Instructions
 Synchronization
 Parallelism and Computer Arithmetic
 Subword Parallelism
 Parallelism and Advanced Instruction-Level Parallelism
 Parallelism and Memory Hierarchies
 Cache Coherence
Parallel Computers
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”

Questions about parallel computers:
How large a collection?
How powerful are the processing elements?
How do they cooperate and communicate?
How are data transmitted?
What type of interconnection?
What are the HW and SW primitives for the programmer?
Does it translate into performance?
What Level of Parallelism?
Bit-level parallelism: 1970 to ~1985
4-bit, 8-bit, 16-bit, 32-bit microprocessors
Instruction-level parallelism (ILP): ~1985 through today
Pipelining
Superscalar
VLIW
Out-of-order execution
Limits to the benefits of ILP?
Process-level or thread-level parallelism: mainstream for general-purpose computing?
Servers are parallel
High-end desktop dual-processor PC
Why Multiprocessors?
1. Microprocessors are the fastest CPUs
• Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
• Do we have enough ideas to sustain 2X performance every 1.5 years?
• Can we deliver such complexity on schedule?
• Limit to ILP due to data dependencies
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
• Embedded functional parallelism
• Network processors exploiting packet-level parallelism
• SMP servers and clusters of workstations for multiple users – less demand for parallel computing
Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in parallel
To increase ILP:
Deeper pipeline
Less work per stage ⇒ shorter clock cycle
Multiple issue
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4 GHz, 4-way multiple issue:
4 × 10⁹ cycles/s × 4 instructions/cycle = 16 BIPS; peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into “issue slots”
Compiler detects and avoids hazards
Dynamic multiple issue
CPU examines instruction stream and chooses instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at runtime
Speculation
“Guess” what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll back and do the right thing
Common to static and dynamic multiple issue
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate on load
Roll back if location is updated
Speculation and Exceptions
What if an exception occurs on a speculatively executed instruction?
e.g., speculative load before null-pointer check
Static speculation
Can add ISA support for deferring exceptions
Dynamic speculation
Can buffer exceptions until instruction completion (which may not occur)
Static Multiple Issue
Compiler groups instructions into “issue packets”
• Group of instructions that can be issued in a single cycle
• Determined by the pipeline resources required
Think of an issue packet as a very long instruction
• Specifies multiple concurrent operations
• ⇒ Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
Compiler must remove some/all hazards
• Reorder instructions into issue packets
• No dependencies within a packet
• Possibly some dependencies between packets
• Varies between ISAs; compiler must know!
• Pad with nops if necessary
Dynamic Multiple Issue
“Superscalar” processors
CPU decides whether to issue 0, 1, 2, … instructions each cycle
• Avoiding structural and data hazards
Avoids the need for compiler scheduling
• Though it may still help
• Code semantics ensured by the CPU
Issues in Multiple Issue Processors
• True Data Dependency
• Procedural Dependency
• Resource Conflicts
• Output Dependency
• Antidependency
Instruction Issue Policy
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
Amdahl’s Law and Parallel Computers
A sequential portion limits parallel speedup:
Speedup = 1 / [(1 − FracX) + FracX/P] ≤ 1 / (1 − FracX)
where FracX is the fraction that can run in parallel on P processors.
Ex. How much can be sequential if we want 80× speedup from 100 processors? Assume either 1 processor or all 100 are fully used.

80 = 1 / [FracX/100 + (1 − FracX)]
0.8 × FracX + 80 × (1 − FracX) = 80 − 79.2 × FracX = 1
FracX = (80 − 1)/79.2 = 0.9975
Only 0.25% sequential!
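
A minimal sketch of this formula in C (the function name and test values are illustrative only, not from the slides):

#include <stdio.h>

/* Amdahl's Law: speedup for parallel fraction f on p processors. */
double speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* The slide's example: f = 0.9975 on 100 processors gives ~80x. */
    printf("speedup = %.1f\n", speedup(0.9975, 100));
    return 0;
}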
Strong vs Weak Scaling
 Strong scaling: problem size fixed
 As in the example above
 Weak scaling: problem size proportional to number of processors
 10 processors, 10 × 10 matrix
 Time = 10 × tadd + 100/10 × tadd = 20 × tadd
 100 processors, 32 × 32 matrix
 Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
 Constant performance in this example
Instruction and Data Streams
 An alternate classification

                          Data Streams
                          Single                   Multiple
 Instruction   Single     SISD:                    SIMD:
 Streams                  Intel Pentium 4          SSE instructions of x86
               Multiple   MISD:                    MIMD:
                          No examples today        Intel Xeon e5345

 SPMD: Single Program, Multiple Data
 A parallel program on a MIMD computer
 Conditional code for different processors
Vector Processors
 Highly pipelined function units
 Stream data from/to vector registers to units
 Data collected from memory into registers
 Results stored from registers to memory
 Example: Vector extension to MIPS
 32 × 64-element registers (64-bit elements)
 Vector instructions
 lv, sv: load/store vector
 addv.d: add vectors of double
 addvs.d: add scalar to each element of vector of double
 Significantly reduces instruction-fetch bandwidth
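
As a concrete illustration, a C sketch of DAXPY (Y = a·X + Y), the classic loop such vector instructions target; the mapping described in the comment assumes mulvs.d, a vector-scalar multiply from the same MIPS vector extension, which this slide does not list:

/* DAXPY: Y = a*X + Y. With the MIPS vector extension, a 64-element
   strip of this loop compiles to a handful of vector instructions
   (lv, mulvs.d, addv.d, sv) instead of a 64-iteration scalar loop,
   which is where the instruction-fetch savings come from. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}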
Vector vs. Scalar
 Vector architectures and compilers
 Simplify data-parallel programming
 Explicit statement of absence of loop-carried dependences
 Reduced checking in hardware
 Regular access patterns benefit from interleaved and burst memory
 Avoid control hazards by avoiding loops
 More general than ad hoc media extensions (such as MMX, SSE)
 Better match with compiler technology
SIMD
 Operate elementwise on vectors of data
 E.g., MMX and SSE instructions in x86
 Multiple data elements in 128-bit wide registers
 All processors execute the same instruction at the same time
 Each with different data address, etc.
 Simplifies synchronization
 Reduced instruction control hardware
 Works best for highly data-parallel applications
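
A concrete sketch of this subword parallelism in C, using the x86 SSE intrinsics (one _mm_add_ps adds four 32-bit floats packed in a 128-bit register):

#include <xmmintrin.h>   /* SSE intrinsics, available on x86 compilers */

/* Add four packed single-precision floats with one SIMD instruction. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned ok) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* elementwise add, then store */
}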
Vector vs. Multimedia Extensions
 Vector instructions have a variable vector width; multimedia extensions have a fixed width
 Vector instructions support strided access; multimedia extensions do not
 Vector units can be a combination of pipelined and arrayed functional units
Hardware Multithreading
 Performing multiple threads of execution in parallel
 Replicate registers, PC, etc.
 Fast switching between threads
 Fine-grain multithreading
 Switch threads after each cycle
 Interleave instruction execution
 If one thread stalls, others are executed
 Coarse-grain multithreading
 Only switch on long stall (e.g., L2-cache miss)
 Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Simultaneous Multithreading
 In a multiple-issue dynamically scheduled processor
 Schedule instructions from multiple threads
 Instructions from independent threads execute when function units are available
 Within threads, dependencies handled by scheduling and register renaming
 Example: Intel Pentium-4 HT
 Two threads: duplicated registers, shared function units and caches
Multithreading Example
Future of Multithreading
 Will it survive? In what form?
 Power considerations ⇒ simplified microarchitectures
 Simpler forms of multithreading
 Tolerating cache-miss latency
 Thread switch may be most effective
 Multiple simple cores might share resources more effectively
Shared Memory
 SMP: shared memory multiprocessor
 Hardware provides single physical address space for all processors
 Synchronize shared variables using locks
 Memory access time
 UMA (uniform) vs. NUMA (nonuniform)
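
A minimal sketch of lock-based synchronization of a shared variable in C, here using POSIX threads (the slides don't prescribe an API; pthreads is just one common choice):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* only one thread updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000: the lock loses no updates */
    return 0;
}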
Example: Sum Reduction
 Sum 100,000 numbers on a 100-processor UMA
 Each processor has ID: 0 ≤ Pn ≤ 99
 Partition: 1000 numbers per processor
 Initial summation on each processor:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

 Now need to add these partial sums
 Reduction: divide and conquer
 Half the processors add pairs, then a quarter, …
 Need to synchronize between reduction steps
Example: Sum Reduction

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets the missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
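
For comparison, a sketch of the same reduction against a real shared-memory API: OpenMP partitions the iterations and combines the per-thread partial sums automatically (OpenMP is not named on the slide; this assumes a compiler flag such as -fopenmp):

#include <stdio.h>

#define N 100000

static double A[N];

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;   /* stand-in data */

    double sum = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP then
       performs the combining step the slides' pseudocode does by hand. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum = sum + A[i];

    printf("sum = %.0f\n", sum);              /* 100000 */
    return 0;
}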
§6.6 Introduction to Graphics Processing Units
History of GPUs
 Early video cards
 Frame buffer memory with address generation for video output
 3D graphics processing
 Originally high-end computers (e.g., SGI)
 Moore’s Law ⇒ lower cost, higher density
 3D graphics cards for PCs and game consoles
 Graphics Processing Units
 Processors oriented to 3D graphics tasks
 Vertex/pixel processing, shading, texture mapping, rasterization
Graphics in the System
GPU Architectures
 Processing is highly data-parallel
 GPUs are highly multithreaded
 Use thread switching to hide memory latency
 Less reliance on multi-level caches
 Graphics memory is wide and high-bandwidth
 Trend toward general-purpose GPUs
 Heterogeneous CPU/GPU systems
 CPU for sequential code, GPU for parallel code
 Programming languages/APIs
 DirectX, OpenGL
 C for Graphics (Cg), High Level Shader Language (HLSL)
 Compute Unified Device Architecture (CUDA)
Example: NVIDIA Tesla
[Figure: streaming multiprocessor containing 8 × streaming processors]
Example: NVIDIA Tesla
 Streaming Processors
 Single-precision FP and integer units
 Each SP is fine-grained multithreaded
 Warp: group of 32 threads
 Executed in parallel, SIMD style: on 8 SPs over 4 clock cycles
 Hardware contexts for 24 warps
 Registers, PCs, …
Classifying GPUs
 Don’t fit nicely into the SIMD/MIMD model
 Conditional execution in a thread allows an illusion of MIMD
 But with performance degradation
 Need to write general-purpose code with care

                               Static: Discovered     Dynamic: Discovered
                               at Compile Time        at Runtime
 Instruction-Level Parallelism VLIW                   Superscalar
 Data-Level Parallelism        SIMD or Vector         Tesla Multiprocessor
GPU Memory Structures
§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
 Each processor has a private physical address space
 Hardware sends/receives messages between processors
Loosely Coupled Clusters
 Network of independent computers
 Each has private memory and OS
 Connected using I/O system
 E.g., Ethernet/switch, Internet
 Suitable for applications with independent tasks
 Web servers, databases, simulations, …
 High availability, scalable, affordable
 Problems
 Administration cost (prefer virtual machines)
 Low interconnect bandwidth
 cf. processor/memory bandwidth on an SMP
Sum Reduction (Again)
 Sum 100,000 numbers on 100 processors
 First distribute 1000 numbers to each
 Then do partial sums

sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];

 Reduction
 Half the processors send, the other half receive and add
 Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
 Given send() and receive() operations

limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2;  /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;       /* upper limit of senders */
until (half == 1);      /* exit with final sum */

 Send/receive also provide synchronization
 Assumes send/receive take similar time to addition
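
The same pattern written against a real message-passing library; a sketch using MPI, whose MPI_Reduce performs the log₂(P) combining tree internally (MPI itself is not named in the slides):

#include <stdio.h>
#include <mpi.h>   /* any MPI implementation, e.g. Open MPI or MPICH */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process sums its local 1000 numbers (stand-in data here)... */
    double local = 0.0;
    for (int i = 0; i < 1000; i++)
        local += 1.0;

    /* ...then the library combines the partial sums onto process 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %.0f\n", total);
    MPI_Finalize();
    return 0;
}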
Grid Computing
 Separate computers interconnected by long-haul networks
 E.g., Internet connections
 Work units farmed out, results sent back
 Can make use of idle time on PCs
 E.g., SETI@home, World Community Grid
Interconnection Networks
 Network topologies
 Arrangements of processors, switches, and links
[Figure: example topologies – bus, ring, 2D mesh, N-cube (N = 3), fully connected]
Multistage Networks
