Parallel Processors from Client to Cloud: Computer Organization and Design
Amdahl's Law: speedup of 80 on 100 processors (FracX = parallelizable fraction)
80 = 1 / [FracX/100 + (1 - FracX)]
0.8 × FracX + 80 × (1 - FracX) = 80 - 79.2 × FracX = 1
FracX = (80 - 1)/79.2 = 0.9975
Only 0.25% sequential!
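The arithmetic above can be checked with a short sketch. The function names `parallel_fraction_needed` and `amdahl_speedup` are illustrative, not from the slides; the formula is the one on the slide, rearranged to solve for FracX.

```python
def amdahl_speedup(frac_parallel, processors):
    """Speedup = 1 / (f/p + (1 - f)), the form used on the slide."""
    return 1 / (frac_parallel / processors + (1 - frac_parallel))


def parallel_fraction_needed(speedup, processors):
    """Solve the same equation for f: f = (1 - 1/S) / (1 - 1/p)."""
    return (1 - 1 / speedup) / (1 - 1 / processors)


frac_x = parallel_fraction_needed(80, 100)
print(f"parallel fraction:   {frac_x:.4f}")
print(f"sequential fraction: {1 - frac_x:.2%}")
```

Plugging the result back into `amdahl_speedup(frac_x, 100)` returns the target speedup of 80, which is a quick way to confirm the rearrangement.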
Strong vs Weak Scaling
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to number of processors
10 processors, 10 × 10 matrix
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
100 processors, 32 × 32 matrix (≈ 1000 elements)
Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
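A minimal sketch of the timing model behind these numbers. It assumes, based on the formula on the slide, that the 10 × tadd term is sequential work and that the matrix additions divide evenly across processors; the function name `time_in_tadd` is hypothetical.

```python
def time_in_tadd(sequential_adds, matrix_elements, processors):
    """Execution time in units of t_add: sequential adds plus an even share of matrix adds."""
    return sequential_adds + matrix_elements / processors


# Weak scaling: the matrix grows with the processor count.
weak_10 = time_in_tadd(10, 10 * 10, 10)     # 10 x 10 matrix on 10 processors
weak_100 = time_in_tadd(10, 1000, 100)      # 32 x 32 matrix, rounded to 1000 elements

# Strong scaling for comparison: the 10 x 10 matrix stays fixed.
strong_100 = time_in_tadd(10, 10 * 10, 100)

print(weak_10, weak_100, strong_100)
```

Under weak scaling both configurations take the same time, while under strong scaling the fixed sequential part dominates as processors are added.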
Instruction and Data Streams
An alternate classification

                        Data Streams
                        Single                    Multiple
Instruction  Single     SISD: Intel Pentium 4     SIMD: SSE instructions of x86
Streams      Multiple   MISD: No examples today   MIMD: Intel Xeon e5345
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
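The loop above can be traced with a sequential simulation. This is a sketch, not a parallel implementation: the function name `tree_reduce` is an assumption, the inner `for` loop stands in for all processors with Pn < half acting at once, and `synch()` is implicit because each step finishes before the next begins. Like the pseudocode, it assumes at least two processors.

```python
def tree_reduce(partial_sums):
    """Simulate the shared-memory repeat/until reduction in lock-step."""
    total = list(partial_sums)   # the shared sum[] array, one entry per processor
    half = len(total)
    while True:
        # synch() would go here; the sequential simulation is already in step.
        if half % 2 != 0:
            total[0] += total[half - 1]      # processor 0 picks up the odd element
        half //= 2                           # dividing line on who sums
        for pn in range(half):               # every processor with Pn < half
            total[pn] += total[pn + half]
        if half == 1:
            break
    return total[0]


result_100 = tree_reduce(range(100))         # 100 -> 50 -> 25 -> 12 -> 6 -> 3 -> 1
result_7 = tree_reduce([1, 2, 3, 4, 5, 6, 7])
print(result_100, result_7)
```

Starting from 100 partial sums, `half` passes through two odd values (25 and 3), which is where the conditional add by processor 0 matters.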
§6.6 Introduction to Graphics Processing Units
History of GPUs
Early video cards
Frame buffer memory with address generation for video output
3D graphics processing
Originally high-end computers (e.g., SGI)
Moore’s Law ⇒ lower cost, higher density
3D graphics cards for PCs and game consoles
Graphics Processing Units
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping, rasterization
Graphics in the System
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)
Example: NVIDIA Tesla
Streaming multiprocessor: 8 × streaming processors
Example: NVIDIA Tesla
Streaming Processors
Single-precision FP and integer units
Each SP is fine-grained multithreaded
Warp: group of 32 threads
Executed in parallel, SIMD style
8 SPs × 4 clock cycles
Hardware contexts for 24 warps
Registers, PCs, …
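The figures on this slide fit together arithmetically. The sketch below reads "8 SPs × 4 clock cycles" as a 32-thread warp being issued across the 8 SPs over 4 cycles, which is an interpretation of the slide; the constant names are illustrative.

```python
WARP_SIZE = 32        # threads per warp
SPS_PER_SM = 8        # streaming processors per streaming multiprocessor
WARP_CONTEXTS = 24    # warps with hardware state (registers, PCs) per multiprocessor

# One warp instruction spread over the available SPs.
cycles_per_warp_instruction = WARP_SIZE // SPS_PER_SM

# Threads whose state one multiprocessor keeps resident at once.
resident_threads = WARP_SIZE * WARP_CONTEXTS

print(cycles_per_warp_instruction, resident_threads)
```

Keeping many warps resident is what lets the hardware switch to another warp while one waits on memory.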
Classifying GPUs
Don’t fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an illusion of MIMD
But with performance degradation
Need to write general purpose code with care
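The cost of conditional execution can be illustrated with a simplified model of masked SIMD execution. This is an assumption-laden sketch, not a description of the actual Tesla hardware: `simd_if_else` is a hypothetical helper that runs each branch path as a separate pass over all lanes and masks off the lanes that did not take it.

```python
def simd_if_else(values, cond, then_op, else_op):
    """Apply an if/else per lane; return the results and the number of passes issued."""
    mask = [cond(v) for v in values]
    result = list(values)
    passes = 0
    if any(mask):                 # at least one lane takes the 'then' path
        passes += 1
        result = [then_op(v) if m else r for v, m, r in zip(values, mask, result)]
    if not all(mask):             # at least one lane takes the 'else' path
        passes += 1
        result = [else_op(v) if not m else r for v, m, r in zip(values, mask, result)]
    return result, passes


lanes = list(range(8))
divergent, divergent_passes = simd_if_else(
    lanes, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 100)
uniform, uniform_passes = simd_if_else(
    lanes, lambda v: True, lambda v: v * 2, lambda v: v + 100)
print(divergent, divergent_passes)
print(uniform, uniform_passes)
```

When lanes disagree on the condition, both paths are issued and each lane sits idle for one of them; when all lanes agree, only one pass is needed.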
GPU Memory Structures
§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
Each processor has private physical address space
Hardware sends/receives messages between processors
Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
Administration cost (prefer virtual machines)
Low interconnect bandwidth
cf. processor/memory bandwidth on an SMP
Sum Reduction (Again)
Sum 100,000 numbers on 100 processors
First distribute 1000 numbers to each
Then do partial sums
  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];
Reduction
Half the processors send, other half receive and add
Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
Given send() and receive() operations
limit = 100; half = 100; /* 100 processors */
repeat
  half = (half+1)/2; /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
Send/receive also provide synchronization
Assumes send/receive take similar time to addition
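The send/receive loop can be traced with a sequential simulation. This is a sketch under stated assumptions: `message_passing_reduce` is a hypothetical name, and a per-round dictionary `inbox` stands in for the network, so `send()` and `receive()` become a write and a read of that dictionary rather than real messages.

```python
def message_passing_reduce(local_sums):
    """Simulate the repeat/until loop; each list entry is one processor's private sum."""
    sums = list(local_sums)
    limit = half = len(sums)
    while True:
        half = (half + 1) // 2                 # send vs. receive dividing line
        inbox = {}
        for pn in range(half, limit):          # senders: half <= Pn < limit
            inbox[pn - half] = sums[pn]        # send(Pn - half, sum)
        for pn in range(limit // 2):           # receivers: Pn < limit/2
            sums[pn] += inbox[pn]              # sum = sum + receive()
        limit = half                           # upper limit of senders
        if half == 1:
            break
    return sums[0]


# 100,000 numbers, 1000 per processor, as in the slide.
numbers = list(range(100_000))
partials = [sum(numbers[p * 1000:(p + 1) * 1000]) for p in range(100)]
total = message_passing_reduce(partials)
print(total == sum(numbers))
```

With 100 processors, `half` takes the values 50, 25, 13, 7, 4, 2, 1. In rounds where `limit` is odd, the processor just below the dividing line neither sends nor receives, which is why the receive test uses `limit/2` rather than `half`.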
Grid Computing
Separate computers interconnected by long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., SETI@home, World Community Grid
Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
Bus
Ring
N-cube (N = 3)
2D Mesh
Fully connected
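One way to compare these topologies is to count the links each needs. The sketch below is illustrative only: the function names are hypothetical, links are counted as bidirectional, the 2D mesh is assumed to have no wraparound connections, and the bus is omitted because it is a single shared medium rather than a set of point-to-point links.

```python
def ring_links(nodes):
    """Each node connects to its successor, closing the loop."""
    return nodes


def n_cube_links(n):
    """An n-cube has 2**n nodes, each with n neighbours; every link is shared by two nodes."""
    return n * 2 ** (n - 1)


def mesh_2d_links(side):
    """side x side grid without wraparound: side rows and side columns of (side - 1) links."""
    return 2 * side * (side - 1)


def fully_connected_links(nodes):
    """One link per pair of nodes."""
    return nodes * (nodes - 1) // 2


print(ring_links(8), n_cube_links(3), mesh_2d_links(4), fully_connected_links(8))
```

For 8 nodes, the ring needs 8 links, the 3-cube 12, and the fully connected network 28, which shows how quickly cost grows as connectivity increases.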
Multistage Networks