Multithreaded Architectures

(Applied Parallel Programming)

Lecture 5: Performance
Considerations

©Wen-mei W. Hwu and David Kirk/NVIDIA,
ECE408/CS483/ECE498AL, University of Illinois, 2007-2016
Global Memory (DRAM) Bandwidth

(Figure: the ideal view of DRAM bandwidth versus the reality.)
DRAM Bank Organization
• Each core array has about O(1M) bits
• Each bit is stored in a tiny capacitor, made of one transistor
(Figure: the row address and row decoder select a row of the memory cell core array; sense amps and column latches feed a column mux, which connects the wide core array to the narrow off-chip data pins.)
A very small (8x2 bit) DRAM Bank

(Figure: the row decoder selects one row of the 8x2 core array; sense amps read it and a mux selects the output bit.)
DRAM core arrays are slow.
• Reading from a cell in the core array is a very slow process
  – DDR: core speed = ½ interface speed
  – DDR2/GDDR3: core speed = ¼ interface speed
  – DDR3/GDDR4: core speed = ⅛ interface speed
  – … likely to be worse in the future
(Figure: about 1000 cells are connected to each vertical bit line; each cell is a very small capacitance that stores one data bit and must drive the line to the sense amps.)
DRAM Bursting (burst size = 4 bits)

(Figure: one address decode reads a row through the sense amps; the mux delivers the first part of the burst.)
DRAM Bursting (cont.)

(Figure: the second part of the burst — the remaining bits of the buffered row are delivered through the mux without another core-array access.)
DRAM Bursting for the 8x4 Bank

(Timing diagram: non-burst timing vs. burst timing — address bits go to the decoder, then the core-array access delay, then 2 bits at a time to the pins.)

Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
Multiple DRAM Banks

(Figure: two banks, Bank 0 and Bank 1, each with its own decoder, sense amps, and mux, sharing the data interface.)
DRAM Bursting for the 8x4 Bank

(Timing diagram: address bits to the decoder, core-array access delay, then 2 bits at a time to the pins.)
• Single-bank burst timing: dead time on the interface between bursts
• Multi-bank burst timing: reduced dead time
Placing a 2D C array into linear memory space (review)

M0,0 M0,1 M0,2 M0,3

M1,0 M1,1 M1,2 M1,3

M2,0 M2,1 M2,2 M2,3

M3,0 M3,1 M3,2 M3,3


M

M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

linearized order in increasing address
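
A minimal sketch of the indexing rule above (the helper name elem is ours, not from the slides):

__host__ __device__ inline float elem(const float* M, int row, int col, int Width)
{
    // Row-major: element (row, col) is stored at linear offset row*Width + col.
    return M[row * Width + col];    // e.g. (2, 1) with Width = 4 -> offset 2*4 + 1 = 9
}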


A Simple Matrix Multiplication Kernel (review)
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
// Calculate the row index of the P element and M
int Row = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x * blockDim.x + threadIdx.x;

if ((Row < Width) && (Col < Width)) {


float Pvalue = 0;

// each thread computes one element of the block sub-matrix


for (int k = 0; k < Width; ++k)
Pvalue += M[Row*Width+k] * N[k*Width+Col];

P[Row*Width+Col] = Pvalue;
}
}
Two Access Patterns

• M is accessed as M[Row*Width+k]: each thread walks along a row of d_M.
• N is accessed as N[k*Width+Col]: each thread walks down a column of d_N.
• k is the loop counter in the inner-product loop of the kernel code.
(Figure: (a) two threads traversing rows of d_M; (b) two threads traversing columns of d_N.)
N accesses are coalesced.

• The access direction in the kernel code runs down the columns of N: in load iteration 0, threads T0–T3 read N0,0–N0,3; in iteration 1 they read N1,0–N1,3.
• In every iteration, consecutive threads read consecutive locations of the linearized N array (N0,0 N0,1 N0,2 N0,3 N1,0 …), so the loads of a warp can be coalesced into a single DRAM burst.
M accesses are not coalesced.

• The access direction in the kernel code, M[Row*Width+k], runs along the rows of M: in load iteration 0, threads T0–T3 read M0,0, M1,0, M2,0, M3,0; in iteration 1 they read M0,1, M1,1, M2,1, M3,1.
• In every iteration, consecutive threads read locations that are Width elements apart in the linearized M array, so the loads cannot be coalesced into a single burst.
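
A minimal sketch of the two patterns (the kernel names coalescedRead and stridedRead are ours; each kernel sums one column or one row of a Width x Width row-major matrix):

__global__ void coalescedRead(const float* N, float* out, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive columns
    if (col >= Width) return;
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += N[k * Width + col];    // adjacent threads touch adjacent addresses: coalesced
    out[col] = sum;
}

__global__ void stridedRead(const float* M, float* out, int Width)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive rows
    if (row >= Width) return;
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += M[row * Width + k];    // adjacent threads are Width elements apart: not coalesced
    out[row] = sum;
}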
Use shared memory to enable coalescing in tiled matrix multiplication

• Original access pattern: each thread reads d_M along a row and d_N down a column directly from global memory.
• Tiled access pattern: copy a tile of d_M and a tile of d_N into scratchpad (shared) memory, then perform the multiplication with the scratchpad values. The copies into the scratchpad can be arranged so that all global memory loads are coalesced.
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width + Col] = Pvalue;
}
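
A hedged sketch of how this kernel might be launched from the host (d_M, d_N, d_P are assumed device pointers; error checking omitted; Width must be a multiple of TILE_WIDTH, as the kernel itself assumes):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);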
Constants

• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip
  – L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp
  – Extremely efficient way of accessing a value that is common to all threads in a block!
(Figure: SM datapath — instruction cache, multithreaded instruction buffer, register file, constant cache, shared memory, operand select, MAD and SFU units.)
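
A minimal sketch of using constant memory for such a broadcast value (the symbol name coeffs and its size are made up for illustration):

__constant__ float coeffs[16];          // lives in DRAM, cached on chip
// Host side: cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));

__global__ void scaleRows(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Every thread in the warp reads the same constant address, so the value
    // is broadcast from the constant cache in a single access.
    out[i] = in[i] * coeffs[blockIdx.x % 16];
}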
G80 Shared Memory

• Each SM has 16 KB of shared memory
  – 16 banks of 32-bit words
• CUDA uses shared memory as shared storage visible to all threads in a thread block
  – read and write access
(Figure: the same SM datapath, highlighting the per-SM shared memory.)
Parallel Memory Architecture

• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks (Bank 0 … Bank 15)
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  – Conflicting accesses are serialized
Bank Addressing Examples

• No bank conflicts: linear addressing, stride == 1 — thread i accesses bank i.
• No bank conflicts: a 1:1 permutation — each thread still accesses a distinct bank.
Bank Addressing Examples

• 2-way bank conflicts: linear addressing, stride == 2 — two threads of the half-warp map to each accessed bank.
• 8-way bank conflicts: linear addressing, stride == 8 — eight threads map to each accessed bank.
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock
cycle
• Successive 32-bit words are assigned to
successive banks
• G80 has 16 banks
– So bank = address mod 16
– Same as the size of a half-warp
• No bank conflicts between different half-warps, only within a
single half-warp

Shared memory bank conflicts
• Shared memory is as fast as registers if there are no
bank conflicts
• The fast case:
– If all threads of a half-warp access different banks, there is no
bank conflict
– If all threads of a half-warp access the identical address, there
is no bank conflict (broadcast)
• The slow case:
– Bank Conflict: multiple threads in the same half-warp access
the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank

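One common remedy when a half-warp must read down a column of a shared-memory tile is to pad the tile's leading dimension. A hedged sketch (the kernel name transposeTile is ours; it assumes a 16-bank, 32-bit-word layout as on G80 and width a multiple of TILE):

#define TILE 16

__global__ void transposeTile(const float* in, float* out, int width)
{
    // TILE+1 padding: element [r][c] sits in bank (r*(TILE+1)+c) % 16, so a
    // column read (fixed c, varying r) hits 16 different banks: no conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;             // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free
}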
Linear Addressing

• Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
• This is only bank-conflict-free if s shares no common factors with the number of banks
  – 16 on G80, so s must be odd
(Figure: with s = 1, threads 0–15 of a half-warp map to banks 0–15 with no conflicts; with s = 3, the mapping is a permutation of the banks and is also conflict-free.)
Control Flow

Control Flow Instructions

• Main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized on G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more.
• A common case: avoid divergence when the branch condition is a function of the thread ID (both cases are sketched below)
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Threads in the same warp follow different paths; threads 0, 1, and 2 follow a different path from the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • But all threads in any given warp follow the same path
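A minimal sketch of the two cases above (the kernel names divergentScale and uniformScale are ours; WARP_SIZE = 32 is assumed):

#define WARP_SIZE 32

__global__ void divergentScale(float* d)
{
    // The condition varies within a warp: threads 0-2 take one path and the
    // rest take the other, so the warp executes both paths serially.
    if (threadIdx.x > 2) d[threadIdx.x] *= 2.0f;
    else                 d[threadIdx.x] = 0.0f;
}

__global__ void uniformScale(float* d)
{
    // The condition is constant across each warp (granularity of WARP_SIZE),
    // so all threads of any given warp take the same path: no divergence.
    if (threadIdx.x / WARP_SIZE > 2) d[threadIdx.x] *= 2.0f;
    else                             d[threadIdx.x] = 0.0f;
}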
Parallel Reduction

• Given an array of values, “reduce” them to a single value in parallel
• Examples
  – sum reduction: sum of all values in the array
  – max reduction: maximum of all values in the array
• Typical parallel implementation:
  – Recursively halve the number of threads, adding two values per thread
  – Takes log(n) steps for n elements, requires n/2 threads
A Vector Reduction Example

• Assume an in-place reduction using shared memory
  – The original vector is in device global memory
  – Shared memory is used to hold a partial sum vector
  – Each iteration brings the partial sum vector closer to the final sum
  – The final solution will be in element 0
A simple implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
     stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}
Vector Reduction with Branch Divergence

(Figure: array elements 0–11 across the top; in iteration 1, threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; in iteration 2 the survivors compute 0…3, 4…7, 8…11; in iteration 3, 0…7 and 8…15. The active threads are spread across the array, so warps diverge at every step.)
Some Observations

• In each iteration, two control flow paths will be sequentially traversed for each warp
  – Threads that perform addition and threads that do not
  – Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence
• No more than half of the threads will be executing at any time
  – All odd-index threads do no addition right from the beginning!
  – On average, fewer than ¼ of the threads will be active across all warps over time.
  – After the 5th iteration, entire warps in each block will do no addition: poor resource utilization, but no divergence in the warps that do no addition.
  – This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated per active warp, until all warps retire.
Shortcomings of the implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
     stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)      // BAD: divergence due to interleaved branch decisions
    partialSum[t] += partialSum[t+stride];
}
A better implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1;
     stride > 0; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}
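Putting the loop above into a complete kernel, a hedged sketch (the kernel name blockSum is ours; blockDim.x is assumed to be a power of two, and the per-block results still need to be combined on the host or in a second launch):

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    extern __shared__ float partialSum[];          // size = blockDim.x floats
    unsigned int t     = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x;

    partialSum[t] = (start + t < n) ? in[start + t] : 0.0f;

    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) blockSums[blockIdx.x] = partialSum[0];
}

// Launch: blockSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_sums, n);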
No Divergence until < 16 sub-sums

(Figure: for a 32-element section, iteration 1 uses threads 0–15 to compute 0+16, 1+17, …, 15+31. The active threads are contiguous, so warps do not diverge until fewer than 16 partial sums remain.)
Registers, ILP and Instruction Mix

Programmer View of Register File

• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block only accesses registers assigned to itself
(Figure: the same register file partitioned across 4 blocks vs. 3 blocks.)
Resource Allocation Example

• If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10*256 = 2560 registers
  – 8192 = 3 * 2560 + change
  – So, three blocks can run on an SM as far as registers are concerned
• We want to optimize the performance of the computation done by each thread (the kernel code). The optimized kernel code increases the use of registers by 1.
  – Each block now requires 11*256 = 2816 registers
  – 8192 < 2816 * 3
  – Only two blocks can run on an SM: a 1/3 reduction of parallelism!
Resource Allocation Example (cont.)

(Figure: (a) pre-“optimization”, three thread blocks TB0–TB2 fit in each SM's 32 KB register file, alongside 16 KB of shared memory; (b) post-“optimization”, there are insufficient registers to allocate 3 blocks, so only two fit.)

Increase in per-thread performance, but fewer threads: lower overall performance in this case.
More on Dynamic Partitioning

• Dynamic partitioning gives more flexibility to compilers/programmers
  – One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
    • This allows for finer-grain threading than traditional CPU threading models.
  – The compiler can trade off between instruction-level parallelism and thread-level parallelism
ILP vs. TLP Example

• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads have a 200-cycle latency
  – 3 blocks can run on each SM
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  – Only two blocks can run on each SM
  – However, one only needs 200/(8*4) ≈ 7 warps to tolerate the memory latency
  – Two blocks have 16 warps, so performance can actually be higher!
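
A worked version of the arithmetic above (our reading: the factor of 4 in 200/(8*4) is the 4 cycles G80 takes to issue one instruction for a whole warp):

\[
\text{warps needed} = \left\lceil \frac{200\ \text{cycles}}{8\ \text{independent instructions} \times 4\ \text{cycles per warp instruction}} \right\rceil = \lceil 6.25 \rceil = 7
\]

Two 256-thread blocks provide 2 x 256/32 = 16 warps, which is more than 7, so the load latency is still hidden even with one fewer resident block.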
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width + Col] = Pvalue;
}
Prefetching

Without prefetching:
  Loop {
    Load current tile to shared memory
    __syncthreads()
    Compute current tile
    __syncthreads()
  }

With prefetching:
  Load first tile from global memory into registers
  Loop {
    Deposit current tile from registers to shared memory
    __syncthreads()
    Load next tile from global memory into registers
    Compute current tile
    __syncthreads()
  }
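
A hedged sketch of the prefetching pattern applied to the tiled kernel shown earlier (same assumptions as that kernel: square matrices with Width a multiple of TILE_WIDTH; the kernel name MatrixMulPrefetch is ours):

__global__ void MatrixMulPrefetch(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = blockIdx.y * TILE_WIDTH + ty;
  int Col = blockIdx.x * TILE_WIDTH + tx;

  // Prefetch the first tile from global memory into registers.
  float mReg = Md[Row * Width + tx];
  float nReg = Nd[ty * Width + Col];

  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // Deposit the prefetched tile from registers into shared memory.
    Mds[ty][tx] = mReg;
    Nds[ty][tx] = nReg;
    __syncthreads();

    // Prefetch the next tile while the current one is being consumed.
    if (m + 1 < Width / TILE_WIDTH) {
      mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
      nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
    }

    // Compute the current tile.
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row * Width + Col] = Pvalue;
}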
Prefetch

• Deposit blue tile from registers into shared memory
• __syncthreads()
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …
(Figure: the usual tiled matrix multiplication picture — Md, Nd, and Pd of size WIDTH x WIDTH, with a TILE_WIDTH x TILE_WIDTH Pdsub block indexed by bx, by, tx, ty; the blue and orange tiles are successive tiles along the shared dimension.)
Instruction Mix Considerations

for (int k = 0; k < TILE_WIDTH; ++k)
  Pvalue += Mds[ty][k] * Nds[k][tx];

• There is only one multiply-add per iteration, amid the branch and address-calculation instructions.
• Loop unrolling can help:

Pvalue += Mds[ty][0] * Nds[0][tx] + …
        + Mds[ty][15] * Nds[15][tx];
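
The same effect can often be obtained by asking the compiler to do the unrolling; a minimal sketch using nvcc's #pragma unroll on the inner loop above:

#pragma unroll
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];
// With TILE_WIDTH known at compile time, the compiler can emit TILE_WIDTH
// independent multiply-adds with constant indices, removing the per-iteration
// branch and index-update instructions.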
Unrolling

(b) Tiled Version:

Ctemp = 0;
for (...) {
  __shared__ float As[16][16];
  __shared__ float Bs[16][16];

  // load input tile elements
  As[ty][tx] = A[indexA];
  Bs[ty][tx] = B[indexB];
  indexA += 16;
  indexB += 16 * widthB;
  __syncthreads();

  // compute results for tile
  for (i = 0; i < 16; i++) {
    Ctemp += As[ty][i] * Bs[i][tx];
  }
  __syncthreads();
}
C[indexC] = Ctemp;

(c) Unrolled Version:

Ctemp = 0;
for (...) {
  __shared__ float As[16][16];
  __shared__ float Bs[16][16];

  // load input tile elements
  As[ty][tx] = A[indexA];
  Bs[ty][tx] = B[indexB];
  indexA += 16;
  indexB += 16 * widthB;
  __syncthreads();

  // compute results for tile
  Ctemp += As[ty][0] * Bs[0][tx];
  ...
  Ctemp += As[ty][15] * Bs[15][tx];
  __syncthreads();
}
C[indexC] = Ctemp;

Removal of branch instructions and address calculations. Does the unrolled version use more registers?
Major Performance Detractors
• Long-latency operations
– Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
– Barrier synchronization
– Branch divergence
• Shared resource saturation
– Global memory bandwidth
– Shared memory capacity
