Multithreaded Architectures

(Applied Parallel Programming)

Lecture 5: Performance
Considerations

©Wen-mei W. Hwu and David Kirk/NVIDIA,
ECE408/CS483/ECE498AL, University of Illinois, 2007-2016
Global Memory (DRAM) Bandwidth

(Figure: the ideal view of DRAM bandwidth versus the reality.)
DRAM Bank Organization
• Each core array has about O(1M) bits
• Each bit is stored in a tiny capacitor, made of one transistor
(Figure: the row address and row decoder select a row of the memory cell core array; sense amps and column latches feed a column mux, which connects the wide core array to the narrow off-chip data pins.)
A very small (8x2 bit) DRAM Bank

(Figure: the row decoder selects one row of the 8x2 core array; sense amps read it and a mux selects the output bit.)
DRAM core arrays are slow.
• Reading from a cell in the core array is a very slow process
  – DDR: core speed = ½ interface speed
  – DDR2/GDDR3: core speed = ¼ interface speed
  – DDR3/GDDR4: core speed = ⅛ interface speed
  – … likely to be worse in the future
(Figure: about 1000 cells are connected to each vertical bit line; each cell is a very small capacitance that stores one data bit and must drive the line to the sense amps.)
DRAM Bursting (burst size = 4 bits)

(Figure: one address decode reads a row through the sense amps; the mux delivers the first part of the burst.)
DRAM Bursting (cont.)

(Figure: the second part of the burst — the remaining bits of the buffered row are delivered through the mux without another core-array access.)
DRAM Bursting for the 8x4 Bank

(Timing diagram: non-burst timing vs. burst timing — address bits go to the decoder, then the core-array access delay, then 2 bits at a time to the pins.)

Modern DRAM systems are designed to be always accessed in burst mode. Burst bytes are transferred but discarded when accesses are not to sequential locations.
Multiple DRAM Banks

(Figure: two banks, Bank 0 and Bank 1, each with its own decoder, sense amps, and mux, sharing the data interface.)
DRAM Bursting for the 8x4 Bank

(Timing diagram: address bits to the decoder, core-array access delay, then 2 bits at a time to the pins.)
• Single-bank burst timing: dead time on the interface between bursts
• Multi-bank burst timing: reduced dead time
Placing a 2D C array into linear memory space (review)

M0,0 M0,1 M0,2 M0,3

M1,0 M1,1 M1,2 M1,3

M2,0 M2,1 M2,2 M2,3

M3,0 M3,1 M3,2 M3,3


M

M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

linearized order in increasing address
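
A minimal sketch of the indexing rule above (the helper name elem is ours, not from the slides):

__host__ __device__ inline float elem(const float* M, int row, int col, int Width)
{
    // Row-major: element (row, col) is stored at linear offset row*Width + col.
    return M[row * Width + col];    // e.g. (2, 1) with Width = 4 -> offset 2*4 + 1 = 9
}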


A Simple Matrix Multiplication Kernel (review)
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
// Calculate the row index of the P element and M
int Row = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x * blockDim.x + threadIdx.x;

if ((Row < Width) && (Col < Width)) {


float Pvalue = 0;

// each thread computes one element of the block sub-matrix


for (int k = 0; k < Width; ++k)
Pvalue += M[Row*Width+k] * N[k*Width+Col];

P[Row*Width+Col] = Pvalue;
}
}
Two Access Patterns

• M is accessed as M[Row*Width+k]: each thread walks along a row of d_M.
• N is accessed as N[k*Width+Col]: each thread walks down a column of d_N.
• k is the loop counter in the inner-product loop of the kernel code.
(Figure: (a) two threads traversing rows of d_M; (b) two threads traversing columns of d_N.)
N accesses are coalesced.

• The access direction in the kernel code runs down the columns of N: in load iteration 0, threads T0–T3 read N0,0–N0,3; in iteration 1 they read N1,0–N1,3.
• In every iteration, consecutive threads read consecutive locations of the linearized N array (N0,0 N0,1 N0,2 N0,3 N1,0 …), so the loads of a warp can be coalesced into a single DRAM burst.
M accesses are not coalesced.

• The access direction in the kernel code, M[Row*Width+k], runs along the rows of M: in load iteration 0, threads T0–T3 read M0,0, M1,0, M2,0, M3,0; in iteration 1 they read M0,1, M1,1, M2,1, M3,1.
• In every iteration, consecutive threads read locations that are Width elements apart in the linearized M array, so the loads cannot be coalesced into a single burst.
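
A minimal sketch of the two patterns (the kernel names coalescedRead and stridedRead are ours; each kernel sums one column or one row of a Width x Width row-major matrix):

__global__ void coalescedRead(const float* N, float* out, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive columns
    if (col >= Width) return;
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += N[k * Width + col];    // adjacent threads touch adjacent addresses: coalesced
    out[col] = sum;
}

__global__ void stridedRead(const float* M, float* out, int Width)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive rows
    if (row >= Width) return;
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += M[row * Width + k];    // adjacent threads are Width elements apart: not coalesced
    out[row] = sum;
}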
Use shared memory to enable coalescing in tiled matrix multiplication

• Original access pattern: each thread reads d_M along a row and d_N down a column directly from global memory.
• Tiled access pattern: copy a tile of d_M and a tile of d_N into scratchpad (shared) memory, then perform the multiplication with the scratchpad values. The copies into the scratchpad can be arranged so that all global memory loads are coalesced.
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width + Col] = Pvalue;
}
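
A hedged sketch of how this kernel might be launched from the host (d_M, d_N, d_P are assumed device pointers; error checking omitted; Width must be a multiple of TILE_WIDTH, as the kernel itself assumes):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);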
Constants

• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip
  – L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp
  – Extremely efficient way of accessing a value that is common to all threads in a block!
(Figure: SM datapath — instruction cache, multithreaded instruction buffer, register file, constant cache, shared memory, operand select, MAD and SFU units.)
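
A minimal sketch of using constant memory for such a broadcast value (the symbol name coeffs and its size are made up for illustration):

__constant__ float coeffs[16];          // lives in DRAM, cached on chip
// Host side: cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));

__global__ void scaleRows(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Every thread in the warp reads the same constant address, so the value
    // is broadcast from the constant cache in a single access.
    out[i] = in[i] * coeffs[blockIdx.x % 16];
}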
G80 Shared Memory

• Each SM has 16 KB of shared memory
  – 16 banks of 32-bit words
• CUDA uses shared memory as shared storage visible to all threads in a thread block
  – read and write access
(Figure: the same SM datapath, highlighting the per-SM shared memory.)
Parallel Memory Architecture

• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks (Bank 0 … Bank 15)
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  – Conflicting accesses are serialized
Bank Addressing Examples

• No bank conflicts: linear addressing, stride == 1 — thread i accesses bank i.
• No bank conflicts: a 1:1 permutation — each thread still accesses a distinct bank.
Bank Addressing Examples

• 2-way bank conflicts: linear addressing, stride == 2 — two threads of the half-warp map to each accessed bank.
• 8-way bank conflicts: linear addressing, stride == 8 — eight threads map to each accessed bank.
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock
cycle
• Successive 32-bit words are assigned to
successive banks
• G80 has 16 banks
– So bank = address mod 16
– Same as the size of a half-warp
• No bank conflicts between different half-warps, only within a
single half-warp

Shared memory bank conflicts
• Shared memory is as fast as registers if there are no
bank conflicts
• The fast case:
– If all threads of a half-warp access different banks, there is no
bank conflict
– If all threads of a half-warp access the identical address, there
is no bank conflict (broadcast)
• The slow case:
– Bank Conflict: multiple threads in the same half-warp access
the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank

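One common remedy when a half-warp must read down a column of a shared-memory tile is to pad the tile's leading dimension. A hedged sketch (the kernel name transposeTile is ours; it assumes a 16-bank, 32-bit-word layout as on G80 and width a multiple of TILE):

#define TILE 16

__global__ void transposeTile(const float* in, float* out, int width)
{
    // TILE+1 padding: element [r][c] sits in bank (r*(TILE+1)+c) % 16, so a
    // column read (fixed c, varying r) hits 16 different banks: no conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced global load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;             // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free
}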
Linear Addressing

• Given:
    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];
• This is only bank-conflict-free if s shares no common factors with the number of banks
  – 16 on G80, so s must be odd
(Figure: with s = 1, threads 0–15 of a half-warp map to banks 0–15 with no conflicts; with s = 3, the mapping is a permutation of the banks and is also conflict-free.)
Control Flow

Control Flow Instructions

• Main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized on G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more.
• A common case: avoid divergence when the branch condition is a function of the thread ID (both cases are sketched below)
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Threads in the same warp follow different paths; threads 0, 1, and 2 follow a different path from the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • But all threads in any given warp follow the same path
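A minimal sketch of the two cases above (the kernel names divergentScale and uniformScale are ours; WARP_SIZE = 32 is assumed):

#define WARP_SIZE 32

__global__ void divergentScale(float* d)
{
    // The condition varies within a warp: threads 0-2 take one path and the
    // rest take the other, so the warp executes both paths serially.
    if (threadIdx.x > 2) d[threadIdx.x] *= 2.0f;
    else                 d[threadIdx.x] = 0.0f;
}

__global__ void uniformScale(float* d)
{
    // The condition is constant across each warp (granularity of WARP_SIZE),
    // so all threads of any given warp take the same path: no divergence.
    if (threadIdx.x / WARP_SIZE > 2) d[threadIdx.x] *= 2.0f;
    else                             d[threadIdx.x] = 0.0f;
}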
Parallel Reduction

• Given an array of values, “reduce” them to a single value in parallel
• Examples
  – sum reduction: sum of all values in the array
  – max reduction: maximum of all values in the array
• Typical parallel implementation:
  – Recursively halve the number of threads, adding two values per thread
  – Takes log(n) steps for n elements, requires n/2 threads
A Vector Reduction Example

• Assume an in-place reduction using shared memory
  – The original vector is in device global memory
  – Shared memory is used to hold a partial sum vector
  – Each iteration brings the partial sum vector closer to the final sum
  – The final solution will be in element 0
A simple implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
     stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)
    partialSum[t] += partialSum[t+stride];
}
Vector Reduction with Branch Divergence

(Figure: array elements 0–11 across the top; in iteration 1, threads 0, 2, 4, 6, 8, 10 compute 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; in iteration 2 the survivors compute 0…3, 4…7, 8…11; in iteration 3, 0…7 and 8…15. The active threads are spread across the array, so warps diverge at every step.)
Some Observations

• In each iteration, two control flow paths will be sequentially traversed for each warp
  – Threads that perform addition and threads that do not
  – Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence
• No more than half of the threads will be executing at any time
  – All odd-index threads do no addition right from the beginning!
  – On average, fewer than ¼ of the threads will be active across all warps over time.
  – After the 5th iteration, entire warps in each block will do no addition: poor resource utilization, but no divergence in the warps that do no addition.
  – This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated per active warp, until all warps retire.
Shortcomings of the implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = 1;
     stride < blockDim.x; stride *= 2)
{
  __syncthreads();
  if (t % (2*stride) == 0)      // BAD: divergence due to interleaved branch decisions
    partialSum[t] += partialSum[t+stride];
}
A better implementation

• Assume we have already loaded the array into
  __shared__ float partialSum[]

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1;
     stride > 0; stride >>= 1)
{
  __syncthreads();
  if (t < stride)
    partialSum[t] += partialSum[t+stride];
}
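Putting the loop above into a complete kernel, a hedged sketch (the kernel name blockSum is ours; blockDim.x is assumed to be a power of two, and the per-block results still need to be combined on the host or in a second launch):

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    extern __shared__ float partialSum[];          // size = blockDim.x floats
    unsigned int t     = threadIdx.x;
    unsigned int start = blockIdx.x * blockDim.x;

    partialSum[t] = (start + t < n) ? in[start + t] : 0.0f;

    for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) blockSums[blockIdx.x] = partialSum[0];
}

// Launch: blockSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_sums, n);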
No Divergence until < 16 sub-sums

(Figure: for a 32-element section, iteration 1 uses threads 0–15 to compute 0+16, 1+17, …, 15+31. The active threads are contiguous, so warps do not diverge until fewer than 16 partial sums remain.)
Registers, ILP and Instruction Mix

Programmer View of Register File

• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block only accesses registers assigned to itself
(Figure: the same register file partitioned across 4 blocks vs. 3 blocks.)
Resource Allocation Example

• If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each block requires 10*256 = 2560 registers
  – 8192 = 3 * 2560 + change
  – So, three blocks can run on an SM as far as registers are concerned
• We want to optimize the performance of the computation done by each thread (the kernel code). The optimized kernel code increases the use of registers by 1.
  – Each block now requires 11*256 = 2816 registers
  – 8192 < 2816 * 3
  – Only two blocks can run on an SM: a 1/3 reduction of parallelism!
Resource Allocation Example (cont.)

(Figure: (a) pre-“optimization”, three thread blocks TB0–TB2 fit in each SM's 32 KB register file, alongside 16 KB of shared memory; (b) post-“optimization”, there are insufficient registers to allocate 3 blocks, so only two fit.)

Increase in per-thread performance, but fewer threads: lower overall performance in this case.
More on Dynamic Partitioning

• Dynamic partitioning gives more flexibility to compilers/programmers
  – One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
    • This allows for finer-grain threading than traditional CPU threading models.
  – The compiler can trade off between instruction-level parallelism and thread-level parallelism
ILP vs. TLP Example

• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads have a 200-cycle latency
  – 3 blocks can run on each SM
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  – Only two blocks can run on each SM
  – However, one only needs 200/(8*4) ≈ 7 warps to tolerate the memory latency
  – Two blocks have 16 warps, so performance can actually be higher!
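
A worked version of the arithmetic above (our reading: the factor of 4 in 200/(8*4) is the 4 cycles G80 takes to issue one instruction for a whole warp):

\[
\text{warps needed} = \left\lceil \frac{200\ \text{cycles}}{8\ \text{independent instructions} \times 4\ \text{cycles per warp instruction}} \right\rceil = \lceil 6.25 \rceil = 7
\]

Two 256-thread blocks provide 2 x 256/32 = 16 warps, which is more than 7, so the load latency is still hidden even with one fewer resident block.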
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;

  // Loop over the Md and Nd tiles required to compute the Pd element
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {

    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row*Width + Col] = Pvalue;
}
Prefetching

Without prefetching:
  Loop {
    Load current tile to shared memory
    __syncthreads()
    Compute current tile
    __syncthreads()
  }

With prefetching:
  Load first tile from global memory into registers
  Loop {
    Deposit current tile from registers to shared memory
    __syncthreads()
    Load next tile from global memory into registers
    Compute current tile
    __syncthreads()
  }
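
A hedged sketch of the prefetching pattern applied to the tiled kernel shown earlier (same assumptions as that kernel: square matrices with Width a multiple of TILE_WIDTH; the kernel name MatrixMulPrefetch is ours):

__global__ void MatrixMulPrefetch(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = blockIdx.y * TILE_WIDTH + ty;
  int Col = blockIdx.x * TILE_WIDTH + tx;

  // Prefetch the first tile from global memory into registers.
  float mReg = Md[Row * Width + tx];
  float nReg = Nd[ty * Width + Col];

  float Pvalue = 0;
  for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // Deposit the prefetched tile from registers into shared memory.
    Mds[ty][tx] = mReg;
    Nds[ty][tx] = nReg;
    __syncthreads();

    // Prefetch the next tile while the current one is being consumed.
    if (m + 1 < Width / TILE_WIDTH) {
      mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
      nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
    }

    // Compute the current tile.
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
  }
  Pd[Row * Width + Col] = Pvalue;
}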
Prefetch

• Deposit blue tile from registers into shared memory
• __syncthreads()
• Load orange tile into registers
• Compute blue tile
• Deposit orange tile into shared memory
• …
(Figure: the usual tiled matrix multiplication picture — Md, Nd, and Pd of size WIDTH x WIDTH, with a TILE_WIDTH x TILE_WIDTH Pdsub block indexed by bx, by, tx, ty; the blue and orange tiles are successive tiles along the shared dimension.)
Instruction Mix Considerations

for (int k = 0; k < TILE_WIDTH; ++k)
  Pvalue += Mds[ty][k] * Nds[k][tx];

• There is only one multiply-add per iteration, amid the branch and address-calculation instructions.
• Loop unrolling can help:

Pvalue += Mds[ty][0] * Nds[0][tx] + …
        + Mds[ty][15] * Nds[15][tx];
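
The same effect can often be obtained by asking the compiler to do the unrolling; a minimal sketch using nvcc's #pragma unroll on the inner loop above:

#pragma unroll
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];
// With TILE_WIDTH known at compile time, the compiler can emit TILE_WIDTH
// independent multiply-adds with constant indices, removing the per-iteration
// branch and index-update instructions.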
Unrolling

(b) Tiled Version:

Ctemp = 0;
for (...) {
  __shared__ float As[16][16];
  __shared__ float Bs[16][16];

  // load input tile elements
  As[ty][tx] = A[indexA];
  Bs[ty][tx] = B[indexB];
  indexA += 16;
  indexB += 16 * widthB;
  __syncthreads();

  // compute results for tile
  for (i = 0; i < 16; i++) {
    Ctemp += As[ty][i] * Bs[i][tx];
  }
  __syncthreads();
}
C[indexC] = Ctemp;

(c) Unrolled Version:

Ctemp = 0;
for (...) {
  __shared__ float As[16][16];
  __shared__ float Bs[16][16];

  // load input tile elements
  As[ty][tx] = A[indexA];
  Bs[ty][tx] = B[indexB];
  indexA += 16;
  indexB += 16 * widthB;
  __syncthreads();

  // compute results for tile
  Ctemp += As[ty][0] * Bs[0][tx];
  ...
  Ctemp += As[ty][15] * Bs[15][tx];
  __syncthreads();
}
C[indexC] = Ctemp;

Removal of branch instructions and address calculations. Does the unrolled version use more registers?
Major Performance Detractors
• Long-latency operations
– Avoid stalls by executing other threads
• Stalls and bubbles in the pipeline
– Barrier synchronization
– Branch divergence
• Shared resource saturation
– Global memory bandwidth
– Shared memory capacity
