Singapore p1

Introduction to GPU
Programming
Volodymyr (Vlad) Kindratenko

Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing
Applications and Technologies (IACAT)
Tutorial Goals
• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU
application development flow
• Be able to write and run simple NVIDIA GPU
kernels in CUDA
• Be aware of performance limiting factors and
understand performance tuning strategies
2
Schedule
• 9:00-11:00 part I (introduction)
• 11:00-11:15 break
• 11:15-13:00 part II (examples)
• 13:00-14:00 lunch
• 14:00-16:00 part III (advanced topics)
• 16:00-16:15 break
• 16:15-18:00 part IV (lab time)
3
Part I
• Introduction
• Hands-on: getting started with NCSA GPU
cluster
• Anatomy of a GPU application
• Break
4
Introduction
• Why use Graphics Processing Units (GPUs) for
general-purpose computing
• Modern GPU architecture
– NVIDIA
• GPU programming overview
– Libraries
– CUDA C
– OpenCL
– PGI x64+GPU
5
Why GPUs?
1200
Raw Performance Trends
GTX 285
1000
GTX 280
800
GFLOPS
8800 Ultra
600
8800 GTX
NVIDIA GPU
Intel CPU
400
7900 GTX
7800 GTX
200
6800 Ultra Intel Xeon Quad-core 3 GHz
5950 Ultra
5800
0
9/22/02 2/4/04 6/18/05 10/31/06 3/14/08
6
Graph is courtesy of NVIDIA
Why GPUs?
180 Memory Bandwidth Trends
GTX 285
160
GTX 280
140
120
8800 Ultra
GByte/s
100
8800 GTX
80
60
7800 GTX 7900 GTX
40 5950 Ultra
6800 Ultra
20
5800
0
9/22/02 2/4/04 6/18/05 10/31/06 3/14/08
7
GPU vs. CPU Silicon Use
8
NVIDIA GPU Architecture
• A scalable array of
multithreaded Streaming
Multiprocessors (SMs),
each SM consists of
– 8 Scalar Processor (SP)
cores
– 2 special function units for
transcendentals
– A multithreaded
instruction unit
– On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
9
Figure is courtesy of NVIDIA
NVIDIA GeForce9400M G GPU
TPC • 16 streaming processors
Geometry controller
SMC arranged as 2 streaming

SM
I cache
MT issue
SM
I cache
MT issue
multiprocessors
• At 0.8 GHz this provides
C cache C cache
SP SP SP SP
SP SP SP SP
SP SP SP SP – 54 GFLOPS in single-
SP SP SP SP
SFU SFU SFU SFU precision (SP)

Shared Shared
• 128-bit interface to off-

memory memory
Texture units
Texture L1
chip GDDR3 memory
128-bit interconnect
– 21 GB/s bandwidth
L2 ROP ROP L2
DRAM DRAM
10
NVIDIA Tesla C1060 GPU
TPC 1 TPC 10
• 240 streaming
Geometry controller Geometry controller processors arranged
SM
SMC
SM SM SM
SMC
SM SM
as 30 streaming
I cache
MT issue
C cache
I cache
MT issue
C cache
I cache
MT issue
C cache
I cache
MT issue
C cache
I cache
MT issue
C cache
I cache
MT issue
C cache
multiprocessors
• At 1.3 GHz this
SP SP SP SP SP SP SP SP SP SP SP SP
SP SP SP SP SP SP SP SP SP SP SP SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
SP
provides
SFU SFU
Shared
memory
SFU SFU
Shared
memory
SFU
Shared
memory
SFU SFU
Shared
memory
SFU SFU
Shared
memory
SFU SFU
Shared
memory
SFU
– 1 TFLOPS SP
Texture units Texture units – 86.4 GFLOPS DP
• 512-bit interface to
Texture L1 Texture L1
512-bit memory interconnect

off-chip GDDR3
L2 ROP ROP L2
memory
– 102 GB/s bandwidth
DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM
11
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs 4 GB GDDR3

4 GB GDDR3
SDRAM
SDRAM
supply
Power
Tesla GPU
Tesla GPU
NVIDIA
PCI x16
management
SWITCH
Thermal
PCI x16
NVIDIA
SWITCH
monitoring
System
Tesla GPU
Tesla GPU
4 GB GDDR3
SDRAM 4 GB GDDR3
SDRAM
12
GPU Use/Programming
• GPU libraries
– NVIDIA’s CUDA BLAS and FFT libraries
– Many 3rd party libraries
• Low abstraction lightweight GPU
programming toolkits
– CUDA C
– OpenCL
• High abstraction compiler-based tools
– PGI x64+GPU
13
CUDA C APIs
• higher-level API called the CUDA • low-level API called the CUDA driver
runtime API API
– myKernel<<<grid size>>>(args); – cuModuleLoad( &module, binfile );
– cuModuleGetFunction( &func,
module, "mykernel" );
– …
– cuParamSetv( func, 0, &args, 48 );
– …
– cuParamSetSize( func, 48 );
– cuFuncSetBlockShape( func, ts[0],
ts[1], 1 );
– cuLaunchGrid( func, gs[0], gs[1] );
14
Getting Started with NCSA GPU Cluster
• Cluster architecture overview
• How to login and check out a node
• How to compile and run an existing
application
15
NCSA AC GPU Cluster
16
GPU Cluster Architecture
• Servers: 32 • Accelerator Units: 32
– CPU cores: 128 – GPUs: 128
ac
(head node)
ac01 ac02 ac32

(compute node) (compute node) (compute node)
17
GPU Cluster Node Architecture
IB
• HP xw9400 workstation
QDR IB
– 2216 AMD Opteron 2.4 HP xw9400 workstation
GHz dual socket dual
core
– 8 GB DDR2 PCIe x16 PCIe x16
– InfiniBand QDR
PCIe interface PCIe interface
• S1070 1U GPU
Computing Server T10 T10 T10 T10
– 1.3 GHz Tesla T10

processors DRAM DRAM DRAM DRAM
– 4x4 GB GDDR3 SDRAM Tesla S1070
Compute node
18
Accessing the GPU Cluster
• Use Secure Shell (SSH) client to access AC
– ssh USER@ac.ncsa.uiuc.edu (User: tra1 – tra54; Password: ???)
– You will see something like this printed out:
Mar 30, 2010

ac and all compute nodes updated to the release version of CUDA 3.0 from the
beta version of 3.0, which has been in place for months.
May 11, 2010

GeForce 480 Fermi based GPU now available on ac login node, in addition to
a GeForce 280 (more similar to Tesla GPUs)
Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu

(for password resets, please contact help@ncsa.uiuc.edu)
13:52:30 up 10 days, 22:23, 32 users, load average: 0.54, 0.29, 0.20

Job state_count = Transit:0 Queued:101 Held:0 Waiting:0 Running:111 Exiting:0
[tra1@ac ~]$ _
19
Installing Tutorial Examples
• Run this sequence to retrieve and install
tutorial examples:
cd
cp /tmp/tutorial.tgz .
tar -xvzf tutorial.tgz
cd tutorial
ls
src1 src2 src3
20
Accessing the GPU Cluster
Laptop 1 Laptop 2 Laptop 30
ac
You are here (head node)
ac01 ac02 ac32

21
Requesting a Cluster Node for
Interactive Use
• Run qstat to see what other users do
• Run qsub -I -l walltime=02:00:00 to request a

node with a single GPU for 2 hours of interactive
use
– You will see something like this printed out:
qsub: waiting for job 64424.acm to start

qsub: job 64424.acm ready
[gpuXYZ@ac32 ~]$ _
22
Requesting a Cluster Node
Laptop 1 Laptop 2 Laptop 30
ac
(head node) You are here
ac01 ac02 ac32

7/26/2009 23
Some useful utilities installed on AC
• As part of NVIDIA driver
– nvidia-smi (NVIDIA System Management Interface
program)
• As part of NVIDIA CUDA SDK
– deviceQuery
– bandwidthTest
• As part of CUDA wrapper
– wrapper_query
– showgputime/showallgputime (works from the head
node only)
24
nvidia-smi
Timestamp : Mon May 24 14:39:28 2010
Unit 0: Unit 1:
Product Name : NVIDIA Tesla S1070-400 Turn-key Product Name : NVIDIA Tesla S1070-400 Turn-key
Product ID : 920-20804-0006 Product ID : 920-20804-0006
Serial Number : 0324708000059 Serial Number : 0324708000059
Firmware Ver : 3.6 Firmware Ver : 3.6
Intake Temperature : 22 C Intake Temperature : 22 C
GPU 0: GPU 0:
Product Name : Tesla T10 Processor Product Name : Tesla T10 Processor
Serial : 2624258902399 Serial : 1930554578325
PCI ID : 5e710de PCI ID : 5e710de
Bridge Port : 0 Bridge Port : 0
Temperature : 33 C Temperature : 33 C
GPU 1: GPU 1:
Product Name : Tesla T10 Processor Product Name : Tesla T10 Processor
Serial : 2624258902399 Serial : 1930554578325
PCI ID : 5e710de PCI ID : 5e710de
Bridge Port : 2 Bridge Port : 2
Temperature : 30 C Temperature : 30 C
Fan Tachs: Fan Tachs:
#00: 3566 Status: NORMAL #00: 3584 Status: NORMAL
… …
PSU: PSU:
Voltage : 12.01 V Voltage : 11.99 V
Current : 19.14 A Current : 19.14 A
State : Normal State : Normal
LED: LED:
State : GREEN State : GREEN
25
deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "Tesla T10 Processor"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294770688 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive (only one host thread at a time can use this device)
26
wrapper_query
cuda_wrapper info:
version=2
• There are 4 GPUs per
userID=21783 cluster node
pid=-1
nGPU=1 • When requesting a node,
physGPU[0]=2 we can specify how may
key_env_var=
allow_user_passthru=1 GPUs should be allocated
affinity:
GPU=0, CPU=0 2
– e.g., `-l nodes=1:ppn=4`
GPU=1, CPU=0 2 in qsub resources string
GPU=2, CPU=1 3
GPU=3, CPU=1 3 will result in all 4 GPUs
cudaAPI = Unknown allocated
walltime = 10.228021 seconds
gpu_kernel_time = 0.000000 seconds • By default, only one GPU
gpu_usage = 0.00%
per node is allocated
27
Compiling and Running an Existing
Application
• cd tutorial/src1
– vecadd.c - reference C implementation
– vecadd.cu – CUDA implementation
• Compile & run CPU version

gcc vecadd.c -o vecadd_cpu
./vecadd_cpu
Running CPU vecAdd for 16384 elements
C[0]=2147483648.00 ...
• Compile & run GPU version

nvcc vecadd.cu -o vecadd_gpu
./vecadd_gpu
Running GPU vecAdd for 16384 elements
C[0]=2147483648.00 ...
28
nvcc
• Any source file containing CUDA C language
extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other
tools to accomplish the job
• Basic nvcc usage
– nvcc <filename>.cu [-o <executable>]
• Builds release mode
– nvcc -deviceemu <filename>.cu
• Builds device emulation mode (all code runs on CPU)
– -g flag allows to build debug mode for gdb debugger
– nvcc --version
29
Anatomy of a GPU Application
• Host side
• Device side
30
Reference CPU Version
void vecAdd(int N, float* A, float* B, float* C) {
Computational kernel
for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
}
int main(int argc, char **argv)

{
int N = 16384; // default vector size
float *A = (float*)malloc(N * sizeof(float));

float *B = (float*)malloc(N * sizeof(float)); Memory allocation
float *C = (float*)malloc(N * sizeof(float));
Kernel invocation
vecAdd(N, A, B, C); // call compute kernel
free(A); free(B); free(C);

Memory de-allocation
}
31
Adding GPU support
Host GPU card
CPU GPU
Host Device
Memory Memory
A gA
B gB
C gC
32
Memory Spaces
• CPU and GPU have separate memory spaces
– Data is moved across PCIe bus
– Use functions to allocate/set/copy memory on GPU
• Host (CPU) manages device (GPU) memory
– cudaMalloc(void** pointer, size_t nbytes)
– cudaFree(void* pointer)
– cudaMemcpy(void* dst, void* src, size_t nbytes, enum
cudaMemcpyKind direction);
• returns after the copy is complete
• blocks CPU thread until all bytes have been copied
• does not start copying until previous CUDA calls complete
– enum cudaMemcpyKind
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
33
Adding GPU support
int main(int argc, char **argv)
{
int N = 16384; // default vector size
Memory allocation
float *A = (float*)malloc(N * sizeof(float));
on the GPU card
float *B = (float*)malloc(N * sizeof(float));
float *C = (float*)malloc(N * sizeof(float));
float *devPtrA, *devPtrB, *devPtrC; Copy data from the

CPU (host) memory
cudaMalloc((void**)&devPtrA, N * sizeof(float)); to the GPU (device)
cudaMalloc((void**)&devPtrB, N * sizeof(float)); memory
cudaMalloc((void**)&devPtrC, N * sizeof(float));
cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
34
Adding GPU support
Kernel invocation
vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);
cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(devPtrA);
Copy results from
cudaFree(devPtrB);
device memory to
cudaFree(devPtrC); the host memory
Device memory
free(A);
de-allocation
free(B);
free(C);
}
35
GPU Kernel
• CPU version
void vecAdd(int N, float* A, float* B, float* C)
{
for (int i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
• GPU version
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
}
36
Introduction to GPU programming
• CUDA programming model
• GPU memory hierarchy
• CUDA C
37
CUDA Programming Model
• A CUDA kernel is executed by threadID
an array of threads
– All threads run the same code (SPMD)‫‏‬ …
float x = input[threadID];
float y = func(x);
– Each thread has an ID that it uses output[threadID] = y;
…
to compute memory addresses and
make control decisions
• Threads are arranged as a grid of thread blocks
– Threads within Grid
a block have access Thread Block 0 Thread Block 1 Thread Block N-1
to a segment of …
shared memory Shared memory Shared memory Shared memory
38
Kernel Invocation Syntax
grid & thread block dimensionality
vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);
Grid
Thread Block 0 Thread Block 1 Thread Block N-1
…
Shared memory Shared memory Shared memory
int i = blockIdx.x * blockDim.x + threadIdx.x;

block ID within a grid number of threads per block thread ID within a thread block
39
Mapping Threads to the Hardware
• Blocks of threads are transparently • Blocks must be independent
assigned to SMs – Any possible interleaving of blocks
– A block of threads executes on one should be valid
SM & does not migrate – Blocks may coordinate but not
– Several blocks can reside synchronize
concurrently on one SM – Thread blocks can run in any order
Device Kernel grid

Device
Block 0 Block 1
Block 2 Block 3
Block 0 Block 1 Block 4 Block 5

Block 6 Block 7
time
Block 2 Block 3
Block 4 Block 5
Each block can execute in any
Block 6 Block 7 order relative to other blocks.
40
Slide is courtesy of NVIDIA
CUDA Programming Model
• A kernel is executed as a Host Device
Grid 1
grid of thread blocks
– Grid of blocks can be 1 or 2- Kernel Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
1
dimentional
– Thread blocks can be 1, 2, or Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
3-dimensional
• Different kernels can have
Grid 2
different grid/block
Kernel
configuration 2
• Threads from the same Block (1, 1)
block have access to a (0,0,1) (1,0,1) (2,0,1) (3,0,1)
shared memory and their

execution can be Thread Thread Thread Thread
(0,0,0) (1,0,0) (2,0,0) (3,0,0)
synchronized Thread Thread Thread Thread

(0,1,0) (1,1,0) (2,1,0) (3,1,0)
41
Slide is courtesy of NVIDIA
GPU Memory Hierarchy
• Global (device) memory
– Accessible by all threads as well as host (CPU)
– Data lifetime is from allocation to deallocation
Device 0
memory
Host memory cudaMemcpy()
Device 1
memory
42
• Global (device) memory
Kernel 0
…
Per-device
Global
Kernel 1 Memory
43
• Local storage • Shared memory
– Each thread has own local – Each thread block has own
storage shared memory
– Mostly registers (managed by • Accessible only by threads
the compiler) within that block
– Data lifetime = thread lifetime – Data lifetime = block lifetime
Thread Block
Per-block
Per-thread
shared
local memory
memory
44
• 1D grid Grid of 2 thread blocks
block 0 block 1
– 2 thread blocks
Shared memory Shared memory
• 1D block
registers registers registers registers
– 2 threads
thread 0 thread 1 thread 0 thread 1
Global memory
Host memory
Constant memory
45
Host Device
CPU DRAM GPU
Multiprocessor
Multiprocessor
local Multiprocessor
chipset
global registers shared
memory
constant
DRAM texture constant and texture caches
Memory Location Cached Access Scope Lifetime

Register On-chip N/A R/W One thread Thread
Local Off-chip No R/W One thread Thread
Shared On-chip N/A R/W All threads in a block Block
Global Off-chip No R/W All threads + host Application
Constant Off-chip Yes R All threads + host Application
Texture Off-chip Yes R All threads + host Application
46

Singapore p1

Uploaded by

Copyright:

Available Formats

Singapore p1

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Singapore p1

Uploaded by

Copyright:

Available Formats

Introduction to GPU

Volodymyr (Vlad) Kindratenko

SMC arranged as 2 streaming

SFU SFU SFU SFU precision (SP)

• 128-bit interface to off-

512-bit memory interconnect

• 4 T10 GPUs 4 GB GDDR3

ac01 ac02 ac32

– 1.3 GHz Tesla T10

– 4x4 GB GDDR3 SDRAM Tesla S1070

– You will see something like this printed out:

Mar 30, 2010

May 11, 2010

Questions? Contact Jeremy Enos jenos@ncsa.uiuc.edu

13:52:30 up 10 days, 22:23, 32 users, load average: 0.54, 0.29, 0.20

src1 src2 src3

ac01 ac02 ac32

• Run qsub -I -l walltime=02:00:00 to request a

qsub: waiting for job 64424.acm to start

ac01 ac02 ac32

• Compile & run CPU version

• Compile & run GPU version

int main(int argc, char **argv)

float *A = (float*)malloc(N * sizeof(float));

free(A); free(B); free(C);

float *devPtrA, *devPtrB, *devPtrC; Copy data from the

cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

int i = blockIdx.x * blockDim.x + threadIdx.x;

Device Kernel grid

Block 0 Block 1 Block 4 Block 5

• Threads from the same Block (1, 1)

block have access to a (0,0,1) (1,0,1) (2,0,1) (3,0,1)

shared memory and their

synchronized Thread Thread Thread Thread

Host memory cudaMemcpy()

Memory Location Cached Access Scope Lifetime

You might also like

float A = (float)malloc(N * sizeof(float));

float devPtrA, devPtrB, *devPtrC; Copy data from the