
Parallel and Distributed Computing

Lecture 14 / 15

Overview of CUDA (Part 1)

Olessia Barkovskaya, KHTURE, 2015


Electronic Computers Department
Course information
 Instructor: Olessia Yurievna Barkovskaya
 Electronic Computers Department
 lesuwa@mail.ru
 Consultation hours:
Topics
 Introduction - today’s lecture
 System Architectures (Single Instruction - Single Data, Single Instruction - Multiple Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory, Cluster, Multiple Instruction - Single Data)
 Performance Analysis of parallel calculations (speedup, efficiency, execution time of an algorithm, …)
 Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical Modeling of Parallel Programs, Matrix Operations, Matrix-Vector Operations, Graph Algorithms, …)
 Software (Programming Using the Message-Passing Interface, OpenMP, CUDA, …)
What will you learn today?
 CUDA goals
 CUDA programming model
 CUDA memory model
What is CUDA?

 CUDA - Compute Unified Device Architecture
 General-purpose computation on commodity graphics hardware (GPUs)
 CUDA is a scalable parallel programming model and a software environment for parallel computing
  Minimal extensions to the familiar C/C++ environment
  Heterogeneous serial-parallel programming model
 NVIDIA’s TESLA architecture accelerates CUDA
  Exposes the computational horsepower of NVIDIA GPUs
  Enables GPU computing
 CUDA also maps well to multicore CPUs
CUDA philosophy
Up until now:
 The GPU could only be programmed through a graphics API
 GPU memory could be read in a general way (gather) but not written generally (no scatter)

CUDA:
 A hardware and programming model that overcomes these problems and exposes the GPU as a truly generic data-parallel computing device
CUDA goals
 Scale to 100’s of cores, 1000’s of parallel threads
 Let programmers focus on parallel algorithms
  not on the mechanics of a parallel programming language
 Enable heterogeneous systems (i.e., CPU + GPU)
  CPU and GPU are separate devices with separate DRAMs
CUDA defines

 Programming model
 Memory model

Definitions:
 Device = GPU
 Host = CPU
 Kernel = function that runs on the device

Data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.
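As a concrete illustration of these definitions (the kernel name, data, and launch configuration below are hypothetical, not taken from the lecture), a kernel is an ordinary C function marked __global__ that runs on the device, and the host launches it with an execution configuration:

// Hypothetical kernel: each device thread increments one array element.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against surplus threads
        data[i] += 1;
}

// Host side: off-load the data-parallel work onto the device.
// add_one<<<numBlocks, threadsPerBlock>>>(d_data, n);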
History
The evolution of graphics cards starts with the MDA (Monochrome Display Adapter) video adapter of 1981. It supported only the text mode of monochrome displays; the main task of this graphics controller was to output the contents of graphics memory to the monitor.

Only 18 years later was the first GPU (graphics processing unit) created: the nVidia GeForce 256.
The evolution of graphical processing units
 First generation: MPEG-2 support, 3D graphics
 Second generation: the GPU could execute part of the CPU’s computations, but with low performance
 Third generation: it became possible to write a program that computes the color of a pixel on the screen
 Fourth generation: shader support appeared, and general-purpose computing on GPUs (GPGPU - General Purpose GPU) began to develop. Examples: nVidia GeForce 5-7 and ATI Radeon 9500-X800
 Fifth generation: support for integer operations as well as double-precision operations; the first stream programming libraries for GPGPU appeared (nVidia CUDA, AMD FireStream)
High Performance Computing - Supercomputing with Tesla GPUs
 TESLA DATA CENTER SOLUTIONS
 TESLA WORKSTATION SOLUTIONS
 Industry Software Solutions
CUDA Programming Model
 A kernel is executed by a grid of thread blocks
 A thread block is a batch of threads that can cooperate with each other by:
  Sharing data through shared memory
  Synchronizing their execution
 Threads from different blocks cannot cooperate
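A minimal sketch of such cooperation (the kernel and its fixed block size of 256 threads are illustrative assumptions, not part of the lecture): each block loads its slice of an array into shared memory, waits at a barrier, and then reads values written by other threads of the same block.

// Hypothetical kernel: each block reverses its own 256-element slice of data.
__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[256];            // visible to all threads of this block

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t; // global index (grid assumed to cover the array)

    tile[t] = data[i];                   // each thread writes one element to shared memory
    __syncthreads();                     // synchronize: wait until the whole block has written

    data[i] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}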
CUDA memory model

 • Kernels are launched in grids (one kernel executes at a time)
 • A thread block executes on one multiprocessor
 • Several blocks can reside concurrently on one multiprocessor
   - the number is limited by multiprocessor resources
 • Registers are partitioned among all resident threads
 • Shared memory is partitioned among all resident thread blocks
Physical Memory Layout
 “Local” memory resides in device DRAM
  Use registers and shared memory to minimize local memory use
 Host can read and write global memory but not shared memory
Simple Hardware View
8-Series Architecture (G80)
 128 thread processors execute kernel threads
 16 multiprocessors, each contains
 8 thread processors
 Shared memory enables thread cooperation
Simple Hardware View
10-Series Architecture
 240 thread processors execute kernel threads
 30 multiprocessors, each contains
 8 thread processors
 One double-precision unit
 Shared memory enables thread cooperation
Execution Model
 Threads are executed by thread processors
 Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)
 A kernel is launched as a grid of thread blocks
  Only one kernel can execute on a device at one time
CUDA Installation

 A CUDA installation consists of:
  Driver
  CUDA Toolkit (compiler, libraries)
  CUDA SDK (example codes)
CUDA Software Development
Managing Memory
 CPU and GPU have separate memory spaces
 Host (CPU) code manages device (GPU) memory:
  Allocate / free
  Copy data to and from the device
  Applies to global device memory (DRAM)
GPU Memory Allocation / Release

 cudaMalloc(void **pointer, size_t nbytes)
 cudaMemset(void *pointer, int value, size_t count)
 cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);        // size of the allocation in bytes
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);    // allocate device (global) memory
cudaMemset(d_a, 0, nbytes);          // zero-initialize the device array
cudaFree(d_a);                       // release the device memory
Data Copies
cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);

 direction specifies the locations (host or device) of src and dst
 Blocks the CPU thread: returns after the copy is complete
 Doesn’t start copying until previous CUDA calls complete

 enum cudaMemcpyKind
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice
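For instance, a copy that stays entirely within device DRAM uses the same call with the device-to-device direction flag (d_src and d_dst are hypothetical device pointers, each at least nbytes bytes):

// Copy nbytes bytes from one device buffer to another; no host memory is involved.
cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);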
Host Synchronization
 All kernel launches are asynchronous
  control returns to the CPU immediately
  the kernel executes after all previous CUDA calls have completed

 cudaMemcpy() is synchronous
  control returns to the CPU after the copy completes
  the copy starts after all previous CUDA calls have completed

 cudaThreadSynchronize()
  blocks until all previous CUDA calls complete
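A short illustration of why this matters (the kernel name and launch configuration are hypothetical): because a launch returns immediately, the host must synchronize before it times the kernel or relies on results the kernel produced.

my_kernel<<<numBlocks, threadsPerBlock>>>(d_a);  // asynchronous: control returns to the CPU at once
cudaThreadSynchronize();                         // block until the kernel and all previous
                                                 // CUDA calls have completed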
Example: Host Code
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
increment_gpu<<< N/blockSize, blockSize >>>(d_A, b);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);
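The increment_gpu kernel itself is not shown on this slide; a plausible definition (an assumption, which presumes N is a multiple of blockSize so no bounds check is needed) would be:

// Possible device-side counterpart of the host code above:
// each thread adds b to one element of the array.
__global__ void increment_gpu(float *a, float b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    a[idx] = a[idx] + b;
}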
Device Management
 CPU can query and select GPU devices
  cudaGetDeviceCount( int* count )
  cudaSetDevice( int device )
  cudaGetDevice( int *current_device )
  cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  cudaChooseDevice( int *device, cudaDeviceProp* prop )
 Multi-GPU setup:
  device 0 is used by default
  one CPU thread can control one GPU
  multiple CPU threads can control the same GPU
   - calls are serialized by the driver
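A hedged fragment of host code putting these calls together (the loop and printed format are illustrative; requires <stdio.h> and the CUDA runtime headers):

int count = 0;
cudaGetDeviceCount(&count);                     // number of CUDA-capable devices
for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);        // query the device properties
    printf("Device %d: %s\n", dev, prop.name);  // prop.name holds the device name
}
cudaSetDevice(0);                               // subsequent CUDA calls use device 0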
Context Management
 A CUDA context is analogous to a CPU process
  Each context has its own address space
 A context is created with cuCtxCreate()
 A host CPU thread can only have one context current at a time
 Each host CPU thread can have a stack of current contexts
 cuCtxPopCurrent() and cuCtxPushCurrent() can be used to detach a context and push it to a new thread
 cuCtxAttach() and cuCtxDetach() increment and decrement the usage count and allow for interoperability of code in the same context (e.g., libraries)
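A minimal driver-API sketch of this flow (a fragment of host code using cuda.h; device 0 and the zero flags value are illustrative, error checking omitted):

CUdevice  dev;
CUcontext ctx;

cuInit(0);                  // initialize the driver API
cuDeviceGet(&dev, 0);       // take the first device
cuCtxCreate(&ctx, 0, dev);  // create a context; it becomes current for this host thread

cuCtxPopCurrent(&ctx);      // detach: the context is no longer current in this thread
// ...the popped context could now be pushed current in another host thread...
cuCtxPushCurrent(ctx);      // make it current again here

cuCtxDestroy(ctx);          // release the context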
CUDA libraries
 BLAS
 FFT
CUBLAS
 Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver
 Self-contained at the API level, no direct interaction with the CUDA driver

 Basic model for use (see the sketch below):
  Create matrix and vector objects in GPU memory space
  Fill objects with data
  Call CUBLAS functions
  Retrieve data

 CUBLAS library helper functions:
  Creating and destroying data in GPU space
  Writing data to and retrieving data from objects
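A hedged sketch of that model using the legacy CUBLAS helper functions (the vector length and the scaling operation are arbitrary choices; a fragment of host code, error checking omitted):

int n = 1024;
float *h_x = (float*) malloc(n * sizeof(float));    // host data, assumed already filled
float *d_x = 0;

cublasInit();                                       // initialize CUBLAS

cublasAlloc(n, sizeof(float), (void**)&d_x);        // create a vector object in GPU memory
cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // fill the object with host data

cublasSscal(n, 2.0f, d_x, 1);                       // call a CUBLAS function: x = 2 * x

cublasGetVector(n, sizeof(float), d_x, 1, h_x, 1);  // retrieve the result back to the host
cublasFree(d_x);                                    // destroy the GPU object
cublasShutdown();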
CUFFT
 CUFFT is the CUDA FFT library (a usage sketch follows below)
 1D, 2D, and 3D transforms of complex and real single-precision data
 Batched execution for multiple 1D transforms in parallel
 1D transforms up to 8 million elements
 2D and 3D transforms with sizes in the range [2, 16384]
 In-place and out-of-place transforms
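A minimal CUFFT sketch (the transform size NX is an arbitrary example; a fragment of host code, error checking omitted): plan a 1D complex-to-complex transform, execute it in place on device data, and destroy the plan.

#define NX 256

cufftHandle   plan;
cufftComplex *d_data = 0;

cudaMalloc((void**)&d_data, NX * sizeof(cufftComplex));  // signal assumed already on the device

cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                    // 1D complex-to-complex plan, batch = 1
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);       // in-place forward transform

cufftDestroy(plan);
cudaFree(d_data);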
Textbooks
 NVIDIA CUDA official site
 Jason Sanders, Edward Kandrot, “CUDA by Example: An Introduction to General-Purpose GPU Programming”
 Wen-mei W. Hwu, “GPU Computing Gems Emerald Edition” (Applications of GPU Computing Series)
 David B. Kirk, Wen-mei W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach” (Applications of GPU Computing Series)
