
Parallel and Distributed Computing

Lecture 14 / 15

Overview of CUDA (Part 1)

Olessia Barkovskaya, KHTURE, 2015


Electronic Computers Department
Course information
 Instructor: Olessia Yurievna Barkovskaya
 Electronic Computers Department
 lesuwa@mail.ru
 Consultation hours:
Topics
 Introduction - today’s lecture
 System Architectures (Single Instruction - Single Data, Single Instruction - Multiple Data, Multiple Instruction - Multiple Data, Shared Memory, Distributed Memory, Cluster, Multiple Instruction - Single Data)
 Performance Analysis of parallel calculations (speedup, efficiency, execution time of an algorithm, …)
 Parallel numerical methods (Principles of Parallel Algorithm Design, Analytical Modeling of Parallel Programs, Matrix Operations, Matrix-Vector Operations, Graph Algorithms, …)
 Software (Programming Using the Message-Passing Interface, OpenMP, CUDA, …)
What will you learn today?
 CUDA goals
 CUDA programming model
 CUDA memory model
What is CUDA?

 CUDA - Compute Unified Device Architecture
 General-purpose computation on commodity graphics hardware (GPUs)
 CUDA is a scalable parallel programming model and a software environment for parallel computing
  Minimal extensions to the familiar C/C++ environment
  Heterogeneous serial-parallel programming model
 NVIDIA’s TESLA architecture accelerates CUDA
  Exposes the computational horsepower of NVIDIA GPUs
  Enables GPU computing
 CUDA also maps well to multicore CPUs
CUDA philosophy
Up until now:
 The GPU could only be programmed through a graphics API
 GPU memory could be read in a general way (gather) but not written generally (no scatter)

CUDA:
 A hardware and programming model that overcomes these problems and exposes the GPU as a truly generic data-parallel computing device
CUDA goals
 Scale to 100’s of cores, 1000’s of parallel threads
 Let programmers focus on parallel algorithms
  not on the mechanics of a parallel programming language
 Enable heterogeneous systems (i.e., CPU + GPU)
  CPU and GPU are separate devices with separate DRAMs
CUDA defines

 Programming model
 Memory model

Definitions:
 Device = GPU
 Host = CPU
 Kernel = function that runs on the device

Data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device.
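As a concrete illustration of these definitions (the kernel name, data, and launch configuration below are hypothetical, not taken from the lecture), a kernel is an ordinary C function marked __global__ that runs on the device, and the host launches it with an execution configuration:

// Hypothetical kernel: each device thread increments one array element.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against surplus threads
        data[i] += 1;
}

// Host side: off-load the data-parallel work onto the device.
// add_one<<<numBlocks, threadsPerBlock>>>(d_data, n);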
History
The evolution of graphics cards starts with the MDA (Monochrome Display Adapter) video adapter of 1981. It supported only the text mode of monochrome displays; the main task of this graphics controller was to output the contents of graphics memory to the monitor.

Only 18 years later was the first GPU (graphics processing unit) created: the nVidia GeForce 256.
The evolution of graphical processing units
 First generation: MPEG-2 support, 3D graphics
 Second generation: the GPU could execute part of the CPU’s computations, but with low performance
 Third generation: it became possible to write a program that computes the color of a pixel on the screen
 Fourth generation: shader support appeared, and general-purpose computing on GPUs (GPGPU - General Purpose GPU) began to develop. Examples: nVidia GeForce 5-7 and ATI Radeon 9500-X800
 Fifth generation: support for integer operations as well as double-precision operations; the first stream programming libraries for GPGPU appeared (nVidia CUDA, AMD FireStream)
High Performance Computing - Supercomputing with Tesla GPUs
 TESLA DATA CENTER SOLUTIONS
 TESLA WORKSTATION SOLUTIONS
 Industry Software Solutions
CUDA Programming Model
 A kernel is executed by a grid of thread blocks
 A thread block is a batch of threads that can cooperate with each other by:
  Sharing data through shared memory
  Synchronizing their execution
 Threads from different blocks cannot cooperate
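A minimal sketch of such cooperation (the kernel and its fixed block size of 256 threads are illustrative assumptions, not part of the lecture): each block loads its slice of an array into shared memory, waits at a barrier, and then reads values written by other threads of the same block.

// Hypothetical kernel: each block reverses its own 256-element slice of data.
__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[256];            // visible to all threads of this block

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t; // global index (grid assumed to cover the array)

    tile[t] = data[i];                   // each thread writes one element to shared memory
    __syncthreads();                     // synchronize: wait until the whole block has written

    data[i] = tile[blockDim.x - 1 - t];  // read an element written by another thread
}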
CUDA memory model

 • Kernels are launched in grids (one kernel executes at a time)
 • A thread block executes on one multiprocessor
 • Several blocks can reside concurrently on one multiprocessor
   - the number is limited by multiprocessor resources
 • Registers are partitioned among all resident threads
 • Shared memory is partitioned among all resident thread blocks
Physical Memory Layout
 “Local” memory resides in device DRAM
  Use registers and shared memory to minimize local memory use
 Host can read and write global memory but not shared memory
Simple Hardware View
8-Series Architecture (G80)
 128 thread processors execute kernel threads
 16 multiprocessors, each contains
 8 thread processors
 Shared memory enables thread cooperation
Simple Hardware View
10-Series Architecture
 240 thread processors execute kernel threads
 30 multiprocessors, each contains
 8 thread processors
 One double-precision unit
 Shared memory enables thread cooperation
Execution Model
 Threads are executed by thread processors
 Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)
 A kernel is launched as a grid of thread blocks
  Only one kernel can execute on a device at one time
CUDA Installation

 A CUDA installation consists of:
  Driver
  CUDA Toolkit (compiler, libraries)
  CUDA SDK (example codes)
CUDA Software Development
Managing Memory
 CPU and GPU have separate memory spaces
 Host (CPU) code manages device (GPU) memory:
  Allocate / free
  Copy data to and from the device
  Applies to global device memory (DRAM)
GPU Memory Allocation / Release

 cudaMalloc(void **pointer, size_t nbytes)
 cudaMemset(void *pointer, int value, size_t count)
 cudaFree(void *pointer)

int n = 1024;
int nbytes = n * sizeof(int);        // size of the allocation in bytes
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);    // allocate device (global) memory
cudaMemset(d_a, 0, nbytes);          // zero-initialize the device array
cudaFree(d_a);                       // release the device memory
Data Copies
cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);

 direction specifies the locations (host or device) of src and dst
 Blocks the CPU thread: returns after the copy is complete
 Doesn’t start copying until previous CUDA calls complete

 enum cudaMemcpyKind
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice
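For instance, a copy that stays entirely within device DRAM uses the same call with the device-to-device direction flag (d_src and d_dst are hypothetical device pointers, each at least nbytes bytes):

// Copy nbytes bytes from one device buffer to another; no host memory is involved.
cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);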
Host Synchronization
 All kernel launches are asynchronous
  control returns to the CPU immediately
  the kernel executes after all previous CUDA calls have completed

 cudaMemcpy() is synchronous
  control returns to the CPU after the copy completes
  the copy starts after all previous CUDA calls have completed

 cudaThreadSynchronize()
  blocks until all previous CUDA calls complete
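A short illustration of why this matters (the kernel name and launch configuration are hypothetical): because a launch returns immediately, the host must synchronize before it times the kernel or relies on results the kernel produced.

my_kernel<<<numBlocks, threadsPerBlock>>>(d_a);  // asynchronous: control returns to the CPU at once
cudaThreadSynchronize();                         // block until the kernel and all previous
                                                 // CUDA calls have completed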
Example: Host Code
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
increment_gpu<<< N/blockSize, blockSize >>>(d_A, b);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);
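The increment_gpu kernel itself is not shown on this slide; a plausible definition (an assumption, which presumes N is a multiple of blockSize so no bounds check is needed) would be:

// Possible device-side counterpart of the host code above:
// each thread adds b to one element of the array.
__global__ void increment_gpu(float *a, float b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    a[idx] = a[idx] + b;
}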
Device Management
 CPU can query and select GPU devices
  cudaGetDeviceCount( int* count )
  cudaSetDevice( int device )
  cudaGetDevice( int *current_device )
  cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  cudaChooseDevice( int *device, cudaDeviceProp* prop )
 Multi-GPU setup:
  device 0 is used by default
  one CPU thread can control one GPU
  multiple CPU threads can control the same GPU
   - calls are serialized by the driver
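A hedged fragment of host code putting these calls together (the loop and printed format are illustrative; requires <stdio.h> and the CUDA runtime headers):

int count = 0;
cudaGetDeviceCount(&count);                     // number of CUDA-capable devices
for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);        // query the device properties
    printf("Device %d: %s\n", dev, prop.name);  // prop.name holds the device name
}
cudaSetDevice(0);                               // subsequent CUDA calls use device 0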
Context Management
 A CUDA context is analogous to a CPU process
  Each context has its own address space
 A context is created with cuCtxCreate()
 A host CPU thread can only have one context current at a time
 Each host CPU thread can have a stack of current contexts
 cuCtxPopCurrent() and cuCtxPushCurrent() can be used to detach a context and push it to a new thread
 cuCtxAttach() and cuCtxDetach() increment and decrement the usage count and allow for interoperability of code in the same context (e.g., libraries)
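A minimal driver-API sketch of this flow (a fragment of host code using cuda.h; device 0 and the zero flags value are illustrative, error checking omitted):

CUdevice  dev;
CUcontext ctx;

cuInit(0);                  // initialize the driver API
cuDeviceGet(&dev, 0);       // take the first device
cuCtxCreate(&ctx, 0, dev);  // create a context; it becomes current for this host thread

cuCtxPopCurrent(&ctx);      // detach: the context is no longer current in this thread
// ...the popped context could now be pushed current in another host thread...
cuCtxPushCurrent(ctx);      // make it current again here

cuCtxDestroy(ctx);          // release the context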
CUDA libraries
 BLAS
 FFT
CUBLAS
 Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver
 Self-contained at the API level, no direct interaction with the CUDA driver

 Basic model for use (see the sketch below):
  Create matrix and vector objects in GPU memory space
  Fill objects with data
  Call CUBLAS functions
  Retrieve data

 CUBLAS library helper functions:
  Creating and destroying data in GPU space
  Writing data to and retrieving data from objects
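A hedged sketch of that model using the legacy CUBLAS helper functions (the vector length and the scaling operation are arbitrary choices; a fragment of host code, error checking omitted):

int n = 1024;
float *h_x = (float*) malloc(n * sizeof(float));    // host data, assumed already filled
float *d_x = 0;

cublasInit();                                       // initialize CUBLAS

cublasAlloc(n, sizeof(float), (void**)&d_x);        // create a vector object in GPU memory
cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // fill the object with host data

cublasSscal(n, 2.0f, d_x, 1);                       // call a CUBLAS function: x = 2 * x

cublasGetVector(n, sizeof(float), d_x, 1, h_x, 1);  // retrieve the result back to the host
cublasFree(d_x);                                    // destroy the GPU object
cublasShutdown();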
CUFFT
 CUFFT is the CUDA FFT library (a usage sketch follows below)
 1D, 2D, and 3D transforms of complex and real single-precision data
 Batched execution for multiple 1D transforms in parallel
 1D transforms up to 8 million elements
 2D and 3D transforms with sizes in the range [2, 16384]
 In-place and out-of-place transforms
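A minimal CUFFT sketch (the transform size NX is an arbitrary example; a fragment of host code, error checking omitted): plan a 1D complex-to-complex transform, execute it in place on device data, and destroy the plan.

#define NX 256

cufftHandle   plan;
cufftComplex *d_data = 0;

cudaMalloc((void**)&d_data, NX * sizeof(cufftComplex));  // signal assumed already on the device

cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                    // 1D complex-to-complex plan, batch = 1
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);       // in-place forward transform

cufftDestroy(plan);
cudaFree(d_data);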
Textbooks
 NVIDIA CUDA official site
 Jason Sanders, Edward Kandrot, “CUDA by Example: An Introduction to General-Purpose GPU Programming”
 Wen-mei W. Hwu, “GPU Computing Gems Emerald Edition” (Applications of GPU Computing Series)
 David B. Kirk, Wen-mei W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach” (Applications of GPU Computing Series)
