Lecture: CUDA - 1 PDCn
Lecture 14 / 15
Overview of CUDA - 1
lesuwa@mail.ru
Consultation hours:
Topics
Introduction - today’s lecture
CUDA
A hardware and programming model that overcomes these
problems and exposes the GPU as a truly generic data-
parallel computing device.
CUDA goals
Scale to 100’s of cores, 1000’s of parallel threads
Programming model
Memory model
Definitions:
Device = GPU;
Host = CPU
Kernel = function that runs on the device
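Putting these definitions together, a minimal sketch of a kernel and its launch (the kernel name `add_one` and the launch configuration are illustrative, not from the slides):

```cuda
// Kernel: a function, marked __global__, that runs on the device (GPU)
__global__ void add_one(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        a[i] += 1;                                  // each thread handles one element
}

// Host (CPU) code launches the kernel over many parallel threads:
//   add_one<<<numBlocks, threadsPerBlock>>>(d_a, n);
```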
The first GPU (graphics processing unit), the nVidia GeForce 256,
was created only 18 years later.
The evolution of graphical processing
units
First generation (MPEG-2 support, 3d graphics)
Second generation (GPU could execute part of CPU
computations but with low performance)
Third generation (the opportunity to write a program
to compute the color of a pixel on the screen
appeared)
Fourth generation (shader support appeared, the
direction of general computing on GPU began to
develop – GPGPU (General Purpose GPU )).
Examples - nVidia GeForce 5 – 7 and ATI Radeon
9500 – X800.
Fifth generation (support for integer operations as well
as double-precision operations; the first stream-
programming libraries for GPGPU appeared - nVidia
CUDA, AMD FireStream)
High Performance Computing -
Supercomputing with Tesla GPUs
int n = 1024;
int nbytes = n * sizeof(int);          // buffer size in bytes
int *d_a = 0;                          // pointer to device memory
cudaMalloc( (void**)&d_a, nbytes );    // allocate on the device
cudaMemset( d_a, 0, nbytes );          // zero the device buffer
cudaFree( d_a );                       // release device memory
Data Copies
cudaMemcpy ( void *dst,
const void *src,
size_t nbytes,
enum cudaMemcpyKind direction);
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
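A typical round trip, continuing the allocation example above (the host buffer name `h_a` is illustrative):

```cuda
int n = 1024;
int nbytes = n * sizeof(int);
int *h_a = (int*) malloc(nbytes);      // host buffer
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );    // device buffer

// host -> device
cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );

// ... launch kernels that operate on d_a ...

// device -> host
cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );

cudaFree(d_a);
free(h_a);
```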
Host Synchronization
All kernel launches are asynchronous
control returns to CPU immediately
kernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete
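These rules can be illustrated with a short sketch (the kernel name and buffer names are illustrative):

```cuda
// Asynchronous: control returns to the CPU immediately
kernel<<<grid, block>>>(d_a);

// The CPU is free to do independent work here

// Blocks until the kernel (and all previous CUDA calls) complete
cudaThreadSynchronize();

// Synchronous: the copy starts after all previous CUDA calls have
// completed, and control returns only after the copy finishes
cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );
```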
Example: Host Code
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);
A host CPU thread can only have one context current at a time
CUDA libraries
BLAS
FFT
CUBLAS
Implementation of BLAS (Basic Linear Algebra Subprograms) on top
of CUDA driver
Self-contained at the API level; no direct interaction with the CUDA
driver
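As an illustration, a SAXPY (y = alpha*x + y) sketch using the legacy CUBLAS API of this CUDA generation; the host vectors `h_x` and `h_y` are assumed to be allocated and filled elsewhere:

```cuda
#include "cublas.h"

int n = 1024;
float *h_x, *h_y;        // host vectors (assumed filled elsewhere)
float *d_x, *d_y;        // device vectors

cublasInit();                                        // initialize CUBLAS
cublasAlloc(n, sizeof(float), (void**)&d_x);         // allocate on the device
cublasAlloc(n, sizeof(float), (void**)&d_y);

cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);   // copy host -> device
cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);                // y = 2*x + y on the GPU

cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);   // copy result back

cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();
```

Note that no cudaMalloc/cudaMemcpy calls appear: CUBLAS provides its own allocation and transfer helpers, which is what "self-contained at the API level" means.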