OpenCL Tutorial - Basics

OpenCL Tutorial

Guillermo Marcus


Part I
- OpenCL Overview
- Hello Vector
- Coffee Break

Part II
- Reduction
- Matrix Multiply

About me
Dr. Guillermo Marcus

- PhD in Computer Science from Heidelberg, 2011
- Head of the Scientific Computing Research Group until March 2013
- NVIDIA (OptiX Group) from May 2013
- Taught the ZITI Master Lecture in GPU Computing between 2011 and 2013

OpenCL Overview
- Standardized language to program accelerators
- C-based: the APIs and the GPU code are C or C-like
- Compiles at runtime
- Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera
- While code is portable, optimizations are not!

OpenCL Basics
- Application Model
- Execution Model
- Memory Model

Application Model
- Activities are driven by the host computer
- Multiple platforms, multiple devices possible
- IO is an important part of the model

GPU Kernels
- Starts a computation on the GPU
- "Launches" (starts) a collection of threads
- Requires code to execute AND a specification (how the threads are organized)
- Can be blocking or non-blocking

Work Item

Execution Model
Work Items
- Kernel code
- "Serial" execution thread
- Private variables

// Pseudocode: with 4 threads, each thread handles every 4th element
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];

Work Groups
- Synchronization inside the group
- Data sharing inside the group

Program Grid
- Collection of Work Groups
- No synchronization
- No data sharing

Work Items
- A single thread on the GPU
- They are normally executed as SIMT
- Thread code is the same for all work items
- Work items can have private variables
- Each has a unique ID inside the kernel

// Pseudocode: with 4 threads, each thread handles every 4th element
int a[N], b[N], c[N];
int i, tid;
tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];
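The strided loop above can be tried on the host without a GPU. The following is a minimal sketch, not the tutorial's own code: `add_strided` and `vector_add` are hypothetical names, the thread count is passed as a parameter instead of the hard-coded 4, and the emulated work items run one after another rather than concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the loop body of one work item: thread `tid` out of `num_threads`
// handles elements tid, tid + num_threads, tid + 2 * num_threads, ...
void add_strided(const std::vector<int>& a, const std::vector<int>& b,
                 std::vector<int>& c, std::size_t tid, std::size_t num_threads) {
    for (std::size_t i = tid; i < c.size(); i += num_threads) {
        c[i] = a[i] + b[i];
    }
}

// Runs every emulated work item in turn on the host.
// On a GPU the work items would execute concurrently.
std::vector<int> vector_add(const std::vector<int>& a, const std::vector<int>& b,
                            std::size_t num_threads) {
    assert(a.size() == b.size());
    assert(num_threads > 0);
    std::vector<int> c(a.size(), 0);
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        add_strided(a, b, c, tid, num_threads);
    }
    return c;
}
```

Because every index is visited by exactly one emulated thread, the result does not depend on the order in which the threads run, which is what makes the loop safe to execute in parallel.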

Single Instruction, Multiple Threads

Combines the flexibility of the thread model with the efficiency of the Single Instruction, Multiple Data architecture. Normally, there are many more threads than workers.
// Pseudocode: one thread per element, so no loop is needed
int a[N], b[N], c[N];
int tid;
tid = getThreadID();
c[tid] = a[tid] + b[tid];

[Figure: worker 1, worker 2, worker 3, worker 4]

Work Groups
Work Groups are collections of Work Items.
Items inside a Work Group ...
- are executed in parallel*
- share local data
- have a local ID
- can be organized as 1D, 2D, or 3D* arrays
Work Groups ...
- are independent of each other
- have a unique ID inside the kernel
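The relation between the group ID, the local ID, and the kernel-wide ID of a work item can be sketched in a few lines. In OpenCL C the corresponding built-ins are `get_group_id`, `get_local_id`, `get_local_size`, and `get_global_id`. The host-side helpers below (`WorkItemId`, `split_global_id`, `join_ids`) are hypothetical names used only for illustration, and the sketch assumes a 1D range with a zero global offset.

```cpp
#include <cassert>
#include <cstddef>

// Position of a work item: which group it is in, and where inside that group.
struct WorkItemId {
    std::size_t group;  // corresponds to get_group_id(0)
    std::size_t local;  // corresponds to get_local_id(0)
};

// Splits a 1D global ID into (group ID, local ID) for a given work-group size.
WorkItemId split_global_id(std::size_t global_id, std::size_t local_size) {
    assert(local_size > 0);
    return {global_id / local_size, global_id % local_size};
}

// Rebuilds the global ID: group * local_size + local (zero global offset assumed).
std::size_t join_ids(WorkItemId id, std::size_t local_size) {
    assert(id.local < local_size);
    return id.group * local_size + id.local;
}
```

With a work-group size of 64, for example, global ID 130 belongs to group 2 and has local ID 2.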

Program Grid
Work Groups are organized as a 1D, 2D, or 3D array.
Between Work Groups there is ...
- No communication
- No data synchronization
In fact, often there is not even data coherency between work groups!

Memory Model
- Hierarchical organization of areas: Host, Global, Local, Registers
- Moving data between areas is expensive
- Data coherency is not guaranteed at all times or across all areas
- Every area has its own set of constraints
- Controlled by attributes in the code definition

Memory Model Overview

Host Memory
- Main memory of the host computer
- Data can move only between the host and the GPU Global Memory
- Transfers are always initiated by the host and can be synchronous or asynchronous
- Bandwidth is limited by the PCIe links

Global Memory
- Main GPU memory, available to all threads
- Biggest in size, up to several GBs
- Huge bandwidth, but also huge latency
  - typically 400-800 cycles
  - not always cached
- Performance depends heavily on access patterns

Local Memory
- Available to all threads inside a Work Group
- Limited in size (typical: 8KB-64KB)
- Latency comparable to registers
- Constrained by access rules (e.g. bank conflicts), so performance depends on access patterns
- Used as a scratchpad or as a cache of global memory

GPU Registers
- Private to every thread
- Normally hidden: no direct access, allocation is optimized by the compiler
- Fastest access, constrained only by the number of available registers
- Some platforms may use more registers than others, depending on the hardware architecture

Constant Memory
- Read-only memory
- Cached
- Good for storing lookup tables and constant values
- Normally a small area of the global memory

Private Memory
- Unique to every Work Item
- Normally mapped first to registers, then to global memory when there are no more free registers

Kernel Specification
Defines the number and distribution of threads inside the kernel. A GPU program can be launched with different specifications, creating different kernels. The distribution is defined as global and local settings, defining the total number of threads, and the number of threads per work group, respectively, as well as their organization.

Global and Local Settings (1D)

// Create kernel specification (ND range)
NDRange global(VECT_SIZE);
NDRange local(1);

// Create kernel specification (ND range)
// Round the number of groups up so that 64*groups >= VECT_SIZE
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
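The rounding in the snippet above generalizes to any work-group size. The sketch below uses hypothetical helper names (`round_up_global`, `padding_items`) that are not part of the OpenCL API; it only shows the arithmetic. Because the global size can exceed the data size, a kernel launched this way would typically also need a bounds check such as `if (gid < n)` before touching memory.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of local_size that is >= n.
// Equivalent to the n/local + (n % local == 0 ? 0 : 1) form, times local_size.
std::size_t round_up_global(std::size_t n, std::size_t local_size) {
    assert(local_size > 0);
    std::size_t groups = (n + local_size - 1) / local_size;
    return groups * local_size;
}

// Number of extra work items that fall outside the data and must be masked out.
std::size_t padding_items(std::size_t n, std::size_t local_size) {
    return round_up_global(n, local_size) - n;
}
```

For a vector of 1000 elements and a local size of 64, this gives a global size of 1024, so 24 work items have nothing to do.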

Global and Local Settings (2D)

// Create kernel specification (ND range)
// Round each dimension up to a multiple of the local size (4 x 3)
int gX = X_SIZE/4 + ((X_SIZE % 4 == 0) ? 0 : 1);
int gY = Y_SIZE/3 + ((Y_SIZE % 3 == 0) ? 0 : 1);
NDRange global(gX*4, gY*3);
NDRange local(4,3);
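To see what the 2D padding means in practice, the launch can be emulated on the host. This is an illustrative sketch under stated assumptions: `Launch2D`, `padded_range`, and `count_active` are hypothetical names, and the nested loops stand in for the work items that a GPU would run in parallel. The guard inside the loop is the same bounds check a kernel would need.

```cpp
#include <cassert>
#include <cstddef>

// Global size of a 2D launch, in work items per dimension.
struct Launch2D {
    std::size_t global_x;
    std::size_t global_y;
};

// Rounds a size up to the next multiple of `local`.
std::size_t round_up(std::size_t n, std::size_t local) {
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

// Pads each dimension independently, as in the 4 x 3 snippet above.
Launch2D padded_range(std::size_t x_size, std::size_t y_size,
                      std::size_t local_x, std::size_t local_y) {
    return {round_up(x_size, local_x), round_up(y_size, local_y)};
}

// Visits every work item of the launch and counts those that pass the
// bounds check, i.e. those that map to a real data element.
std::size_t count_active(Launch2D g, std::size_t x_size, std::size_t y_size) {
    std::size_t active = 0;
    for (std::size_t gy = 0; gy < g.global_y; ++gy) {
        for (std::size_t gx = 0; gx < g.global_x; ++gx) {
            if (gx < x_size && gy < y_size) {
                ++active;
            }
        }
    }
    return active;
}
```

For a 10 x 7 problem with a 4 x 3 local size, the launch is padded to 12 x 9 = 108 work items, of which 70 are active.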

Basic built-in functions values

