Abstract
Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming
Multiprocessors (SMs), each of which consists of a few tens of SIMD cores. A kernel is launched on the GPU with an execution
configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated
to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads,
called warps. For various reasons, such as differing amounts of work and memory access latencies, the warps of a TB may
finish kernel execution at different points in time, causing the faster warps to wait for their slower sibling warps. This, in
effect, reduces the utilization of SM resources and hence the performance of the GPU.
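For concreteness, a minimal CUDA launch illustrating these terms (the kernel and the sizes below are illustrative, not taken from the evaluated benchmarks):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread processes one element.
__global__ void scale(float *data, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main() {
    const unsigned int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Execution configuration (grid): 4096 thread blocks of 256 threads.
    // Each TB is allocated to an SM as a unit and executed there as
    // 256 / 32 = 8 warps of 32 consecutive threads.
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```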
We propose a simple and elegant technique to eliminate the waiting time of warps at the end of kernel execution and
thereby improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual warps, and
enables warps that finish earlier to execute the kernel again for another logical (user-specified) thread block, without waiting
for their sibling warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual warps.
Further, this technique enables us to design a warp scheduling algorithm that is aware of the progress made by the virtual
thread blocks and virtual warps, and uses this knowledge to prioritise warps effectively. Evaluation of a diverse set of
kernels from the Rodinia, Parboil, and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean
improvement of 1.06x over a baseline architecture that uses the Greedy Then Oldest (GTO) warp scheduler and 1.09x over the Loose
Round Robin (LRR) warp scheduler.