PORPLE: An extensible optimizer for portable data placement on GPU

G Chen, B Wu, D Li, X Shen - 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014 - ieeexplore.ieee.org
GPUs are often equipped with complex memory systems, including global memory, texture memory, shared memory, constant memory, and various levels of cache. Where to place the data is important for the performance of a GPU program. However, the decision is difficult for a programmer to make because of architecture complexity and the sensitivity of suitable data placements to input and architecture changes. This paper presents PORPLE, a portable data placement engine that enables a new way to solve the data placement problem. PORPLE consists of a mini specification language, a source-to-source compiler, and a runtime data placer. The language allows an easy description of a memory system; the compiler transforms a GPU program into a form amenable to runtime profiling and data placement; the placer, based on the memory description and data access patterns, identifies on the fly appropriate placement schemes for data and places them accordingly. PORPLE is distinctive in being adaptive to program inputs and architecture changes, being transparent to programmers (in most cases), and being extensible to new memory architectures. Our experiments on three types of GPU systems show that PORPLE is able to consistently find optimal or near-optimal placement despite the large differences among GPU architectures and program inputs, yielding up to 2.08X (1.59X on average) speedups on a set of regular and irregular GPU benchmarks.
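To make the placement problem concrete, the CUDA sketch below (not PORPLE's own code; kernel names, table size, and access pattern are hypothetical) shows the same read-only lookup table placed in two different memories that PORPLE's abstract mentions: constant memory versus global memory read through the read-only cache.

// Hypothetical illustration of GPU data placement, not taken from the paper.
// The same read-only table is placed in constant memory (placement A) or in
// global memory read via the read-only data cache (placement B, sm_35+).
#include <vector>
#include <cuda_runtime.h>

#define TABLE_SIZE 256

// Placement A: constant memory, served by the constant cache; works best when
// all threads of a warp read the same entry (broadcast access).
__constant__ float table_const[TABLE_SIZE];

__global__ void lookup_constant(const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table_const[idx[i]];
}

// Placement B: global memory, read through the read-only cache via __ldg;
// usually better when threads of a warp read scattered (irregular) entries.
__global__ void lookup_global(const float* __restrict__ table_glob,
                              const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __ldg(&table_glob[idx[i]]);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_table(TABLE_SIZE, 1.0f);
    std::vector<int>   h_idx(n);
    for (int i = 0; i < n; ++i) h_idx[i] = i % TABLE_SIZE;  // made-up access pattern

    float *d_table, *d_out;
    int   *d_idx;
    cudaMalloc(&d_table, TABLE_SIZE * sizeof(float));
    cudaMalloc(&d_out,   n * sizeof(float));
    cudaMalloc(&d_idx,   n * sizeof(int));
    cudaMemcpy(d_table, h_table.data(), TABLE_SIZE * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx,   h_idx.data(),   n * sizeof(int),            cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(table_const, h_table.data(), TABLE_SIZE * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;
    lookup_constant<<<blocks, threads>>>(d_idx, d_out, n);          // placement A
    lookup_global<<<blocks, threads>>>(d_table, d_idx, d_out, n);   // placement B
    cudaDeviceSynchronize();

    cudaFree(d_table); cudaFree(d_out); cudaFree(d_idx);
    return 0;
}

Which placement wins depends on the access pattern, the input, and the size of the relevant caches on the target GPU; this input- and architecture-sensitivity is exactly what PORPLE's runtime placer is designed to handle automatically.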