Parallelization Techniques for Heterogeneous Multicores with Applications
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Sodsong, Wasuwee
Abstract:
In the past decade, graphics processing units (GPUs) have gained widespread use as general-purpose hardware accelerators. Equipped with several thousand cores, GPUs are well suited to data-intensive operations. Although a GPU provides a vast amount of raw, parallel compute power, fully utilizing this hardware resource is a daunting task. Doing so requires an in-depth understanding of the GPU architecture, both to exploit the software-exposed GPU memory hierarchy and to mitigate main-memory latencies. Because a GPU lacks complex control units, it under-performs on tasks with complex control flow; control-flow-intensive operations are thus computed more efficiently on a CPU. Conversely, CPUs have comparatively few ALUs and thus under-perform on data-intensive operations. In practice, applications are composed of a mix of data-intensive operations and operations with complex control flow. Heterogeneous computing aims at utilizing both the CPU and the GPU of a system, offering the advantage of leveraging the key strengths of both architectures while mitigating their weaknesses. This thesis proposes code partitioning, which considers application characteristics and the capabilities of the underlying hardware to assign computations to either the CPU or the GPU. Dynamic scheduling techniques are proposed to leverage pipeline parallelism and load-balance the workload on a heterogeneous architecture. The proposed code-partitioning technique is applied to two major applications: JPEG decompression and Kronecker algebra operations. The entropy-decoding step of JPEG decompression is difficult to parallelize because codewords are of variable length, and the start position of a codeword in the bitstream is not known until the previous codeword has been decoded. The remaining JPEG decoding steps are compute-intensive with few dependencies.
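The sequential nature of entropy decoding can be illustrated with a minimal sketch. This is not the thesis's decoder; the prefix-free code table below is a toy example chosen for brevity, but it exhibits the same property: the start of codeword i+1 is only known once codeword i has been fully decoded, which serializes the loop.

```python
# Toy prefix-free code table (hypothetical, for illustration): bits -> symbol.
CODE_TABLE = {"0": "A", "10": "B", "110": "C", "111": "D"}

def entropy_decode(bits: str) -> list:
    """Decode a bitstream of variable-length codewords, one after another."""
    symbols = []
    pos = 0
    while pos < len(bits):
        # Grow the candidate codeword until it matches a table entry.
        end = pos + 1
        while bits[pos:end] not in CODE_TABLE:
            end += 1
            if end > len(bits):
                raise ValueError("truncated bitstream")
        symbols.append(CODE_TABLE[bits[pos:end]])
        pos = end  # the next start position depends on this decode
    return symbols
```

For example, `entropy_decode("0101100")` yields `["A", "B", "C", "A"]`; no iteration of the outer loop can begin before the previous one has finished, which is why this stage is assigned to the CPU rather than the GPU.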
Similarly, Kronecker algebra, which has been shown to be effective for static program analysis, consists of data-intensive matrix operations. However, it has cross-iteration dependencies, such as the bookkeeping of visited nodes, which are unsuitable for GPU computing. Despite the potential for improvement on a heterogeneous system, the dominance of the JPEG format, and the usefulness of Kronecker algebra, no prior approaches combine the strengths of a system's CPU and GPU. We investigate parallelization strategies that use heterogeneous multicores for JPEG decompression and Kronecker algebra. We propose algorithm-specific optimizations that minimize the known sequential bottlenecks. Our code-partitioning and scheduling scheme exploits task, data, and pipeline parallelism. We introduce an offline profiling step that determines the performance of a system's CPU and GPU so that workloads are distributed accordingly. These applications are evaluated on several heterogeneous platforms, including an embedded system (for JPEG decompression). From the lessons learned, parallel software design patterns for heterogeneous computing have been distilled and applied to the two major applications of this thesis.
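The profiling-driven distribution described above can be sketched as follows. This is an illustrative sketch under assumed names (`split_workload`, and the throughput figures), not the thesis's actual scheduler: an offline step measures each device's throughput on a sample workload, and work is then split in proportion to the measured rates.

```python
def split_workload(total_items: int, cpu_rate: float, gpu_rate: float):
    """Split total_items between CPU and GPU in proportion to the
    throughputs (items/second) measured during offline profiling.
    Returns (cpu_share, gpu_share)."""
    gpu_share = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    return total_items - gpu_share, gpu_share
```

For instance, if offline profiling measured the GPU at three times the CPU's throughput, `split_workload(1000, 1.0, 3.0)` assigns 250 items to the CPU and 750 to the GPU, so that both devices finish at roughly the same time.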
Date: 2017-08-24
Licence: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering and Information Technologies, School of Information Technologies
Awarding institution: The University of Sydney; Yonsei University, Korea