Sun 2009
3. Parallelization method and issues

To demonstrate the parallelization process and its effect, the count sort algorithm is divided into four steps, as shown in Table 1. Steps 1 and 3 occupy only a small fraction of the running time of the whole algorithm, thanks to the fast direct addressing mode of the CPU, while steps 2 and 4 are the most time-consuming operations: they use indirect addressing instructions several times, and the complex caching mechanism of the CPU does not work well in this situation, which severely degrades the performance of the algorithm.
Table 1. Count sort steps and corresponding proportion in serial version
Steps  Statement                                   Function                              Quota
1      for (int i = 0; i <= K; i++) c[i] = 0;      initialize count array                3%
2      for (int i = 0; i < n; i++) c[r[a[i]]]++;   count occurrences of each rank value  18%
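Assembled from the statements in Table 1, the serial algorithm can be sketched as follows. The statements for steps 3 and 4 are our reconstruction of standard count sort (using an exclusive prefix sum, matching the scan variant used later in the paper) and are not taken verbatim from the table:

```cpp
#include <cassert>
#include <vector>

// Serial count sort sketch: orders a[0..n-1] by rank r[a[i]], ranks in [0, K].
// Steps 1 and 2 are the statements from Table 1; steps 3 and 4 are a standard
// reconstruction and may differ in detail from the authors' code.
std::vector<int> count_sort(const std::vector<int>& a,
                            const std::vector<int>& r, int K) {
    int n = (int)a.size();
    std::vector<int> c(K + 1), b(n);
    for (int i = 0; i <= K; i++) c[i] = 0;        // step 1: initialize count array
    for (int i = 0; i < n; i++) c[r[a[i]]]++;     // step 2: count rank value occurrences
    int sum = 0;                                  // step 3: exclusive prefix sums
    for (int i = 0; i <= K; i++) { int t = c[i]; c[i] = sum; sum += t; }
    for (int i = 0; i < n; i++)                   // step 4: scatter into sorted order
        b[c[r[a[i]]]++] = a[i];
    return b;
}
```

Because step 4 scans the input left to right, this serial version is stable: elements with equal rank keep their input order.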
After analyzing the serial algorithm, the key parallel kernel design philosophy is clear: maximize the number of parallel threads in order to hide memory access latency. In addition, the indirect addressing instructions in steps 2 and 4 make the fast shared memory inapplicable in this scenario, so high-latency global memory is the only choice; its latency can likewise be hidden by the high parallelism of the kernel code. When many threads increment the count array, threads carrying the same rank value conflict with each other, as shown in Fig. 3, so read-modify-write atomic operations are necessary; these are supported by CUDA as a device runtime component [10].
Fig. 3. Memory access conflicts: eight threads execute c[r[a[i]]]++ in parallel on an initially zero count array; threads whose elements share a rank value (r[] = 0 0 2 2 4 4 6 6) collide on the same count array entry.
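The need for read-modify-write atomics can be illustrated on the host with std::atomic, a rough CPU-side analogue of CUDA's atomicAdd. This multithreaded histogram is our illustration (the function name is ours), not the paper's kernel code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogue of the step 2 kernel: one thread per element executes
// c[r[a[i]]]++. With plain int counters, threads sharing a rank value could
// lose updates; std::atomic<int> makes each increment an indivisible
// read-modify-write, the role atomicAdd plays in the CUDA kernel.
std::vector<int> parallel_count(const std::vector<int>& a,
                                const std::vector<int>& r, int K) {
    std::vector<std::atomic<int>> c(K + 1);           // zero-initialized counts
    std::vector<std::thread> pool;
    for (size_t i = 0; i < a.size(); i++)
        pool.emplace_back([&, i] { c[r[a[i]]]++; });  // concurrent increments
    for (auto& t : pool) t.join();
    return std::vector<int>(c.begin(), c.end());      // copy out plain ints
}
```

As the paper notes, the atomics make the counts correct but serialize conflicting threads, which is exactly why many duplicate rank values hurt performance.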
3.1. Parallelization process and method

Step 1 is the simplest code block to parallelize; it is a typical scatter operation in GPGPU programming, i.e., each thread sets one count array element to zero. This zeroing runs very fast on the GPU if we set the number of threads equal to the size of the count array c.

Steps 2 and 4 apply the same parallelization technique, atomic operations, since memory access conflicts occur as shown in Fig. 3. In CUDA, atomic functions are only supported when the maximal number of active threads is less than or equal to the product of the physical multiprocessor count and the maximal number of threads per block, so for a large input array the atomic accesses to the count array must be done in segments. Step 2 needs one atomic add operation, but step 4 needs two atomic operations: first an atomic add whose return value, the original value of the count array element, is stored in the local memory of each thread, and then an atomic assignment of the value of a[i] to b[t], where t is the stored private value. Atomic accesses to the same count array element cause a performance loss, since conflicting threads have to be serialized. If the number of distinct rank values in r is very small, the overall efficiency of the parallel algorithm is severely degraded. The last important issue that needs to be mentioned is that if duplicate rank values exist, the parallelized sort block is not stable, i.e., the relative order of elements with equal rank values is not kept, due to the scheduling nondeterminism of parallel threads.

Step 3 is the all-prefix-sums operation, which seems inherently sequential but for which there is an efficient parallel algorithm [11]. The all-prefix-sums operation is defined as follows in [11]:

Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an-1]
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)].

This type of all-prefix-sums is commonly known as an inclusive scan; the other type is commonly known as an exclusive scan [11].

Definition: The exclusive scan operation takes a binary associative operator ⊕ with an identity element I and an array of n elements
[a0, a1, ..., an-1]
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].

In this paper, we employ the CUDA exclusive scan code implemented by Mark Harris [12]. The scan algorithm consists of two phases: the up-sweep phase and the down-sweep phase. The up-sweep phase only computes partial sums (it performs n-1 adds) [11]; the down-sweep phase uses the partial sums and swap operations to build the exclusive scan (it performs n-1 adds and n-1 swaps) [11]. The algorithm works well for scanning an array inside a thread block, with the maximal number of elements limited to 1024 by the per-block thread limit. Fast shared memory is used in both phases for data locality within a block. For an array of arbitrary (non-power-of-two) size, the algorithm is first applied recursively to each scan block to generate an array of block increments, and then each scan block is uniformly incremented by its block increment [11]. This scan algorithm performs O(n) operations with great parallelism, and it is therefore more efficient than the serial code for large arrays. For small arrays, executing it on the GPU is less efficient than on the CPU, but it avoids the slow memory copy from device to host, so it is an appropriate choice for our parallel count sort implementation. In addition, the scan algorithm should more precisely be called an unsegmented scan, since it includes recursive host function calls, i.e., not all parts of the code run on the GPU. The fully parallel implementation is called a segmented scan [13], which is not of interest here, since it is several times slower than the unsegmented scan for large arrays and occupies more memory space [13].

3.2. Implementation issues

Since we define three different block and thread dimension sizes for step 1, step 3, and steps 2 and 4, and since steps 2 and 4 are not consecutive code blocks, each step is implemented separately for clarity and efficiency. The kernels are zero-count-array, count-each-rank-value-occurrences, exclusive-prefix-sum, and sort. The count array c used in these steps exists only in device memory; only the sort result b needs to be transferred back to host memory. Although several steps form the whole parallel sort algorithm, the kernel call cost is so low that it can be neglected, so the split does not decrease the overall performance and makes it easy to analyze the performance gain of each step.

Steps 2 and 4 are the most time-consuming parts of the algorithm, and both need atomic operations. Atomicity is only maintained among the threads that can be scheduled by the physical GPU, i.e., we must know the multiprocessor count MN of the GPU running the program and the maximal number of threads per multiprocessor TN; both can be obtained at run time through the device management functions [14]. The maximal number of threads per iteration, ITN, is then defined as

ITN = MN * TN

For input data of large size n, at least n/ITN (rounded up) iterations are executed to accomplish the two steps. Since the count sort algorithm deals with a one-dimensional array, we only use the x dimension of the 2D thread index, which is set equal to TN. Two dimensions of the 3D block index are used
to scale up the thread number for large ITN. The ITN guarantees full utilization of the underlying GPU and the atomicity of the operations. The exclusive scan codes [11] are modified to make use of two dimensions of the 3D block index for the same reason. The same scaling method is also applied to large input data in step 1 to achieve the highest parallelism, since step 1 is the only code block that runs entirely on the GPU.

Since the stable serial sort and the unstable parallel sort return different result arrays, we developed a tailored program to test the correctness of the two results instead of using the standard CUDA comparison API. For brevity, we do not discuss it in this paper.

4. Experiment results

We tested the serial and parallel solutions on a workstation with an AMD Athlon 64 X2 3800+ 2.00 GHz processor and an NVidia GeForce 9600GSO graphics card. The 9600GSO has 256 MB of on-board RAM with a 126-bit memory interface and a G82 with 12 multiprocessors at 1.45 GHz. At the time of writing, the retail price of the 9600GSO video card is about $70, and that of the AMD Athlon 64 X2 CPU is about $35. The machine ran Microsoft Windows XP Professional Edition SP3 with CUDA 2.0 and Microsoft Visual Studio 2005. The rank values are integers generated randomly by the standard C++ pseudo-random function, and the number of duplicate rank values is recorded to show the memory conflict rate. The relative performance of count sort is first measured by comparing the total run time of the GPU and CPU versions with about half of the memory accesses conflicting, as shown in Table 2.
Table 2. Comparison of experiment results
#elements  #duplicates  CPU Time(ms)  GPU Time(ms)  Speedup
10000      4943         0.55          0.79          0.70
The maximal speedup of the GPU version over the CPU version is about 8 times for large input sizes. For small genome sizes, the GPU implementation is slower than the CPU implementation, since the startup overheads of the kernel calls are relatively high and the small data size prevents full utilization of the GPU computing power. The number of duplicate rank values also has a strong impact on the data parallel count sort algorithm, since duplicates reduce the parallelism of the threads. For instance, with 5000000 input data elements and 2507557 duplicates, changing the number of duplicate rank values to 0 and to 4999999 alters the parallel execution time to 93.99 ms and 1607.467 ms, respectively.

To analyze the performance gain of each step in the parallel version, the time proportion of each step is given in Table 3. The exclusive prefix sum step occupies about half of the total time, which is quite different from the serial code. A more efficient algorithm should be developed to remove this performance bottleneck, which forms the critical path of the whole sort algorithm. Surprisingly, the sort step executes much faster than the count step, though they share almost the same quota in the parallel code. The speedup of the sort step reveals that the GPU handles complicated indirect memory accesses better than the CPU, since the GPU lacks the complex caching mechanism of the CPU, which may misbehave under complicated memory access patterns. The initialization step gets the maximal acceleration thanks to the broadcast-enabled architecture of the GPU, which provides high bandwidth for memory access.
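The up-sweep/down-sweep structure of the step 3 exclusive scan can be sketched serially as below. This is a plain C++ rendering of the work-efficient algorithm of [11], not the CUDA kernel of [12], and it assumes a power-of-two input length:

```cpp
#include <cassert>
#include <vector>

// Work-efficient exclusive scan (two-phase algorithm of [11]), serial sketch
// for a power-of-two length n. On the GPU, each inner-loop iteration becomes
// one thread; here they simply run in sequence.
std::vector<int> exclusive_scan(std::vector<int> a) {
    int n = (int)a.size();
    // Up-sweep: build partial sums in a balanced tree (n-1 adds).
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // Down-sweep: clear the root, then push sums back down with swaps
    // (n-1 adds and n-1 swaps).
    a[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];     // swap old left child up
            a[i] += t;           // accumulate prefix on the right
        }
    return a;
}
```

At each tree level the inner-loop iterations are independent, which is what gives the GPU version its parallelism; the serialization across the log n levels is why step 3 remains the bottleneck in Table 3.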
Table 3. Speedup of each parallel step
Steps  Function                      Quota   Speedup
1      initialize count array        <0.1%   >100
2      count rank value occurrences  28.1%   5.1
3      exclusive prefix sums         41.9%   1.2
4      sort                          29.9%   19.6

5. Conclusion

Count sort is a simple and powerful building block for a broad range of applications. In this paper we described an efficient data parallel implementation of the algorithm on the CUDA platform. The parallel code achieves a significant performance gain over the sequential implementation on the CPU, which shows that many kinds of existing programs can be readily ported to the GPGPU domain with higher performance and lower cost [15].

Our implementation also reveals an intrinsic drawback of the thread parallel computing model: the nondeterminism of thread execution. The parallel thread scheduling mechanism of CUDA cannot guarantee the execution order of the threads, which turns the count sort algorithm from stable to unstable. But for the many applications that do not need a stable sort, the data parallel code still serves as an efficient, off-the-shelf building block. In the future, we will test the parallel count sort code on more complex problems (suffix array construction, exact repeats finding, etc.).

References

[1] Manber U., Myers G., "Suffix arrays: A new method for on-line string searches", SIAM Journal of Computing 22(5), 1993, 935-948.
[2] Weigong Sun, Zongmin Ma, "A fast exact repeats search algorithm for genome analysis", In Proc. 9th International Conference on Hybrid Intelligent Systems (HIS-2009), IEEE Computer Society Press, 2009, 427-430.
[3] Abouelhoda M.I., Kurtz S., Ohlebusch E., "The enhanced suffix array and its applications to genome analysis", In Proc. 2nd Workshop on Algorithms in Bioinformatics, Springer, 2002, 449-463.
[4] Burrows M., Wheeler D.J., "A block-sorting lossless data compression algorithm", Technical Report 124, Digital Systems Research Center, Palo Alto, California, 1994.
[5] Kärkkäinen J., Sanders P., "Simple linear work suffix array construction", In Proc. 30th International Conference on Automata, Languages and Programming, Springer, 2003, 943-955.
[6] Puglisi S.J., Smyth W.F., Turpin A., "A taxonomy of suffix array construction algorithms", ACM Computing Surveys 39(2), 2007, 1-31.
[7] Fabian Kulla, Peter Sanders, "Scalable parallel suffix array construction", Parallel Computing 33(9), 2007, 605-612.
[8] Owens J., Luebke D., Govindaraju N., Harris M., Kruger J., Lefohn A., Purcell T., "A survey of general-purpose computation on graphics hardware", Computer Graphics Forum 26(1), 2007, 80-113.
[9] NVIDIA Tesla Personal Supercomputer, [http://www.nvidia.com/object/personal_supercomputing.html].
[10] CUDA Programming Guide Version 2.0, Technical report, NVIDIA Corporation, 2008.
[11] Guy E. Blelloch, "Prefix sums and their applications", In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993, 35-60.
[12] Harris M., Sengupta S., Owens J.D., "Parallel prefix sum (scan) with CUDA", In GPU Gems 3, Nguyen H. (Ed.), Addison Wesley, 2007, 851-876.
[13] Sengupta S., Harris M., Yao ZH., Owens J.D., "Scan primitives for GPU computing", Graphics Hardware, 2007, 97-106.
[14] NVIDIA CUDA Compute Unified Device Architecture Reference Manual Version 2.0, Technical report, NVIDIA Corporation, 2008.
[15] Nickolls J., Buck I., Garland M., Skadron K., "Scalable parallel programming with CUDA", ACM Queue 6(2), 2008, 40-53.