Sun 2009
3. Parallelization method and issues

To demonstrate the parallelization process and its effect, the count sort algorithm is divided into four steps, as shown in Table 1. Steps 1 and 3 occupy only a small fraction of the running time of the whole algorithm, thanks to the fast direct addressing mode of the CPU, while steps 2 and 4 are the most time-consuming operations: they use indirect addressing instructions several times, and the complex caching mechanism of the CPU does not work well in this situation, which severely degrades the performance of the algorithm.
Table 1. Count sort steps and corresponding proportion in serial version
Steps  Statement                                   Function                              Quota
1      for (int i = 0; i <= K; i++) c[i] = 0;      initialize count array                3%
2      for (int i = 0; i < n; i++) c[r[a[i]]]++;   count occurrences of each rank value  18%
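Assembled from the statements in Table 1, the serial algorithm can be sketched as follows. The statements for steps 3 and 4 are our reconstruction of standard count sort (using an exclusive prefix sum, matching the scan variant used later in the paper) and are not taken verbatim from the table:

```cpp
#include <cassert>
#include <vector>

// Serial count sort sketch: orders a[0..n-1] by rank r[a[i]], ranks in [0, K].
// Steps 1 and 2 are the statements from Table 1; steps 3 and 4 are a standard
// reconstruction and may differ in detail from the authors' code.
std::vector<int> count_sort(const std::vector<int>& a,
                            const std::vector<int>& r, int K) {
    int n = (int)a.size();
    std::vector<int> c(K + 1), b(n);
    for (int i = 0; i <= K; i++) c[i] = 0;        // step 1: initialize count array
    for (int i = 0; i < n; i++) c[r[a[i]]]++;     // step 2: count rank value occurrences
    int sum = 0;                                  // step 3: exclusive prefix sums
    for (int i = 0; i <= K; i++) { int t = c[i]; c[i] = sum; sum += t; }
    for (int i = 0; i < n; i++)                   // step 4: scatter into sorted order
        b[c[r[a[i]]]++] = a[i];
    return b;
}
```

Because step 4 scans the input left to right, this serial version is stable: elements with equal rank keep their input order.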
After analyzing the serial algorithm, the key parallel kernel design philosophy is clear: maximize the number of parallel threads in order to hide memory access latency. In addition, the indirect addressing instructions in steps 2 and 4 make the fast shared memory inapplicable in this scenario, so high-latency global memory is the only choice; its latency can likewise be hidden by the high parallelism of the kernel code. When many threads increment the count array, threads carrying the same rank value conflict with each other, as shown in Fig. 3, so read-modify-write atomic operations are necessary; these are supported by CUDA as a device runtime component [10].
Fig. 3. Memory access conflicts: eight threads execute c[r[a[i]]]++ in parallel on an initially zero count array; threads whose elements share a rank value (r[] = 0 0 2 2 4 4 6 6) collide on the same count array entry.
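The need for read-modify-write atomics can be illustrated on the host with std::atomic, a rough CPU-side analogue of CUDA's atomicAdd. This multithreaded histogram is our illustration (the function name is ours), not the paper's kernel code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogue of the step 2 kernel: one thread per element executes
// c[r[a[i]]]++. With plain int counters, threads sharing a rank value could
// lose updates; std::atomic<int> makes each increment an indivisible
// read-modify-write, the role atomicAdd plays in the CUDA kernel.
std::vector<int> parallel_count(const std::vector<int>& a,
                                const std::vector<int>& r, int K) {
    std::vector<std::atomic<int>> c(K + 1);           // zero-initialized counts
    std::vector<std::thread> pool;
    for (size_t i = 0; i < a.size(); i++)
        pool.emplace_back([&, i] { c[r[a[i]]]++; });  // concurrent increments
    for (auto& t : pool) t.join();
    return std::vector<int>(c.begin(), c.end());      // copy out plain ints
}
```

As the paper notes, the atomics make the counts correct but serialize conflicting threads, which is exactly why many duplicate rank values hurt performance.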
3.1. Parallelization process and method

Step 1 is the simplest code block to parallelize; it is a typical scatter operation in GPGPU programming, i.e., each thread sets one count array element to zero. This zeroing runs very fast on the GPU if we set the number of threads equal to the size of the count array c.

Steps 2 and 4 apply the same parallelization technique, atomic operations, since memory access conflicts occur as shown in Fig. 3. In CUDA, atomic functions are only supported when the maximal number of active threads is less than or equal to the product of the physical multiprocessor count and the maximal number of threads per block, so for a large input array the atomic accesses to the count array must be done in segments. Step 2 needs one atomic add operation, but step 4 needs two atomic operations: first an atomic add whose return value, the original value of the count array element, is stored in the local memory of each thread, and then an atomic assignment of the value of a[i] to b[t], where t is the stored private value. Atomic accesses to the same count array element cause a performance loss, since conflicting threads have to be serialized. If the number of distinct rank values in r is very small, the overall efficiency of the parallel algorithm is severely degraded. The last important issue that needs to be mentioned is that if duplicate rank values exist, the parallelized sort block is not stable, i.e., the relative order of elements with equal rank values is not kept, due to the scheduling nondeterminism of parallel threads.

Step 3 is the all-prefix-sums operation, which seems inherently sequential but for which there is an efficient parallel algorithm [11]. The all-prefix-sums operation is defined as follows in [11]:

Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an-1]
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)].

This type of all-prefix-sums is commonly known as an inclusive scan; the other type is commonly known as an exclusive scan [11].

Definition: The exclusive scan operation takes a binary associative operator ⊕ with an identity element I and an array of n elements
[a0, a1, ..., an-1]
and returns the array
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].

In this paper, we employ the CUDA exclusive scan code implemented by Mark Harris [12]. The scan algorithm consists of two phases: the up-sweep phase and the down-sweep phase. The up-sweep phase only computes partial sums (it performs n-1 adds) [11]; the down-sweep phase uses the partial sums and swap operations to build the exclusive scan (it performs n-1 adds and n-1 swaps) [11]. The algorithm works well for scanning an array inside a thread block, with the maximal number of elements limited to 1024 by the per-block thread limit. Fast shared memory is used in both phases for data locality within a block. For an array of arbitrary (non-power-of-two) size, the algorithm is first applied recursively to each scan block to generate an array of block increments, and then each scan block is uniformly incremented by its block increment [11]. This scan algorithm performs O(n) operations with great parallelism, and it is therefore more efficient than the serial code for large arrays. For small arrays, executing it on the GPU is less efficient than on the CPU, but it avoids the slow memory copy from device to host, so it is an appropriate choice for our parallel count sort implementation. In addition, the scan algorithm should more precisely be called an unsegmented scan, since it includes recursive host function calls, i.e., not all parts of the code run on the GPU. The fully parallel implementation is called a segmented scan [13], which is not of interest here, since it is several times slower than the unsegmented scan for large arrays and occupies more memory space [13].

3.2. Implementation issues

Since we define three different block and thread dimension sizes for step 1, step 3, and steps 2 and 4, and since steps 2 and 4 are not consecutive code blocks, each step is implemented separately for clarity and efficiency. The kernels are zero-count-array, count-each-rank-value-occurrences, exclusive-prefix-sum, and sort. The count array c used in these steps exists only in device memory; only the sort result b needs to be transferred back to host memory. Although several steps form the whole parallel sort algorithm, the kernel call cost is so low that it can be neglected, so the split does not decrease the overall performance and makes it easy to analyze the performance gain of each step.

Steps 2 and 4 are the most time-consuming parts of the algorithm, and both need atomic operations. Atomicity is only maintained among the threads that can be scheduled by the physical GPU, i.e., we must know the multiprocessor count MN of the GPU running the program and the maximal number of threads per multiprocessor TN; both can be obtained at run time through the device management functions [14]. The maximal number of threads per iteration, ITN, is then defined as

ITN = MN * TN

For input data of large size n, at least n/ITN (rounded up) iterations are executed to accomplish the two steps. Since the count sort algorithm deals with a one-dimensional array, we only use the x dimension of the 2D thread index, which is set equal to TN. Two dimensions of the 3D block index are used
to scale up the thread number for large ITN. The ITN guarantees full utilization of the underlying GPU and the atomicity of the operations. The exclusive scan codes [11] are modified to make use of two dimensions of the 3D block index for the same reason. The same scaling method is also applied to large input data in step 1 to achieve the highest parallelism, since step 1 is the only code block that runs entirely on the GPU.

Since the stable serial sort and the unstable parallel sort return different result arrays, we developed a tailored program to test the correctness of the two results instead of using the standard CUDA comparison API. For brevity, we do not discuss it in this paper.

4. Experiment results

We tested the serial and parallel solutions on a workstation with an AMD Athlon 64 X2 3800+ 2.00 GHz processor and an NVidia GeForce 9600GSO graphics card. The 9600GSO has 256 MB of on-board RAM with a 126-bit memory interface and a G82 with 12 multiprocessors at 1.45 GHz. At the time of writing, the retail price of the 9600GSO video card is about $70, and that of the AMD Athlon 64 X2 CPU is about $35. The machine ran Microsoft Windows XP Professional Edition SP3 with CUDA 2.0 and Microsoft Visual Studio 2005. The rank values are integers generated randomly by the standard C++ pseudo-random function, and the number of duplicate rank values is recorded to show the memory conflict rate. The relative performance of count sort is first measured by comparing the total run time of the GPU and CPU versions with about half of the memory accesses conflicting, as shown in Table 2.
Table 2. Comparison of experiment results
#elements  #duplicates  CPU Time(ms)  GPU Time(ms)  Speedup
10000      4943         0.55          0.79          0.70
The maximal speedup of the GPU version over the CPU version is about 8 times for large input sizes. For small genome sizes, the GPU implementation is slower than the CPU implementation, since the startup overheads of the kernel calls are relatively high and the small data size prevents full utilization of the GPU computing power. The number of duplicate rank values also has a strong impact on the data parallel count sort algorithm, since duplicates reduce the parallelism of the threads. For instance, with 5000000 input data elements and 2507557 duplicates, changing the number of duplicate rank values to 0 and to 4999999 alters the parallel execution time to 93.99 ms and 1607.467 ms, respectively.

To analyze the performance gain of each step in the parallel version, the time proportion of each step is given in Table 3. The exclusive prefix sum step occupies about half of the total time, which is quite different from the serial code. A more efficient algorithm should be developed to remove this performance bottleneck, which forms the critical path of the whole sort algorithm. Surprisingly, the sort step executes much faster than the count step, though they share almost the same quota in the parallel code. The speedup of the sort step reveals that the GPU handles complicated indirect memory accesses better than the CPU, since the GPU lacks the complex caching mechanism of the CPU, which may misbehave under complicated memory access patterns. The initialization step gets the maximal acceleration thanks to the broadcast-enabled architecture of the GPU, which provides high bandwidth for memory access.
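The up-sweep/down-sweep structure of the step 3 exclusive scan can be sketched serially as below. This is a plain C++ rendering of the work-efficient algorithm of [11], not the CUDA kernel of [12], and it assumes a power-of-two input length:

```cpp
#include <cassert>
#include <vector>

// Work-efficient exclusive scan (two-phase algorithm of [11]), serial sketch
// for a power-of-two length n. On the GPU, each inner-loop iteration becomes
// one thread; here they simply run in sequence.
std::vector<int> exclusive_scan(std::vector<int> a) {
    int n = (int)a.size();
    // Up-sweep: build partial sums in a balanced tree (n-1 adds).
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // Down-sweep: clear the root, then push sums back down with swaps
    // (n-1 adds and n-1 swaps).
    a[n - 1] = 0;
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];     // swap old left child up
            a[i] += t;           // accumulate prefix on the right
        }
    return a;
}
```

At each tree level the inner-loop iterations are independent, which is what gives the GPU version its parallelism; the serialization across the log n levels is why step 3 remains the bottleneck in Table 3.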
Table 3. Speedup of each parallel step
Steps  Function                      Quota   Speedup
1      initialize count array        <0.1%   >100
2      count rank value occurrences  28.1%   5.1
3      exclusive prefix sums         41.9%   1.2
4      sort                          29.9%   19.6

5. Conclusion

Count sort is a simple and powerful building block for a broad range of applications. In this paper we described an efficient data parallel implementation of the algorithm on the CUDA platform. The parallel code achieves a significant performance gain over the sequential implementation on the CPU, which shows that many kinds of existing programs can be readily ported to the GPGPU domain with higher performance and lower cost [15].

Our implementation also reveals an intrinsic drawback of the thread parallel computing model: the nondeterminism of thread execution. The parallel thread scheduling mechanism of CUDA cannot guarantee the execution order of the threads, which turns the count sort algorithm from stable to unstable. But for the many applications that do not need a stable sort, the data parallel code still serves as an efficient, off-the-shelf building block. In the future, we will test the parallel count sort code on more complex problems (suffix array construction, exact repeats finding, etc.).

References

[1] Manber U., Myers G., "Suffix arrays: A new method for on-line string searches", SIAM Journal of Computing 22(5), 1993, 935-948.
[2] Weigong Sun, Zongmin Ma, "A fast exact repeats search algorithm for genome analysis", In Proc. 9th International Conference on Hybrid Intelligent Systems (HIS-2009), IEEE Computer Society Press, 2009, 427-430.
[3] Abouelhoda M.I., Kurtz S., Ohlebusch E., "The enhanced suffix array and its applications to genome analysis", In Proc. 2nd Workshop on Algorithms in Bioinformatics, Springer, 2002, 449-463.
[4] Burrows M., Wheeler D.J., "A block-sorting lossless data compression algorithm", Technical Report 124, Digital Systems Research Center, Palo Alto, California, 1994.
[5] Kärkkäinen J., Sanders P., "Simple linear work suffix array construction", In Proc. 30th International Conference on Automata, Languages and Programming, Springer, 2003, 943-955.
[6] Puglisi S.J., Smyth W.F., Turpin A., "A taxonomy of suffix array construction algorithms", ACM Computing Surveys 39(2), 2007, 1-31.
[7] Fabian Kulla, Peter Sanders, "Scalable parallel suffix array construction", Parallel Computing 33(9), 2007, 605-612.
[8] Owens J., Luebke D., Govindaraju N., Harris M., Kruger J., Lefohn A., Purcell T., "A survey of general-purpose computation on graphics hardware", Computer Graphics Forum 26(1), 2007, 80-113.
[9] NVIDIA Tesla Personal Supercomputer, [http://www.nvidia.com/object/personal_supercomputing.html].
[10] CUDA Programming Guide Version 2.0, Technical report, NVIDIA Corporation, 2008.
[11] Guy E. Blelloch, "Prefix sums and their applications", In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993, 35-60.
[12] Harris M., Sengupta S., Owens J.D., "Parallel prefix sum (scan) with CUDA", In GPU Gems 3, Nguyen H. (Ed.), Addison Wesley, 2007, 851-876.
[13] Sengupta S., Harris M., Yao ZH., Owens J.D., "Scan primitives for GPU computing", Graphics Hardware, 2007, 97-106.
[14] NVIDIA CUDA Compute Unified Device Architecture Reference Manual Version 2.0, Technical report, NVIDIA Corporation, 2008.
[15] Nickolls J., Buck I., Garland M., Skadron K., "Scalable parallel programming with CUDA", ACM Queue 6(2), 2008, 40-53.