article

CUDA-quicksort: an improved GPU-based implementation of quicksort

Authors:

Emanuele Manca,

Andrea Manconi,

Alessandro Orro,

Giuliano Armano,

Luciano MilanesiAuthors Info & Claims

Concurrency and Computation: Practice & Experience, Volume 28, Issue 1

Pages 21 - 43

https://doi.org/10.1002/cpe.3611

Published: 01 January 2016 Publication History

Abstract

Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General-purpose computing has been successfully used on Graphics Processing Units GPUs to parallelize some sorting algorithms. Two GPU-based implementations of the quicksort were presented in literature: the GPU-quicksort, a compute-unified device architecture CUDA iterative implementation, and the CUDA dynamic parallel CDP quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA-quicksort an iterative GPU-based implementation of the sorting algorithm. CUDA-quicksort has been designed starting from GPU-quicksort. Unlike GPU-quicksort, it uses atomic primitives to perform inter-block communications while ensuring an optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA-quicksort is up to four times faster than GPU-quicksort and up to three times faster than CDP-quicksort. An in-depth analysis of the performance between CUDA-quicksort and GPU-quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of using the CUDA dynamic parallelism, we implemented a recursive version of the CUDA-quicksort. Experimental results show that CUDA-quicksort is faster than the CDP-quicksort provided by NVIDIA, with better performance achieved using the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.

References

[1]

Cederman D, Tsigas P. A practical quicksort algorithm for graphics processors. In the ACM Journal of Experimental Algorithmics JEA 2009; Volume 4 Issue 14.

[2]

Hoare CAR. Quicksort. Computer Journal 1962; Volume 4 Issue 5: pp.10-15.

[3]

CUDA Toolkit 6.0, documentation. Available from: "http://docs.nvidia.com/cuda/cuda-samples/index.html" {accessed on 10 September 2014}.

[4]

NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 6.0, 2014.

[5]

Manconi A, Orro A, Manca E, Armano G, Milanesi L. GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads. PloS one, 95, Public Library of Science, pp. e97277, 2014.

[6]

Sengupta S, Harris M, Zhang Y, Owens JD. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, San Diego, California, 2007; pp.97-106.

Digital Library

[7]

Heidelberger P, Norton A, Robinson JT. Parallel quicksort using fetch-and-add. IEEE Transactions on Computers 1990; Volume 39 Issue 1: pp.133-138.

[8]

Tsigas P, Zhang Y. A simple, fast parallel implementation of quicksort and its performance evaluation on sun enterprise 10000. Proceedings of the 11th Euromicro Conference on Parallel Distributed and Network-based Processing, Genova, Italy, 2003; pp.372-381.

[9]

Parallel bitonic sort algorithm. Available from: "http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm" {accessed on 10 September 2014}.

[10]

Purcell J, Donner C, Cammarano M, Jensen HW, Hanrahan P. Photon mapping on programmable graphics hardware. Proceedings of the 2003 Annual ACM SIGGRAPH/Eurographics Conference on Graphics Hardware EGGH 03, Aire-la-Ville, Switzerland, 2003; pp.41-50.

[11]

Kapasi UJ, Dally WJ, Rixner S, Mattson PR, Owens JD, Khailany B. Efficient conditional operations for data-parallel architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture Micro-33, New York, USA, 2000; pp.159-170.

[12]

Kipfer P, Segal M, Westermann R. Uberow: a GPU-based particle engine. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Conference on Graphics Hardware HWWS '04, New York, USA, 2004; pp.115-122.

Digital Library

[13]

Kipfer P, Westermann R. GPU Gems 2: In Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Ch. Improved GPU sorting, 2005; pp.205-222.

[14]

Govindaraju NK, Raghuvanshi N, Manocha D. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data SIGMOD '05, New York, USA, 2005; pp.611-622.

Digital Library

[15]

Dowd M, Perl Y, Rudolph L, Saks M. The periodic balanced sorting network. Journal of the ACM 1989; Volume 36 Issue 4: pp.738-757.

[16]

Govindaraju NK, Raghuvanshi N, Henson M, Manocha D. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina, Chapel Hill, 2005.

[17]

Greβ A, Zachmann G. GPU-Abisort: Optimal Parallel Sorting on Stream Architectures. Proceedings of the 20th International Conference on Parallel and Distributed Processing, IPDPS'06. IEEE Computer Society, Washington, DC, USA, 2006; pp.45-45. Available from: "http://dl.acm.org/citation.cfm?id=1898953.1898980" {accessed on 10 September 2014}.

[18]

Sintorn E, Assarsson U. Fast parallel GPU-sorting using a hybrid algorithm. Journal Parallel Distributed Computing October 2008; Volume 68 Issue 10: pp.1381-1388. Available from: "http://dx.doi.org/10.1016/j.jpdc.2008.05.012" {accessed on 10 September 2014}.

[19]

Leischner N, Osipov V, Sanders P, 2009. Gpu Sample Sort. CoRR abs/0909.5649.

[20]

Harris M, Sengupta S, Owens JD. Parallel prex sum scan with CUDA. In H. Nguyen, Editor, GPU Gems 3. Addison Wesley, 2007.

[21]

Helman DR, Bader DA, Jaja J. A randomized parallel sorting algorithm with an experimental study. Journal Parallel Distributed Computing 1998; Volume 1 Issue 52: pp.1-23.

[22]

Hoberock J, Bell N. Thrust: A Parallel Template Library, 2010.

[23]

Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In Parallel Distributed Processing. IPDPS 2009. IEEE International Symposium on, Rome, Italy, May 2009; pp.1-10.

Digital Library

[24]

Manconi A, Manca E, Orro A, Armano G, Gnocchi M, Moscatelli M, Milanesi L. G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read Depth Methods. Front. Bioeng. Biotechnol, 2015. 3:28.

Cited By

Maltenberger TIlic ITolovski IRabl TIves ZBonifati AEl Abbadi A(2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517842
Yang YWang XPeng S(2020)A Dynamic Protection Mechanism for GPU Memory OverflowNetwork and Parallel Computing10.1007/978-3-030-79478-1_3(30-40)Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1007/978-3-030-79478-1_3
Ren BBalakrishna SJo YKrishnamoorthy SAgrawal KKulkarni M(2019)Extracting SIMD Parallelism from Recursive Task-Parallel ProgramsACM Transactions on Parallel Computing10.1145/33656636:4(1-37)Online publication date: 26-Dec-2019
https://dl.acm.org/doi/10.1145/3365663
Show More Cited By

Recommendations

A CUDA implementation of the Continuous Space Language Model

The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). A detailed explanation of the CSLM algorithm is provided. Implementation was ...
A performance study of general-purpose applications on graphics processors using CUDA

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Evaluating the Performance of the hipSYCL Toolchain for HPC Kernels on NVIDIA V100 GPUs
IWOCL '20: Proceedings of the International Workshop on OpenCL

Future HPC leadership computing systems for the United States Department of Energy will utilize GPUs for acceleration of scientific codes. These systems will utilize GPUs from various vendors which places a large focus on the performance portability of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Concurrency and Computation: Practice & Experience

Concurrency and Computation: Practice & Experience Volume 28, Issue 1

January 2016

183 pages

ISSN:1532-0626

EISSN:1532-0634

Issue’s Table of Contents

Publisher

John Wiley and Sons Ltd.

United Kingdom

Publication History

Published: 01 January 2016

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 03 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Maltenberger TIlic ITolovski IRabl TIves ZBonifati AEl Abbadi A(2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517842
Yang YWang XPeng S(2020)A Dynamic Protection Mechanism for GPU Memory OverflowNetwork and Parallel Computing10.1007/978-3-030-79478-1_3(30-40)Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1007/978-3-030-79478-1_3
Ren BBalakrishna SJo YKrishnamoorthy SAgrawal KKulkarni M(2019)Extracting SIMD Parallelism from Recursive Task-Parallel ProgramsACM Transactions on Parallel Computing10.1145/33656636:4(1-37)Online publication date: 26-Dec-2019
https://dl.acm.org/doi/10.1145/3365663
Singh DJoshi IChoudhary J(2018)Survey of GPU Based Sorting AlgorithmsInternational Journal of Parallel Programming10.1007/s10766-017-0502-546:6(1017-1034)Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s10766-017-0502-5
Ren SXu SYang G(2018)A Parallel Quicksort Algorithm on Manycore Processors in Sunway TaihuLightComputational Science – ICCS 201810.1007/978-3-319-93713-7_61(647-653)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1007/978-3-319-93713-7_61

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents