Nothing Special   »   [go: up one dir, main page]

skip to main content
article

CUDA-quicksort: an improved GPU-based implementation of quicksort

Published: 01 January 2016 Publication History

Abstract

Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General-purpose computing has been successfully used on Graphics Processing Units GPUs to parallelize some sorting algorithms. Two GPU-based implementations of the quicksort were presented in literature: the GPU-quicksort, a compute-unified device architecture CUDA iterative implementation, and the CUDA dynamic parallel CDP quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA-quicksort an iterative GPU-based implementation of the sorting algorithm. CUDA-quicksort has been designed starting from GPU-quicksort. Unlike GPU-quicksort, it uses atomic primitives to perform inter-block communications while ensuring an optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA-quicksort is up to four times faster than GPU-quicksort and up to three times faster than CDP-quicksort. An in-depth analysis of the performance between CUDA-quicksort and GPU-quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of using the CUDA dynamic parallelism, we implemented a recursive version of the CUDA-quicksort. Experimental results show that CUDA-quicksort is faster than the CDP-quicksort provided by NVIDIA, with better performance achieved using the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.

References

[1]
Cederman D, Tsigas P. A practical quicksort algorithm for graphics processors. In the ACM Journal of Experimental Algorithmics JEA 2009; Volume 4 Issue 14.
[2]
Hoare CAR. Quicksort. Computer Journal 1962; Volume 4 Issue 5: pp.10-15.
[3]
CUDA Toolkit 6.0, documentation. Available from: "http://docs.nvidia.com/cuda/cuda-samples/index.html" {accessed on 10 September 2014}.
[4]
NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 6.0, 2014.
[5]
Manconi A, Orro A, Manca E, Armano G, Milanesi L. GPU-BSM: A GPU-Based Tool to Map Bisulfite-Treated Reads. PloS one, 95, Public Library of Science, pp. e97277, 2014.
[6]
Sengupta S, Harris M, Zhang Y, Owens JD. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, San Diego, California, 2007; pp.97-106.
[7]
Heidelberger P, Norton A, Robinson JT. Parallel quicksort using fetch-and-add. IEEE Transactions on Computers 1990; Volume 39 Issue 1: pp.133-138.
[8]
Tsigas P, Zhang Y. A simple, fast parallel implementation of quicksort and its performance evaluation on sun enterprise 10000. Proceedings of the 11th Euromicro Conference on Parallel Distributed and Network-based Processing, Genova, Italy, 2003; pp.372-381.
[9]
Parallel bitonic sort algorithm. Available from: "http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm" {accessed on 10 September 2014}.
[10]
Purcell J, Donner C, Cammarano M, Jensen HW, Hanrahan P. Photon mapping on programmable graphics hardware. Proceedings of the 2003 Annual ACM SIGGRAPH/Eurographics Conference on Graphics Hardware EGGH 03, Aire-la-Ville, Switzerland, 2003; pp.41-50.
[11]
Kapasi UJ, Dally WJ, Rixner S, Mattson PR, Owens JD, Khailany B. Efficient conditional operations for data-parallel architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture Micro-33, New York, USA, 2000; pp.159-170.
[12]
Kipfer P, Segal M, Westermann R. Uberow: a GPU-based particle engine. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Conference on Graphics Hardware HWWS '04, New York, USA, 2004; pp.115-122.
[13]
Kipfer P, Westermann R. GPU Gems 2: In Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Ch. Improved GPU sorting, 2005; pp.205-222.
[14]
Govindaraju NK, Raghuvanshi N, Manocha D. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data SIGMOD '05, New York, USA, 2005; pp.611-622.
[15]
Dowd M, Perl Y, Rudolph L, Saks M. The periodic balanced sorting network. Journal of the ACM 1989; Volume 36 Issue 4: pp.738-757.
[16]
Govindaraju NK, Raghuvanshi N, Henson M, Manocha D. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina, Chapel Hill, 2005.
[17]
Greβ A, Zachmann G. GPU-Abisort: Optimal Parallel Sorting on Stream Architectures. Proceedings of the 20th International Conference on Parallel and Distributed Processing, IPDPS'06. IEEE Computer Society, Washington, DC, USA, 2006; pp.45-45. Available from: "http://dl.acm.org/citation.cfm?id=1898953.1898980" {accessed on 10 September 2014}.
[18]
Sintorn E, Assarsson U. Fast parallel GPU-sorting using a hybrid algorithm. Journal Parallel Distributed Computing October 2008; Volume 68 Issue 10: pp.1381-1388. Available from: "http://dx.doi.org/10.1016/j.jpdc.2008.05.012" {accessed on 10 September 2014}.
[19]
Leischner N, Osipov V, Sanders P, 2009. Gpu Sample Sort. CoRR abs/0909.5649.
[20]
Harris M, Sengupta S, Owens JD. Parallel prex sum scan with CUDA. In H. Nguyen, Editor, GPU Gems 3. Addison Wesley, 2007.
[21]
Helman DR, Bader DA, Jaja J. A randomized parallel sorting algorithm with an experimental study. Journal Parallel Distributed Computing 1998; Volume 1 Issue 52: pp.1-23.
[22]
Hoberock J, Bell N. Thrust: A Parallel Template Library, 2010.
[23]
Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In Parallel Distributed Processing. IPDPS 2009. IEEE International Symposium on, Rome, Italy, May 2009; pp.1-10.
[24]
Manconi A, Manca E, Orro A, Armano G, Gnocchi M, Moscatelli M, Milanesi L. G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read Depth Methods. Front. Bioeng. Biotechnol, 2015. 3:28.

Cited By

View all
  • (2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
  • (2020)A Dynamic Protection Mechanism for GPU Memory OverflowNetwork and Parallel Computing10.1007/978-3-030-79478-1_3(30-40)Online publication date: 28-Sep-2020
  • (2019)Extracting SIMD Parallelism from Recursive Task-Parallel ProgramsACM Transactions on Parallel Computing10.1145/33656636:4(1-37)Online publication date: 26-Dec-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Concurrency and Computation: Practice & Experience
Concurrency and Computation: Practice & Experience  Volume 28, Issue 1
January 2016
183 pages
ISSN:1532-0626
EISSN:1532-0634
Issue’s Table of Contents

Publisher

John Wiley and Sons Ltd.

United Kingdom

Publication History

Published: 01 January 2016

Author Tags

  1. CUDA
  2. GPU
  3. high performance computing
  4. quick sort

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
  • (2020)A Dynamic Protection Mechanism for GPU Memory OverflowNetwork and Parallel Computing10.1007/978-3-030-79478-1_3(30-40)Online publication date: 28-Sep-2020
  • (2019)Extracting SIMD Parallelism from Recursive Task-Parallel ProgramsACM Transactions on Parallel Computing10.1145/33656636:4(1-37)Online publication date: 26-Dec-2019
  • (2018)Survey of GPU Based Sorting AlgorithmsInternational Journal of Parallel Programming10.1007/s10766-017-0502-546:6(1017-1034)Online publication date: 1-Dec-2018
  • (2018)A Parallel Quicksort Algorithm on Manycore Processors in Sunway TaihuLightComputational Science – ICCS 201810.1007/978-3-319-93713-7_61(647-653)Online publication date: 11-Jun-2018

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media