Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications

Published: 01 December 2017

Abstract

The aim of this paper is to evaluate the performance of two new CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications, in comparison with versions using the standard CUDA API. To gain insight into the performance of these mechanisms, we implemented three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications are verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be introduced into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions with their standard API counterparts. Dynamic parallelism improved performance for the heat simulation; for numerical integration it was faster than a static version but slower than an iterative one; and for Goldbach's conjecture verification it gave worse results. In most cases, unified memory leads to a decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable code: dynamic parallelism in algorithms to which it maps naturally, and unified memory more generally, as it resembles the traditional memory allocation/usage pattern and thus lowers the entry barrier to CUDA programming.
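
The contrast the abstract draws between the two memory models can be illustrated with a minimal sketch; this is our illustration rather than code from the paper, and the kernel name scale and the problem size are hypothetical. With the standard API the programmer allocates separate host and device buffers and copies between them explicitly, whereas cudaMallocManaged returns a single pointer valid on both host and device:

    // Minimal unified-memory sketch (illustrative; not the paper's code).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;        // trivial element-wise work
    }

    int main() {
        const int n = 1 << 20;
        float *a;
        // One pointer usable on host and device; no explicit cudaMemcpy,
        // the driver migrates pages on demand.
        cudaMallocManaged(&a, n * sizeof(float));
        for (int i = 0; i < n; ++i) a[i] = 1.0f;   // host writes directly

        scale<<<(n + 255) / 256, 256>>>(a, n);
        cudaDeviceSynchronize();                    // required before host reads

        printf("a[0] = %f\n", a[0]);
        cudaFree(a);
        return 0;
    }

Dynamic parallelism, in turn, lets a kernel launch child grids from the device, which is why it maps naturally onto the divide-and-conquer scheme evaluated in the paper. Below is a hedged sketch, again with hypothetical names; device-side launches require compute capability 3.5 or higher and compilation with nvcc -rdc=true:

    // Sketch of a parent kernel launching child kernels on the device.
    __global__ void child(float *a, int offset, int n) {
        int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }

    __global__ void parent(float *a, int n) {
        // A single parent thread splits the range in two and launches a
        // child grid per half, without returning control to the host.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int half = n / 2;
            child<<<(half + 255) / 256, 256>>>(a, 0, half);
            child<<<(n - half + 255) / 256, 256>>>(a, half, n);
        }
    }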


Cited By

  • (2024) Teaching High-Performance Computing Systems – A Case Study with Parallel Programming APIs: MPI, OpenMP and CUDA. Computational Science – ICCS 2024, pp 398–412. https://doi.org/10.1007/978-3-031-63783-4_29
  • (2022) Design and Implementation of an Efficient Priority Queue Data Structure. Computational Science and Its Applications – ICCSA 2022 Workshops, pp 343–357. https://doi.org/10.1007/978-3-031-10562-3_25
  • (2020) A quantitative evaluation of unified memory in GPUs. The Journal of Supercomputing 76(4):2958–2985. https://doi.org/10.1007/s11227-019-03079-y
  • (2019) Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, pp 43–52. https://doi.org/10.1145/3300053.3319419


Information

Published In

The Journal of Supercomputing, Volume 73, Issue 12 (December 2017), 434 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

  1. CUDA
  2. Dynamic parallelism
  3. Parallel programming
  4. Unified memory
