Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications

Published: 01 December 2017

Abstract

The aim of this paper is to evaluate the performance of two new CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications, in comparison with versions using the standard CUDA API. To gain insight into the performance of these mechanisms, we implemented three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications are verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be introduced into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions with their standard API counterparts. Dynamic parallelism improved performance for the heat simulation; for numerical integration it was faster than a static version but slower than an iterative one; and for Goldbach's conjecture verification it gave worse results. In most cases, unified memory leads to a decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable code: dynamic parallelism in algorithms to which it maps naturally, and unified memory more generally, as it resembles the traditional memory allocation/usage pattern and thus lowers the entry barrier to CUDA programming.
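
The contrast the abstract draws between the two memory models can be illustrated with a minimal sketch; this is our illustration rather than code from the paper, and the kernel name scale and the problem size are hypothetical. With the standard API the programmer allocates separate host and device buffers and copies between them explicitly, whereas cudaMallocManaged returns a single pointer valid on both host and device:

    // Minimal unified-memory sketch (illustrative; not the paper's code).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;        // trivial element-wise work
    }

    int main() {
        const int n = 1 << 20;
        float *a;
        // One pointer usable on host and device; no explicit cudaMemcpy,
        // the driver migrates pages on demand.
        cudaMallocManaged(&a, n * sizeof(float));
        for (int i = 0; i < n; ++i) a[i] = 1.0f;   // host writes directly

        scale<<<(n + 255) / 256, 256>>>(a, n);
        cudaDeviceSynchronize();                    // required before host reads

        printf("a[0] = %f\n", a[0]);
        cudaFree(a);
        return 0;
    }

Dynamic parallelism, in turn, lets a kernel launch child grids from the device, which is why it maps naturally onto the divide-and-conquer scheme evaluated in the paper. Below is a hedged sketch, again with hypothetical names; device-side launches require compute capability 3.5 or higher and compilation with nvcc -rdc=true:

    // Sketch of a parent kernel launching child kernels on the device.
    __global__ void child(float *a, int offset, int n) {
        int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1.0f;
    }

    __global__ void parent(float *a, int n) {
        // A single parent thread splits the range in two and launches a
        // child grid per half, without returning control to the host.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int half = n / 2;
            child<<<(half + 255) / 256, 256>>>(a, 0, half);
            child<<<(n - half + 255) / 256, 256>>>(a, half, n);
        }
    }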


Cited By

  • (2024) Teaching High-Performance Computing Systems – A Case Study with Parallel Programming APIs: MPI, OpenMP and CUDA. Computational Science – ICCS 2024, pp 398–412. https://doi.org/10.1007/978-3-031-63783-4_29
  • (2022) Design and Implementation of an Efficient Priority Queue Data Structure. Computational Science and Its Applications – ICCSA 2022 Workshops, pp 343–357. https://doi.org/10.1007/978-3-031-10562-3_25
  • (2020) A quantitative evaluation of unified memory in GPUs. The Journal of Supercomputing 76(4):2958–2985. https://doi.org/10.1007/s11227-019-03079-y
  • (2019) Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, pp 43–52. https://doi.org/10.1145/3300053.3319419


Information

Published In

The Journal of Supercomputing, Volume 73, Issue 12 (December 2017), 434 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

  1. CUDA
  2. Dynamic parallelism
  3. Parallel programming
  4. Unified memory
