DOI: 10.1007/978-3-030-74224-9_4
Article

Performance and Portability of a Linear Solver Across Emerging Architectures

Published: 20 November 2020

Abstract

A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined across a broad range of familiar and emerging architectures. Implementing a linear solver efficiently is challenging on recent CPUs that offer vector architectures. Vector loads and stores are essential to use the available memory bandwidth effectively on CPUs, and maintaining performance across different CPUs is difficult because each offers a different vector length. A similar challenge arises on GPU architectures, where coalesced memory accesses are essential to use memory bandwidth effectively. In this work, we establish a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA, and in doing so demonstrate that restructuring the computation, and possibly the data layout, to suit the architecture is essential to achieve optimal performance. We show how a linear solver kernel can be mapped to Intel® Xeon™ and Xeon Phi™, Marvell® ThunderX2®, NEC® SX-Aurora™ TSUBASA Vector Engine, and NVIDIA® and AMD® GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel® oneAPI™/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
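
The abstract's point about restructuring the data layout can be illustrated with a small sketch. The abstract does not describe the solver's actual data structures, so everything below is an assumption made for illustration: a scalar sparse matrix in CSR form is repacked into a padded, column-major ELLPACK-style layout, so that the k-th nonzero of consecutive rows sits at consecutive addresses. That unit-stride pattern is the kind of access that vector loads on CPUs and coalesced loads on GPUs favor. The function names (`csr_to_ell`, `spmv_csr`, `spmv_ell`) and the 3x3 matrix are hypothetical.

```python
# Hypothetical illustration only; not the solver or layout from the paper.

def csr_to_ell(row_ptr, col_idx, vals):
    """Pad a CSR matrix to ELLPACK form, stored column-major.

    Entry k of row i lands at flat index k * n_rows + i, so the k-th
    entries of consecutive rows are adjacent in memory.
    """
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_cols = [0] * (width * n_rows)    # padded slots point at column 0
    ell_vals = [0.0] * (width * n_rows)  # padded slots hold 0.0, so they add nothing
    for i in range(n_rows):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[k * n_rows + i] = col_idx[p]
            ell_vals[k * n_rows + i] = vals[p]
    return width, ell_cols, ell_vals


def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference y = A x: each row walks its own contiguous segment."""
    y = []
    for i in range(len(row_ptr) - 1):
        y.append(sum(vals[p] * x[col_idx[p]]
                     for p in range(row_ptr[i], row_ptr[i + 1])))
    return y


def spmv_ell(n_rows, width, ell_cols, ell_vals, x):
    """y = A x over the ELL layout.

    The inner loop over i reads ell_vals[base], ell_vals[base + 1], ...,
    a unit-stride pattern that maps onto vector lanes or GPU threads.
    """
    y = [0.0] * n_rows
    for k in range(width):
        base = k * n_rows
        for i in range(n_rows):
            y[i] += ell_vals[base + i] * x[ell_cols[base + i]]
    return y


# 3x3 example matrix: [[4, 1, 0], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 1, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

width, ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, vals)
y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_ell = spmv_ell(3, width, ell_cols, ell_vals, x)
```

Both routines compute the same product; only the memory order of the operands differs. The cost of this layout is the padding: rows shorter than the widest row carry explicit zeros, which is one reason the best layout may differ from one architecture to another.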


Published In

Accelerator Programming Using Directives: 7th International Workshop, WACCPD 2020, Virtual Event, November 20, 2020, Proceedings
Nov 2020
107 pages
ISBN:978-3-030-74223-2
DOI:10.1007/978-3-030-74224-9

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Programming models
  2. Performance portability
  3. Emerging architecture
  4. CFD
  5. HPC
  6. CUDA
  7. OpenACC
  8. OCCA
  9. AVX-512 intrinsics
  10. Neon intrinsics
  11. Arm
  12. GPU
  13. V100
  14. A100
  15. MI50
  16. Xeon Phi
  17. SX-Aurora
  18. ThunderX2

Qualifiers

  • Article
