DOI: 10.1007/978-3-030-74224-9_4
Article

Performance and Portability of a Linear Solver Across Emerging Architectures

Published: 20 November 2020

Abstract

A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize the available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult because each offers a different vector length. A similar challenge occurs on GPU architectures, where coalesced memory accesses are essential to utilize memory bandwidth effectively. In this work, we establish a performance benchmark for each target architecture in a low-level programming model such as vector intrinsics or CUDA, and use it to demonstrate that restructuring a computation, and possibly its data layout, for each architecture is essential to achieve optimal performance. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel® Xeon™ and Xeon Phi™, Marvell® ThunderX2®, NEC® SX-Aurora™ TSUBASA Vector Engine, and NVIDIA® and AMD® GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel® oneAPI/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
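To make the data-layout point concrete, here is a minimal sketch in plain Python. It is not the authors' kernel: the abstract does not say which layout the paper uses, so the sketch assumes a generic ELLPACK-style column-major layout as one familiar way to obtain unit-stride (vectorizable or coalesced) access across rows. All function names (`csr_spmv`, `csr_to_ell`, `ell_spmv`) and the small test matrix are hypothetical. It computes the same sparse matrix-vector product from a row-oriented CSR layout and from the restructured layout.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR form; each row walks its own slice of vals."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def csr_to_ell(row_ptr, col_idx, vals):
    """Restructure CSR into a padded, column-major ELLPACK-style layout.

    Entry j of row i is stored at index j * n + i, so consecutive rows
    are adjacent in memory for a fixed j. Padding uses value 0.0 and
    column 0, which contributes nothing to the product.
    """
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [0] * (n * width)
    ell_vals = [0.0] * (n * width)
    for i in range(n):
        for j, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[j * n + i] = col_idx[k]
            ell_vals[j * n + i] = vals[k]
    return width, ell_cols, ell_vals


def ell_spmv(n, width, ell_cols, ell_vals, x):
    """y = A @ x from the restructured layout.

    The inner loop over rows i reads ell_vals with unit stride; on a CPU
    it would map to vector lanes, on a GPU to adjacent threads.
    """
    y = [0.0] * n
    for j in range(width):
        for i in range(n):
            y[i] += ell_vals[j * n + i] * x[ell_cols[j * n + i]]
    return y


# Hypothetical 3x3 example: A = [[4, 1, 0], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 1, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y_csr = csr_spmv(row_ptr, col_idx, vals, x)
width, ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, vals)
y_ell = ell_spmv(len(x), width, ell_cols, ell_vals, x)
print(y_csr, y_ell)
```

Both layouts give the same result; only the order in which memory is traversed changes. The padding in the restructured layout trades extra storage for regular access, which is the kind of architecture-dependent trade-off the abstract alludes to.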

    Published In

    Accelerator Programming Using Directives: 7th International Workshop, WACCPD 2020, Virtual Event, November 20, 2020, Proceedings
    Nov 2020
    107 pages
    ISBN:978-3-030-74223-2
    DOI:10.1007/978-3-030-74224-9

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Author Tags

    1. Programming models
    2. Performance portability
    3. Emerging architecture
    4. CFD
    5. HPC
    6. CUDA
    7. OpenACC
    8. OCCA
    9. AVX-512 intrinsics
    10. Neon intrinsics
    11. Arm
    12. GPU
    13. V100
    14. A100
    15. MI50
    16. Xeon Phi
    17. SX-Aurora
    18. ThunderX2
