Abstract
This work evaluates OpenMP tasking combined with target GPU offloading as a way to achieve both programming productivity and performance on heterogeneous systems. The authors also propose an extension to the OpenMP specification, the OpenMP target task, which integrates tasking and target GPU offloading in a single OpenMP pragma to simplify the implementation of heterogeneous codes. As a test case, they use one of the most widely used Basic Linear Algebra Subprograms (BLAS) Level-3 routines: the triangular solver (TRSM). To exploit the heterogeneity of current high-performance computing systems, the authors propose a different parallelization of the algorithm based on a nonuniform decomposition of the problem, placing target GPU offloading inside OpenMP tasks to match the work to the available hardware. This approach outperforms state-of-the-art algorithms that use a uniform decomposition of the data on both CPU-only and hybrid CPU-GPU systems, reaching speedups of up to one order of magnitude. It is faster than the IBM ESSL math library on the CPU and competitive with a highly optimized heterogeneous CUDA version. One node of Oak Ridge National Laboratory's supercomputer, Summit, was used for the performance analysis.
Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Valero-Lara, P., Kim, J., Hernandez, O., Vetter, J. (2022). OpenMP Target Task: Tasking and Target Offloading on Heterogeneous Systems. In: Chaves, R., et al. Euro-Par 2021: Parallel Processing Workshops. Euro-Par 2021. Lecture Notes in Computer Science, vol 13098. Springer, Cham. https://doi.org/10.1007/978-3-031-06156-1_35
DOI: https://doi.org/10.1007/978-3-031-06156-1_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06155-4
Online ISBN: 978-3-031-06156-1
eBook Packages: Computer Science (R0)