Abstract
With the emergence of heterogeneous architectures, developing parallel software has become an increasingly complex task. The ability to use multiple devices in a single application, such as CPUs, accelerators, or coprocessors, has made implementation and optimization challenging. The inherent complexity of a parallel algorithm, its multiple implementations, and the possible mappings onto the available processors are just some examples of how intricate these tasks can become. To alleviate these issues, this paper proposes a hybrid static–dynamic selector to better exploit the resources provided by heterogeneous systems. Specifically, the framework generates, at compile time, a decision tree based on historical information; this tree is used to select the best-performing implementation at run time. To evaluate the benefits of this approach, we analyze performance with two use cases: general matrix–matrix multiplication and a medical image processing application. The experimental results demonstrate that the proposed selector enhances performance and reduces the effort needed to tune applications. Our solution improves overall application performance by 10 to 24% in comparison with another similar approach.
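The idea of a decision tree that is fixed at compile time and consulted at run time can be illustrated with a minimal C++ sketch. It is not the framework's actual generated code: the names `Impl`, `Features`, and `select_impl`, the choice of features, and the thresholds 128 and 1024 are all hypothetical placeholders, whereas a real tree would be derived from historical timing data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical implementation alternatives for a single kernel (e.g. GEMM).
enum class Impl { CpuSequential, CpuParallel, Gpu };

// Run-time features inspected by the selector (illustrative field names).
struct Features {
    std::size_t n;       // dimension of the square input matrices
    bool gpu_available;  // whether an accelerator was detected at start-up
};

// Decision tree as it might be emitted at compile time from historical
// timings. The thresholds below are made-up placeholders, not measurements.
Impl select_impl(const Features& f) {
    assert(f.n > 0 && "matrix dimension must be positive");
    if (f.n < 128) {
        return Impl::CpuSequential;  // small inputs: no parallel start-up cost
    }
    if (!f.gpu_available || f.n < 1024) {
        return Impl::CpuParallel;    // medium inputs, or no accelerator present
    }
    return Impl::Gpu;                // large inputs with an accelerator
}

// Human-readable label for logging which alternative was chosen.
const char* to_string(Impl impl) {
    switch (impl) {
        case Impl::CpuSequential: return "cpu-sequential";
        case Impl::CpuParallel:   return "cpu-parallel";
        case Impl::Gpu:           return "gpu";
    }
    return "unknown";
}
```

A caller would fill in `Features` once the problem size and the available devices are known, then dispatch to whichever alternative `select_impl` returns. Because the tree is plain branching code, the selection costs only a few comparisons per call.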
Acknowledgements
This work has been partially supported by the EU Project ICT 644235 “RePhrase: REfactoring Parallel Heterogeneous Resource-Aware Applications” and the Project TIN2016-79637-P “Towards Unification of HPC and Big Data Paradigms” from the Spanish “Ministerio de Economía y Competitividad”.
About this article
Cite this article
del Rio Astorga, D., Dolz, M.F., Fernandez, J. et al. Hybrid static–dynamic selection of implementation alternatives in heterogeneous environments. J Supercomput 75, 4098–4113 (2019). https://doi.org/10.1007/s11227-017-2147-y
DOI: https://doi.org/10.1007/s11227-017-2147-y