An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms

Authors:

Gregorio Bernabé,

Domingo Giménez

International Journal of Parallel Programming, Volume 43, Issue 6

Pages 1160 - 1191

https://doi.org/10.1007/s10766-014-0328-3

Published: 01 December 2015

Abstract

This work presents an optimization method for running the 3D fast wavelet transform (3D-FWT) on a CPU + GPU system. The optimization engine detects the computing components in the system and executes the appropriate kernel, implemented in either CUDA or OpenCL for GPUs and with pthreads for the CPU. The engine automatically selects parameters such as the block size, the work-group size, or the number of threads to reduce the execution time, and sends proportionally sized parts of a video sequence to run concurrently on all the computing components of the system. An analysis of the development and optimization of the 3D-FWT for a hybrid cluster of CPU + GPUs is also described. Different parallel programming paradigms (message passing, shared memory, and GPU SIMD) are combined to fully exploit the computing capacity of the different computational elements of the cluster, resulting in an efficient combination of basic codes developed previously for individual components (CPUs or GPUs) and a significant reduction in the compression time of long video sequences.
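The two ideas named in the abstract, empirical parameter selection and proportional work distribution, can be illustrated with a minimal Python sketch. This is not the authors' engine: the function names `pick_fastest` and `split_frames`, the `measure` callback, and the device names are all hypothetical, and the sketch assumes each device's throughput has already been measured. It also ignores any constraint the 3D-FWT may place on how frames are grouped.

```python
from typing import Callable, Dict, Iterable


def pick_fastest(candidates: Iterable[int], measure: Callable[[int], float]) -> int:
    """Return the candidate parameter value with the lowest measured time.

    `measure` is a hypothetical callback that runs a short benchmark for one
    candidate (e.g. a block size or thread count) and returns seconds.
    """
    return min(candidates, key=measure)


def split_frames(total_frames: int, speeds: Dict[str, float]) -> Dict[str, int]:
    """Split `total_frames` among devices in proportion to their speeds.

    `speeds` maps a (hypothetical) device name to its measured throughput.
    The largest-remainder method keeps the shares summing to `total_frames`.
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if not speeds or any(s <= 0 for s in speeds.values()):
        raise ValueError("speeds must be non-empty and strictly positive")

    total_speed = sum(speeds.values())
    # Ideal (fractional) share of frames for each device.
    exact = {dev: total_frames * s / total_speed for dev, s in speeds.items()}
    # Round every share down, then hand out the frames lost to rounding.
    shares = {dev: int(x) for dev, x in exact.items()}
    leftover = total_frames - sum(shares.values())
    by_remainder = sorted(speeds, key=lambda d: exact[d] - shares[d], reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


if __name__ == "__main__":
    # Toy cost model standing in for a real benchmark run.
    best_block = pick_fastest([64, 128, 256], lambda b: abs(b - 128))
    print("selected block size:", best_block)
    print(split_frames(64, {"gpu0": 3.0, "gpu1": 3.0, "cpu": 2.0}))
```

In this sketch a device that is measured to be 1.5 times faster receives 1.5 times as many frames, so all devices would finish at roughly the same time if the measured throughputs hold for the full sequence.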

Cited By

Bernabé, G., Casanova, J., Cuenca, J., González-Carrillo, J.: A self-optimized software tool for quantifying the degree of left ventricle hyper-trabeculation. The Journal of Supercomputing 75(3), 1625-1640 (2019). doi:10.1007/s11227-018-2722-x. Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11227-018-2722-x

Published In

International Journal of Parallel Programming Volume 43, Issue 6

December 2015

283 pages

ISSN: 0885-7458

Copyright © 2015 Springer Science+Business Media New York.

Publisher

Kluwer Academic Publishers

United States
