Abstract
Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to take advantage of their massively parallel computing resources. Tiling has become a popular way to balance the workload, but hyperplane tiles increase the cost of synchronization and lead to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache; a well-tailored mapping of operations to shared memory improves both spatial and temporal locality. We evaluate the performance of the proposed technique on four applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The results demonstrate that our method achieves speedups of up to six times over the previous best-known hyperplane tiling-based GPU implementation.
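For readers unfamiliar with the pattern, the following is a minimal CUDA sketch of *baseline* anti-diagonal (wavefront) execution, illustrated with the edit-distance recurrence. It is not the authors' optimized kernel: the names (`wavefrontStep`, `editDistanceWavefront`, `d_table`) are illustrative assumptions, and the one-kernel-launch-per-diagonal driver relies on implicit grid-wide synchronization between launches, which is exactly the per-step overhead the paper's kernel configuration and inter-block lock are designed to reduce.

```cuda
// Baseline wavefront sketch (assumed names, not the paper's implementation).
// The DP table has `rows` x `cols` cells; row 0 and column 0 are assumed to be
// initialized on the host. All cells on the same anti-diagonal (i + j == diag)
// are independent and can be computed in parallel; diagonals run in order.
#include <algorithm>

__global__ void wavefrontStep(int *table, const char *a, const char *b,
                              int rows, int cols, int diag) {
    // Each thread handles one interior cell (i, j) with i + j == diag.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // candidate row index
    int j = diag - i;
    if (i >= rows || j < 1 || j >= cols) return;        // outside this diagonal

    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
    int del  = table[(i - 1) * cols + j] + 1;           // cell above
    int ins  = table[i * cols + (j - 1)] + 1;           // cell to the left
    int sub  = table[(i - 1) * cols + (j - 1)] + cost;  // diagonal cell
    table[i * cols + j] = min(sub, min(del, ins));
}

// Host-side driver: one kernel launch per diagonal. The launch boundary acts
// as a grid-wide barrier, so every wavefront step pays a full synchronization.
void editDistanceWavefront(int *d_table, const char *d_a, const char *d_b,
                           int rows, int cols) {
    const int threads = 256;
    for (int diag = 2; diag <= rows + cols - 2; ++diag) {
        int cells  = std::min(rows - 1, diag - 1);      // upper bound on active cells
        int blocks = (cells + threads - 1) / threads;   // surplus threads just return
        wavefrontStep<<<blocks, threads>>>(d_table, d_a, d_b, rows, cols, diag);
    }
}
```

In this baseline, global memory serves all reads and writes and each diagonal costs a kernel launch; the paper's contribution replaces these with shared-memory staging and a lightweight inter-block lock.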
Cite this article
Li, Y., Schwiebert, L. Memory-Optimized Wavefront Parallelism on GPUs. Int J Parallel Prog 48, 1008–1031 (2020). https://doi.org/10.1007/s10766-020-00658-y