Abstract
Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to take advantage of their massively parallel computing resources. Tiling has become a popular way to balance the workload, but hyperplane tiles increase the cost of synchronization and lead to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache; a well-tailored mapping of operations to shared memory improves both spatial and temporal locality. We evaluate the performance of the proposed technique on four applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The results demonstrate that our method achieves speedups of up to six times over the previous best-known hyperplane tiling-based GPU implementation.
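For readers unfamiliar with the pattern, the following is a minimal CUDA sketch of *baseline* anti-diagonal (wavefront) execution, illustrated with the edit-distance recurrence. It is not the authors' optimized kernel: the names (`wavefrontStep`, `editDistanceWavefront`, `d_table`) are illustrative assumptions, and the one-kernel-launch-per-diagonal driver relies on implicit grid-wide synchronization between launches, which is exactly the per-step overhead the paper's kernel configuration and inter-block lock are designed to reduce.

```cuda
// Baseline wavefront sketch (assumed names, not the paper's implementation).
// The DP table has `rows` x `cols` cells; row 0 and column 0 are assumed to be
// initialized on the host. All cells on the same anti-diagonal (i + j == diag)
// are independent and can be computed in parallel; diagonals run in order.
#include <algorithm>

__global__ void wavefrontStep(int *table, const char *a, const char *b,
                              int rows, int cols, int diag) {
    // Each thread handles one interior cell (i, j) with i + j == diag.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // candidate row index
    int j = diag - i;
    if (i >= rows || j < 1 || j >= cols) return;        // outside this diagonal

    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
    int del  = table[(i - 1) * cols + j] + 1;           // cell above
    int ins  = table[i * cols + (j - 1)] + 1;           // cell to the left
    int sub  = table[(i - 1) * cols + (j - 1)] + cost;  // diagonal cell
    table[i * cols + j] = min(sub, min(del, ins));
}

// Host-side driver: one kernel launch per diagonal. The launch boundary acts
// as a grid-wide barrier, so every wavefront step pays a full synchronization.
void editDistanceWavefront(int *d_table, const char *d_a, const char *d_b,
                           int rows, int cols) {
    const int threads = 256;
    for (int diag = 2; diag <= rows + cols - 2; ++diag) {
        int cells  = std::min(rows - 1, diag - 1);      // upper bound on active cells
        int blocks = (cells + threads - 1) / threads;   // surplus threads just return
        wavefrontStep<<<blocks, threads>>>(d_table, d_a, d_b, rows, cols, diag);
    }
}
```

In this baseline, global memory serves all reads and writes and each diagonal costs a kernel launch; the paper's contribution replaces these with shared-memory staging and a lightweight inter-block lock.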
Cite this article
Li, Y., Schwiebert, L. Memory-Optimized Wavefront Parallelism on GPUs. Int J Parallel Prog 48, 1008–1031 (2020). https://doi.org/10.1007/s10766-020-00658-y