Open access

Exploring the limits of GPGPU scheduling in control flow bound applications

Published: 26 January 2012

Abstract

GPGPUs are optimized for graphics; as a result, the hardware is tuned for massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, a GPGPU-based system can achieve very high performance. However, many general-purpose data-parallel applications are characterized by intensive control flow and unpredictable memory access patterns.
Optimizing the code of such problems for current hardware is often ineffective, and even impractical, because the resulting execution exhibits low hardware utilization and therefore relatively low performance.
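To make the divergence problem concrete, the short CUDA kernel below is a minimal sketch of our own (not code from the paper; the kernel name and the work_type input are hypothetical). Threads of the same 32-lane warp that evaluate the branch differently force the SIMT hardware to execute both paths one after the other, with the lanes on the inactive path masked off:

    // Minimal illustration of branch divergence (hypothetical example, not from the paper).
    // Lanes of one warp that disagree on the branch cause both paths to be executed
    // serially, with the inactive lanes masked off, so execution slots are wasted.
    __global__ void divergent_kernel(const int *work_type, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n)
            return;

        if (work_type[tid] == 0) {           // some lanes of the warp take this path...
            out[tid] = sqrtf((float)tid);
        } else {                             // ...the rest take this one; the warp runs
            float acc = 0.0f;                // both paths back to back
            for (int i = 0; i < work_type[tid]; ++i)
                acc += (float)i;
            out[tid] = acc;
        }
    }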
This work tracks down the root causes of execution inefficiencies when running control-flow-intensive CUDA applications on NVIDIA GPGPU hardware.
We show, both analytically and by simulation of various benchmarks, that local thread scheduling has inherent limitations when dealing with applications that exhibit a high rate of branch divergence. To overcome these limitations we propose hierarchical warp scheduling and global warp reconstruction. We implement an ideal hierarchical warp scheduling mechanism, termed ODGS (Oracle Dynamic Global Scheduling), designed to maximize machine utilization via global warp reconstruction. For control-flow-bound applications that make no use of shared memory, we show that (1) there is still substantial potential for performance improvement, and (2) based on various synthetic and real benchmarks, we demonstrate the performance improvement that is feasible.
For example, MUM and BFS are parallel graph algorithms that suffer from significant branch divergence. We show that for these algorithms it is possible to achieve performance gains of up to 4.4x and 2.6x, respectively, relative to previously applied scheduling methods.
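As a rough illustration of where this headroom comes from, the following back-of-the-envelope model is our own assumption for exposition, not an analysis taken from the paper:

    % Illustrative utilization model (assumed for exposition, not taken from the paper).
    % A warp of W = 32 lanes reaches a branch; k lanes take path A (cost t_A cycles) and
    % the remaining W - k lanes take path B (cost t_B). Stack-based reconvergence runs
    % both paths serially, so the fraction of useful lane-cycles is
    \[
      U_{\mathrm{diverged}} = \frac{k\, t_A + (W - k)\, t_B}{W\,(t_A + t_B)} .
    \]
    % Example: k = 16 and t_A = t_B = t give U = (16t + 16t)/(32 \cdot 2t) = 0.5, whereas
    % warps reconstructed globally so that every lane follows the same path approach U = 1.
    % This gap is the utilization headroom that an oracle scheduler such as ODGS is meant
    % to expose; nested divergence widens it further.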

Published In

ACM Transactions on Architecture and Code Optimization  Volume 8, Issue 4
Special Issue on High-Performance Embedded Architectures and Compilers
January 2012
765 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2086696

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 January 2012
Accepted: 01 December 2011
Revised: 01 October 2011
Received: 01 July 2011
Published in TACO Volume 8, Issue 4

Author Tags

  1. GPGPU
  2. parallel machines
  3. scheduling algorithm

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2019) A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 120-127. DOI: 10.1145/3293320.3293331. Online publication date: 14-Jan-2019.
  • (2018) Using program branch probability for the thread parallelisation of branch divergence on the CUDA platform. International Journal of Autonomous and Adaptive Communications Systems 11(2), 171-191. DOI: 10.1504/IJAACS.2018.092031. Online publication date: 21-Dec-2018.
  • (2018) Control Divergence Optimization through Partial Warp Regrouping in GPGPUs. Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, 369-374. DOI: 10.1145/3297156.3297201. Online publication date: 8-Dec-2018.
  • (2015) Enhanced GPU Resource Utilization through Fairness-aware Task Scheduling. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03, 45-52. DOI: 10.1109/Trustcom.2015.611. Online publication date: 20-Aug-2015.
  • (2014) VGRIS: Virtualized GPU Resource Isolation and Scheduling in Cloud Gaming. ACM Transactions on Architecture and Code Optimization (TACO) 11(2), 1-25. DOI: 10.1145/2632216. Online publication date: 15-Jul-2014.
  • (2012) A quantitative study of irregular programs on GPUs. Proceedings of the 2012 IEEE International Symposium on Workload Characterization (IISWC), 141-151. DOI: 10.1109/IISWC.2012.6402918. Online publication date: 4-Nov-2012.
