Open access

Exploring the limits of GPGPU scheduling in control flow bound applications

Published: 26 January 2012

Abstract

GPGPUs are optimized for graphics; as a result, the hardware is tuned for massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, a GPGPU-based system can achieve very high performance. However, many general-purpose data-parallel applications are characterized by intensive control flow and unpredictable memory access patterns.
Optimizing the code of such problems for current hardware is often ineffective, and even impractical, because the resulting execution exhibits low hardware utilization and therefore relatively low performance.
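To make the divergence problem concrete, the short CUDA kernel below is a minimal sketch of our own (not code from the paper; the kernel name and the work_type input are hypothetical). Threads of the same 32-lane warp that evaluate the branch differently force the SIMT hardware to execute both paths one after the other, with the lanes on the inactive path masked off:

    // Minimal illustration of branch divergence (hypothetical example, not from the paper).
    // Lanes of one warp that disagree on the branch cause both paths to be executed
    // serially, with the inactive lanes masked off, so execution slots are wasted.
    __global__ void divergent_kernel(const int *work_type, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n)
            return;

        if (work_type[tid] == 0) {           // some lanes of the warp take this path...
            out[tid] = sqrtf((float)tid);
        } else {                             // ...the rest take this one; the warp runs
            float acc = 0.0f;                // both paths back to back
            for (int i = 0; i < work_type[tid]; ++i)
                acc += (float)i;
            out[tid] = acc;
        }
    }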
This work tracks down the root causes of execution inefficiencies when running control-flow-intensive CUDA applications on NVIDIA GPGPU hardware.
We show, both analytically and by simulation of various benchmarks, that local thread scheduling has inherent limitations when dealing with applications that exhibit a high rate of branch divergence. To overcome these limitations we propose hierarchical warp scheduling and global warp reconstruction. We implement an ideal hierarchical warp scheduling mechanism, termed ODGS (Oracle Dynamic Global Scheduling), designed to maximize machine utilization via global warp reconstruction. For control-flow-bound applications that make no use of shared memory, we show that (1) there is still substantial potential for performance improvement, and (2) based on various synthetic and real benchmarks, we demonstrate the performance improvement that is feasible.
For example, MUM and BFS are parallel graph algorithms that suffer from significant branch divergence. We show that for these algorithms it is possible to achieve performance gains of up to 4.4x and 2.6x, respectively, relative to previously applied scheduling methods.
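As a rough illustration of where this headroom comes from, the following back-of-the-envelope model is our own assumption for exposition, not an analysis taken from the paper:

    % Illustrative utilization model (assumed for exposition, not taken from the paper).
    % A warp of W = 32 lanes reaches a branch; k lanes take path A (cost t_A cycles) and
    % the remaining W - k lanes take path B (cost t_B). Stack-based reconvergence runs
    % both paths serially, so the fraction of useful lane-cycles is
    \[
      U_{\mathrm{diverged}} = \frac{k\, t_A + (W - k)\, t_B}{W\,(t_A + t_B)} .
    \]
    % Example: k = 16 and t_A = t_B = t give U = (16t + 16t)/(32 \cdot 2t) = 0.5, whereas
    % warps reconstructed globally so that every lane follows the same path approach U = 1.
    % This gap is the utilization headroom that an oracle scheduler such as ODGS is meant
    % to expose; nested divergence widens it further.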

Published In

ACM Transactions on Architecture and Code Optimization  Volume 8, Issue 4
Special Issue on High-Performance Embedded Architectures and Compilers
January 2012
765 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2086696

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 January 2012
Accepted: 01 December 2011
Revised: 01 October 2011
Received: 01 July 2011
Published in TACO Volume 8, Issue 4

Author Tags

  1. GPGPU
  2. parallel machines
  3. scheduling algorithm

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2019) A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 120-127. DOI: 10.1145/3293320.3293331. Online publication date: 14-Jan-2019.
  • (2018) Using program branch probability for the thread parallelisation of branch divergence on the CUDA platform. International Journal of Autonomous and Adaptive Communications Systems 11(2), 171-191. DOI: 10.1504/IJAACS.2018.092031. Online publication date: 21-Dec-2018.
  • (2018) Control Divergence Optimization through Partial Warp Regrouping in GPGPUs. Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, 369-374. DOI: 10.1145/3297156.3297201. Online publication date: 8-Dec-2018.
  • (2015) Enhanced GPU Resource Utilization through Fairness-aware Task Scheduling. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03, 45-52. DOI: 10.1109/Trustcom.2015.611. Online publication date: 20-Aug-2015.
  • (2014) VGRIS: Virtualized GPU Resource Isolation and Scheduling in Cloud Gaming. ACM Transactions on Architecture and Code Optimization (TACO) 11(2), 1-25. DOI: 10.1145/2632216. Online publication date: 15-Jul-2014.
  • (2012) A quantitative study of irregular programs on GPUs. Proceedings of the 2012 IEEE International Symposium on Workload Characterization (IISWC), 141-151. DOI: 10.1109/IISWC.2012.6402918. Online publication date: 4-Nov-2012.
