Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Fine-grain task aggregation and coordination on GPUs

Published: 14 June 2014 Publication History

Abstract

In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads execut-ing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar opera-tions across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the chan-nel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. However, no practical implementation has been proposed
To this end, we propose and evaluate the first channel im-plementation. To demonstrate the utility of channels, we present a case study that maps the fine-grain, recursive task spawning in the Cilk programming language to channels by representing it as a flow graph. To support data-parallel recursion in bounded memory, we propose a hardware mechanism that allows wavefronts to yield their execution resources. Through channels and wavefront yield, we im-plement four Cilk benchmarks. We show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight compute units

References

[1]
G. E. Moore, "Cramming More Components onto Integrated Circuits," Proc. IEEE, vol. 86, no. 1, pp. 82--85, Jan. 1998.
[2]
S. R. Gutta, D. Foley, A. Naini, R. Wasmuth, and D. Cherepacha, "A Low-Power Integrated x86--64 and Graphics Processor for Mobile Computing Devices," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 270--272.
[3]
M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A Fully Inte-grated Multi-CPU, GPU and Memory Controller 32nm Processor," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 264--266.
[4]
"Bringing High-End Graphics to Handheld Devices," NVIDIA, 2011.
[5]
G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," AMD, Aug. 2012.
[6]
"CUDA C Programming Guide." {Online}. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[7]
"OpenCL 2.0 Reference Pages." {Online}. Available: http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/.
[8]
B. R. Gaster and L. Howes, "Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, Aug. 2012.
[9]
"Intel Threading Building Blocks." {Online}. Available: http://www.threadingbuildingblocks.org/.
[10]
S. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in Fifth Conference on Partitioned Global Address Space Programming Models, 2011.
[11]
J. Valois, "Implementing Lock-Free Queues," in Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, 1994, pp. 64--69.
[12]
C. Gong and J. M. Wing, "A Library of Concurrent Objects and Their Proofs of Correctness," Carnegie Mellon University, Technical Report, 1990.
[13]
A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans Program Lang Syst, vol. 5, no. 2, pp. 164--189, Apr. 1983.
[14]
M. M. Michael and M. L. Scott, "Nonblocking Algorithms and Preemption-safe Locking on Multiprogrammed Shared Memory Multiprocessors," J Parallel Distrib Comput, vol. 51, no. 1, pp. 1--26, May 1998.
[15]
E. Ladan-mozes and N. Shavit, "An Optimistic Approach to Lock-Free FIFO queues," in Proceedings of the 18th International Symposium on Distributed Computing, 2004, pp. 117--131.
[16]
"HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG)," HSA Foundation, Spring 2013.
[17]
M. Harris, "Optimizing Parallel Reduction in CUDA," NVIDIA. {Online}. Available: http://developer.download.nvidia.com/assets/ cuda/files/reduction.pdf.
[18]
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun ACM, vol. 51, no. 1, pp. 107--113, Jan. 2008.
[19]
W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for Streaming Applications," in Proceedings of the 11th International Conference on Compiler Construction, London, UK, 2002, pp. 179--196.
[20]
J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: A Programming Model for Graphics Pipelines," ACM Trans Graph, vol. 28, no. 1, pp. 4:1--4:11, Feb. 2009.
[21]
M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," in Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, New York, N.Y., USA, 1998, pp. 212--223.
[22]
G. Diamos and S. Yalamanchili, "Speculative Execution on Multi-GPU Systems," in 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2010, pp. 1--12.
[23]
Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism," in IEEE International Symposium on Parallel Distributed Processing, 2009. IPDPS 2009, 2009, pp. 1--12.
[24]
N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," SIGARCH Comput Arch. News, vol. 39, no. 2, pp. 1--7, Aug. 2011.
[25]
P. Conway and B. Hughes, "The AMD Opteron Northbridge Architecture," IEEE Micro, vol. 27, no. 2, pp. 10--21, Mar. 2007.
[26]
B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 201--212.
[27]
B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," presented at the 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).
[28]
V. Strassen, "The Asymptotic Spectrum of Tensors and the Exponent of Matrix Multiplication," in Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Washington, D.C., USA, 1986, pp. 49--54.
[29]
AMD Corporation, "AMD Accelerated Parallel Processing SDK." {Online}. Available: http://developer.amd.com/tools-and-sdks/.
[30]
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009. ISPASS 2009, 2009, pp. 163--174.
[31]
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Thread Scheduling for Massively Multithreaded Processors," IEEE Micro, vol. 33, no. 3, pp. 78--85, May 2013.
[32]
M. Steffen and J. Zambreno, "Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Washington, D.C., USA, 2010, pp. 237--248.
[33]
J. Hoberock, V. Lu, Y. Jia, and J. C. Hart, "Stream Compaction for Deferred Shading," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 173--180.
[34]
M. Steinberger, B. Kainz, B. Kerbl, S. Hauswiesner, M. Kenzel, and D. Schmalstieg, "Softshell: Dynamic Scheduling on GPUs," ACM Trans Graph, vol. 31, no. 6, pp. 161:1--161:11, Nov. 2012.
[35]
T. Aila and S. Laine, "Understanding the Efficiency of Ray Traversal on GPUs," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 145--149.
[36]
S. Tzeng, A. Patney, and J. D. Owens, "Task Management for Irregular-Parallel Workloads on the GPU," in Proceedings of the Conference on High Performance Graphics, Airela-Ville, Switzerland, 2010, pp. 29--37.
[37]
W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, N.Y., USA, 2011, pp. 296--307.
[38]
A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 142--153.

Cited By

View all
  • (2023)Turn-based Spatiotemporal Coherence for GPUsACM Transactions on Architecture and Code Optimization10.1145/359305420:3(1-27)Online publication date: 19-Jul-2023
  • (2022)QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPUParallel Computing10.1016/j.parco.2022.102958113(102958)Online publication date: Oct-2022
  • (2021)GraphPEGACM Transactions on Architecture and Code Optimization10.1145/345044018:3(1-24)Online publication date: 10-May-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News
ACM SIGARCH Computer Architecture News  Volume 42, Issue 3
ISCA '14
June 2014
552 pages
ISSN:0163-5964
DOI:10.1145/2678373
Issue’s Table of Contents
  • cover image ACM Conferences
    ISCA '14: Proceeding of the 41st annual international symposium on Computer architecuture
    June 2014
    566 pages
    ISBN:9781479943944

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2014
Published in SIGARCH Volume 42, Issue 3

Check for updates

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)17
  • Downloads (Last 6 weeks)4
Reflects downloads up to 21 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Turn-based Spatiotemporal Coherence for GPUsACM Transactions on Architecture and Code Optimization10.1145/359305420:3(1-27)Online publication date: 19-Jul-2023
  • (2022)QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPUParallel Computing10.1016/j.parco.2022.102958113(102958)Online publication date: Oct-2022
  • (2021)GraphPEGACM Transactions on Architecture and Code Optimization10.1145/345044018:3(1-24)Online publication date: 10-May-2021
  • (2021)Systems-on-Chip with Strong OrderingACM Transactions on Architecture and Code Optimization10.1145/342815318:1(1-27)Online publication date: 20-Jan-2021
  • (2019)Extracting SIMD Parallelism from Recursive Task-Parallel ProgramsACM Transactions on Parallel Computing10.1145/33656636:4(1-37)Online publication date: 26-Dec-2019
  • (2019)From piz daint to the starsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356221(1-37)Online publication date: 17-Nov-2019
  • (2019)SMQoS: Improving Utilization and Energy Efficiency with QoS Awareness on GPUs2019 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER.2019.8891047(1-5)Online publication date: Sep-2019
  • (2018)ComP-netProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243179(1-13)Online publication date: 1-Nov-2018
  • (2016)KLAPThe 49th Annual IEEE/ACM International Symposium on Microarchitecture10.5555/3195638.3195654(1-12)Online publication date: 15-Oct-2016
  • (2015)Efficient execution of recursive programs on commodity vector hardwareACM SIGPLAN Notices10.1145/2813885.273800450:6(509-520)Online publication date: 3-Jun-2015
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media