DOI: 10.1145/2600212.2600228
research-article

Design and evaluation of the GeMTC framework for GPU-enabled many-task computing

Published: 23 June 2014

Abstract

We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such "many-task" applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.
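
To make the scale figures in the abstract concrete, here is a minimal Python sketch. It is not GeMTC's actual API. It assumes the NVIDIA warp size of 32 threads, and it models each warp as an independent worker that takes the next task from a shared queue whenever it becomes free. The names `threads_spanned` and `run_many_tasks` are hypothetical and exist only for this illustration.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (assumption stated in the lead-in)


def threads_spanned(num_warps, warp_size=WARP_SIZE):
    """Total GPU threads covered by a given number of warps."""
    return num_warps * warp_size


def run_many_tasks(task_durations_ms, num_workers):
    """Toy model of many-task execution (hypothetical, not GeMTC code).

    Each worker stands in for one warp. Tasks are taken from a FIFO queue
    by whichever worker becomes free first. Returns the accumulated busy
    time of every worker in milliseconds.
    """
    queue = deque(task_durations_ms)
    busy = [0] * num_workers
    while queue:
        # The worker with the least accumulated work is the next one free.
        worker = min(range(num_workers), key=busy.__getitem__)
        busy[worker] += queue.popleft()
    return busy


if __name__ == "__main__":
    # 86,000 warps of 32 threads each, consistent with the roughly 2.7M threads quoted.
    print(threads_spanned(86_000))  # 2752000
    # Five short tasks of varying duration shared by two workers.
    print(run_many_tasks([500, 100, 300, 200, 400], num_workers=2))  # [900, 600]
```

The toy scheduler shows why short tasks of variable duration benefit from independent workers: each worker proceeds as soon as it is free instead of waiting for the slowest task in a batch.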


Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing
June 2014
334 pages
ISBN: 9781450327497
DOI: 10.1145/2600212
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. accelerators
  2. cuda
  3. execution models
  4. gpgpu
  5. hybrid execution
  6. many-task computing
  7. programming models
  8. workflow

Qualifiers

  • Research-article

Conference

HPDC'14
Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;
Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 5

Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 35:10 (1796-1809). DOI: 10.1109/TPDS.2024.3429010. Online publication date: Oct-2024
  • (2024) Towards Fine-Grained Parallelism in Parallel and Distributed Python Libraries. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (706-715). DOI: 10.1109/IPDPSW63119.2024.00133. Online publication date: 27-May-2024
  • (2024) A Runtime Manager Integrated Emulation Environment for Heterogeneous SoC Design with RISC-V Cores. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (23-30). DOI: 10.1109/IPDPSW63119.2024.00013. Online publication date: 27-May-2024
  • (2024) MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access 12 (34354-34377). DOI: 10.1109/ACCESS.2024.3372990. Online publication date: 2024
  • (2023) CEDR: A Compiler-integrated, Extensible DSSoC Runtime. ACM Transactions on Embedded Computing Systems 22:2 (1-34). DOI: 10.1145/3529257. Online publication date: 24-Jan-2023
  • (2022) A Provenance-based Execution Strategy for Variant GPU-accelerated Scientific Workflows in Clouds. Journal of Grid Computing 20:4. DOI: 10.1007/s10723-022-09625-y. Online publication date: 1-Dec-2022
  • (2021) IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems. 2021 IEEE High Performance Extreme Computing Conference (HPEC) (1-8). DOI: 10.1109/HPEC49654.2021.9622873. Online publication date: 20-Sep-2021
  • (2020) TReM: A Task Revocation Mechanism for GPUs. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (273-282). DOI: 10.1109/HPCC-SmartCity-DSS50907.2020.00034. Online publication date: Dec-2020
  • (2019) Concurrent query processing in a GPU-based database system. PLOS ONE 14:4 (e0214720). DOI: 10.1371/journal.pone.0214720. Online publication date: 16-Apr-2019
  • (2019) Pagoda. ACM Transactions on Parallel Computing 6:4 (1-23). DOI: 10.1145/3365657. Online publication date: 19-Nov-2019
