DOI: 10.1145/2935764.2935797
Public Access
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Published: 11 July 2016

Abstract

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e., "||" (parallel) and ";" (serial), that comprise the nested-parallel model are insufficient for expressing "partial dependencies" in a program. We propose a new dataflow composition construct "↝" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms, ranging from dense linear algebra to dynamic programming, in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and prove guarantees on their ability to balance nodes across processors and preserve locality. For this, we adapt space-bounded (SB) schedulers to the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\left(\sum_{i=0}^{h-1} Q^{*}(t;\, \sigma \cdot M_i) \cdot C_i\right)/p\right)$ on a $p$-processor machine, where $Q^{*}$ is the parallel cache complexity of task $t$, $C_i$ is the cost of a cache miss at the level-$i$ cache of size $M_i$, and $\sigma \in (0,1)$ is a constant.
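To make the abstract's central idea concrete: "A ; B" forces B to wait for all of A, while "A || B" allows no dependency at all, so a consumer that needs only one sub-result of a producer cannot be expressed in the NP model. The C++ sketch below is a rough, self-contained analogy (it is not the paper's "↝" construct, notation, or space-bounded scheduler; all function names are hypothetical) that uses standard futures to encode such a partial dependency:

```cpp
// Rough analogy to a partial ("dataflow") dependency, sketched with
// standard C++ futures. This is NOT the paper's "~>" construct or its
// space-bounded runtime; names are illustrative only.
#include <future>
#include <iostream>

int  produce_part1() { return 1; }   // the only sub-result consume() needs
void produce_part2() { /* unrelated sibling work */ }

int consume(int x) { return x + 41; }

int main() {
    // Launch both producers; they may run concurrently.
    std::future<int>  part1 = std::async(std::launch::async, produce_part1);
    std::future<void> part2 = std::async(std::launch::async, produce_part2);

    // Dataflow-style synchronization: consume() blocks only on part1's
    // value. Under ";" it would also have to wait for produce_part2().
    int result = consume(part1.get());

    part2.get();                      // join the remaining sibling work
    std::cout << result << "\n";      // prints 42
}
```

In nested-parallel terms, the best one can write is "(produce_part1 || produce_part2) ; consume", which over-synchronizes; a dataflow edge lets the consumer start as soon as its input is ready, which is the extra "parallelizability" the paper's schedulers exploit.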






Published In

SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures
July 2016
492 pages
ISBN: 978-1-4503-4210-0
DOI: 10.1145/2935764
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. cache-oblivious algorithms
  2. cache-oblivious wavefront
  3. data-flow
  4. dynamic programming
  5. fork-join
  6. nested parallelism
  7. numerical algorithms
  8. parallel programming models
  9. shared-memory multicore processors
  10. space-bounded scheduler

Qualifiers

  • Research-article


Conference

SPAA '16

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%


Article Metrics

  • Downloads (Last 12 months): 89
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 14 Nov 2024


Cited By

  • (2024) Teaching Parallel Algorithms Using the Binary-Forking Model. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 346-351. DOI: 10.1109/IPDPSW63119.2024.00080. Online publication date: 27-May-2024.
  • (2023) Provably Fast and Space-Efficient Parallel Biconnectivity. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 52-65. DOI: 10.1145/3572848.3577483. Online publication date: 25-Feb-2023.
  • (2022) Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, pages 273-286. DOI: 10.1145/3490148.3538574. Online publication date: 11-Jul-2022.
  • (2021) Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths. Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, pages 184-197. DOI: 10.1145/3409964.3461782. Online publication date: 6-Jul-2021.
  • (2020) Optimal Parallel Algorithms in the Binary-Forking Model. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 89-102. DOI: 10.1145/3350755.3400227. Online publication date: 6-Jul-2020.
  • (2019) BLAS-on-flash. Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, pages 469-483. DOI: 10.5555/3323234.3323273. Online publication date: 26-Feb-2019.
  • (2019) Detecting Non-sibling Dependencies in OpenMP Task-Based Applications. OpenMP: Conquering the Full Hardware Spectrum, pages 231-245. DOI: 10.1007/978-3-030-28596-8_16. Online publication date: 9-Aug-2019.
  • (2017) POSTER. ACM SIGPLAN Notices, 52(8):455-456. DOI: 10.1145/3155284.3019029. Online publication date: 26-Jan-2017.
  • (2017) Brief Announcement. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 279-281. DOI: 10.1145/3087556.3087593. Online publication date: 24-Jul-2017.
  • (2017) POSTER. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 455-456. DOI: 10.1145/3018743.3019029. Online publication date: 26-Jan-2017.
