DOI: 10.1145/2935764.2935797
Public Access
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Published: 11 July 2016

Abstract

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e., "||" (parallel) and ";" (serial), that comprise the nested-parallel model are insufficient for expressing "partial dependencies" in a program. We propose a new dataflow composition construct "↝" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-and-conquer algorithms, ranging from dense linear algebra to dynamic programming, in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and prove guarantees on their ability to balance nodes across processors and preserve locality. For this, we adapt space-bounded (SB) schedulers to the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\left(\sum_{i=0}^{h-1} Q^{*}(t;\, \sigma \cdot M_i) \cdot C_i\right)/p\right)$ on a $p$-processor machine, where $Q^{*}$ is the parallel cache complexity of task $t$, $C_i$ is the cost of a cache miss at the level-$i$ cache of size $M_i$, and $\sigma \in (0,1)$ is a constant.
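To make the abstract's central idea concrete: "A ; B" forces B to wait for all of A, while "A || B" allows no dependency at all, so a consumer that needs only one sub-result of a producer cannot be expressed in the NP model. The C++ sketch below is a rough, self-contained analogy (it is not the paper's "↝" construct, notation, or space-bounded scheduler; all function names are hypothetical) that uses standard futures to encode such a partial dependency:

```cpp
// Rough analogy to a partial ("dataflow") dependency, sketched with
// standard C++ futures. This is NOT the paper's "~>" construct or its
// space-bounded runtime; names are illustrative only.
#include <future>
#include <iostream>

int  produce_part1() { return 1; }   // the only sub-result consume() needs
void produce_part2() { /* unrelated sibling work */ }

int consume(int x) { return x + 41; }

int main() {
    // Launch both producers; they may run concurrently.
    std::future<int>  part1 = std::async(std::launch::async, produce_part1);
    std::future<void> part2 = std::async(std::launch::async, produce_part2);

    // Dataflow-style synchronization: consume() blocks only on part1's
    // value. Under ";" it would also have to wait for produce_part2().
    int result = consume(part1.get());

    part2.get();                      // join the remaining sibling work
    std::cout << result << "\n";      // prints 42
}
```

In nested-parallel terms, the best one can write is "(produce_part1 || produce_part2) ; consume", which over-synchronizes; a dataflow edge lets the consumer start as soon as its input is ready, which is the extra "parallelizability" the paper's schedulers exploit.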






Published In

SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures
July 2016
492 pages
ISBN: 978-1-4503-4210-0
DOI: 10.1145/2935764
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. cache-oblivious algorithms
  2. cache-oblivious wavefront
  3. data-flow
  4. dynamic programming
  5. fork-join
  6. nested parallelism
  7. numerical algorithms
  8. parallel programming models
  9. shared-memory multicore processors
  10. space-bounded scheduler

Qualifiers

  • Research-article


Conference

SPAA '16

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%


Article Metrics

  • Downloads (Last 12 months): 89
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 14 Nov 2024


Cited By

  • (2024) Teaching Parallel Algorithms Using the Binary-Forking Model. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 346-351. DOI: 10.1109/IPDPSW63119.2024.00080. Online publication date: 27-May-2024.
  • (2023) Provably Fast and Space-Efficient Parallel Biconnectivity. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 52-65. DOI: 10.1145/3572848.3577483. Online publication date: 25-Feb-2023.
  • (2022) Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, pages 273-286. DOI: 10.1145/3490148.3538574. Online publication date: 11-Jul-2022.
  • (2021) Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths. Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, pages 184-197. DOI: 10.1145/3409964.3461782. Online publication date: 6-Jul-2021.
  • (2020) Optimal Parallel Algorithms in the Binary-Forking Model. Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, pages 89-102. DOI: 10.1145/3350755.3400227. Online publication date: 6-Jul-2020.
  • (2019) BLAS-on-flash. Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, pages 469-483. DOI: 10.5555/3323234.3323273. Online publication date: 26-Feb-2019.
  • (2019) Detecting Non-sibling Dependencies in OpenMP Task-Based Applications. OpenMP: Conquering the Full Hardware Spectrum, pages 231-245. DOI: 10.1007/978-3-030-28596-8_16. Online publication date: 9-Aug-2019.
  • (2017) POSTER. ACM SIGPLAN Notices, 52(8):455-456. DOI: 10.1145/3155284.3019029. Online publication date: 26-Jan-2017.
  • (2017) Brief Announcement. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 279-281. DOI: 10.1145/3087556.3087593. Online publication date: 24-Jul-2017.
  • (2017) POSTER. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 455-456. DOI: 10.1145/3018743.3019029. Online publication date: 26-Jan-2017.
