research-article

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Authors:

Jesmin Jahan Tithi,

Pramod Ganapathi,

Rezaul A. ChowdhuryAuthors Info & Claims

ACM SIGPLAN Notices, Volume 50, Issue 8

Pages 205 - 214

https://doi.org/10.1145/2858788.2688514

Published: 24 January 2015 Publication History

Abstract

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all its real dependency constraints are satisfied, while preserving the cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of 2-way DAC-based Floyd Warshall's algorithm on an $n$ node graph from $Thn^2n$ to $Thn$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $Th(d^2 h) w^ (d+2) - 1$ to $Thh$, and LCS on two sequences of length $n$ each from $Thn^_2 3$ to $Thn$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$ - $5$ times improvement in absolute running time, $10$ - $20$ times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.

References

[1]

U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1–12, 2000.

Digital Library

[2]

K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work stealing. In IPDPS, pages 1–12. IEEE, April 2010.

[3]

A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974.

Digital Library

[4]

R. Bellman. Dynamic Programming. Princeton University Press, 1957.

Digital Library

[5]

G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 48–57, New York, NY, USA, 2006. ACM.

Digital Library

[6]

R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486–1505, 1992. ISSN 0022-3670.

[7]

G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pages 235–244. ACM, 2004.

Digital Library

[8]

G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 355–366, New York, NY, USA, 2011. ACM.

Digital Library

[9]

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, Santa Barbara, California, July 1995.

Digital Library

[10]

S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. SC Conference, 0:42, 2000.

Digital Library

[11]

R. Chowdhury. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 2007.

Digital Library

[12]

R. Chowdhury and V. Ramachandran. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207–216, 2008.

Digital Library

[13]

R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Computing Systems, 47(4):878–919, 2010.

Digital Library

[14]

R. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. Journal of Parallel and Distributed Computing (Special issue on best papers from IPDPS 2010, 2011 and 2012), 73(7):911–925, 2013. A preliminary version appeared as {17}.

[15]

R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 06, pages 591–600, 2006.

Digital Library

[16]

R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. TCBB, 7(3):495–510, July-Sept. 2010.

Digital Library

[17]

R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, pages 1––12, April 2010.

[18]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009.

Digital Library

[19]

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1–4:12, Austin, TX, Nov. 15–18 2008.

Digital Library

[20]

R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.

[21]

H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533–538, Las Vegas, NV, July13–16 2009.

[22]

R. Floyd. Algorithm 97 (SHORTEST PATH). Commun. ACM, 5(6): 345, 1962.

Digital Library

[23]

M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361–366, Cambridge, MA, June 20–22 2005.

Digital Library

[24]

M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, pages 271–280, 2006.

Digital Library

[25]

M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. Theory of Computing Systems, 45(2):203– 233, 2009.

Digital Library

[26]

M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI ’98, pages 212–223, 1998.

Digital Library

[27]

M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cacheoblivious algorithms. In FOCS, pages 285–297, New York, NY, Oct. 17–19 1999.

Digital Library

[28]

Z. Galil and R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64: 107–118, 1989.

Digital Library

[29]

Z. Galil and K. Park. Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing, 21:213–222, 1994.

Digital Library

[30]

R. Giegerich, C. Meyer, and P. Steffen. A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3):215–263, 2004.

Digital Library

[31]

O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.

[32]

J. Guo, G. Biksh, B. B. Fraguela, M. J. Garzarn, and D. Padua. Programming with tiles. In In PPoPP 08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008.

Digital Library

[33]

D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.

[34]

Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145–156, Santorini, Greece, June 13–15 2010.

Digital Library

[35]

Intel Corporation. The Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/ architecture-and-technology/many-integrated-core/ intel-many-integrated-core-architecture.html, 2011.

[36]

S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36–43, Chicago, IL, June 12 2005.

Digital Library

[37]

S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51–60, San Jose, CA, 2006. ISBN 1-59593-578-9. URL http://doi.acm.org/10.1145/1178597.1178605.

Digital Library

[38]

J. O. S. Kennedy. Applications of dynamic programming to agriculture, forestry and fisheries: Review and prognosis. Review of Marketing and Agricultural Economics, 49(03), 1981.

[39]

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10–13 2007.

Digital Library

[40]

S. Maleki, M. Musuvathi, and T. Mytkowicz. Parallelizing dynamic programming through rank convergence. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’14, pages 219–232, New York, NY, USA, 2014.

Digital Library

[41]

ACM. ISBN 978-1-4503-2656-8.

[42]

J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010, pages 1–12, 2010.

[43]

A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83(2-3):197–214, 1994. ISSN 0010-4655.

[44]

A. Nitsure. Implementation and optimization of a cache oblivious lattice Boltzmann algorithm. Master’s thesis, Institut für Informatic, Friedrich-Alexander-Universität Erlangen-Nürnberg, July 2006.

[45]

L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1–11, Rome, Italy, May 23–29 2009.

Digital Library

[46]

A. A. Robichek, E. J. Elton, and M. J. Gruber. Dynamic programming applications in finance. The Journal of Finance, 26(2):473–506, 1971.

[47]

D. Romer. It’s fourth down and what does the bellman equation say? a dynamic programming analysis of football strategy. Technical report, National Bureau of Economic Research, 2002.

[48]

J. Rust. Numerical dynamic programming in economics. Handbook of computational economics, 1:619–729, 1996.

[49]

J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In SPAA, pages 152–163, 2013.

Digital Library

[50]

H. V. Simhadri, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and A. Kyrola. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’14, pages 30–41, New York, NY, USA, 2014. ACM.

Digital Library

[51]

D. K. Smith. Dynamic programming and board games: A survey. European Journal of Operational Research, 176(3):1299–1318, 2007.

[52]

A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Norwood, MA, 2000. ISBN 1580530761.

[53]

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, San Jose, CA, USA, 2011.

Digital Library

[54]

Y. Tang, R. A. Chowdhury, C.-K. Luk, and C. E. Leiserson. Coding stencil computation using the Pochoir stencil-specification language. In HotPar’11, Berkeley, CA, USA, May 2011.

[55]

G. Tzenakis, A. Papatriantafyllou, H. Vandierendonck, P. Pratikakis, and D. S. Nikolopoulos. BDDT: block-level dynamic dependence analysis for task-based parallelism. In Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013, Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers, pages 17–31, 2013.

[56]

S. Warshall. A theorem on boolean matrices. J. ACM, 9(1):11–12, 1962.

Digital Library

[57]

M. Waterman. Introduction to Computational Biology. Chapman & Hall, London, UK, 1995.

[58]

S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1–14, Miami, FL, Apr. 2008.

Cited By

Li YSchwiebert L(2020)Memory-Optimized Wavefront Parallelism on GPUsInternational Journal of Parallel Programming10.1007/s10766-020-00658-yOnline publication date: 25-Mar-2020
https://doi.org/10.1007/s10766-020-00658-y
Li ZChen W(2019)An Algorithmic Model Building Scheme Based on Dynamic Programming AlgorithmsJournal of Physics: Conference Series10.1088/1742-6596/1345/5/0520801345:5(052080)Online publication date: 1-Nov-2019
https://doi.org/10.1088/1742-6596/1345/5/052080
Yuan LZhang YGuo PHuang SMohr BRaghavan P(2017)Tessellating stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126920(1-13)Online publication date: 12-Nov-2017
https://dl.acm.org/doi/10.1145/3126908.3126920
Show More Cited By

Index Terms

Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency

Recommendations

Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms
SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures

Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and ...
Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best ...
Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance
WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations

The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices

ACM SIGPLAN Notices Volume 50, Issue 8

PPoPP '15

August 2015

290 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2858788

Editor:
Andy Gill
University of Kansas, Lawrence, KS

Issue’s Table of Contents

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
ISBN:9781450332057
DOI:10.1145/2688500
General Chair:
Albert Cohen
INRIA, France
,
Program Chair:
David Grove
IBM Research, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 January 2015

Published in SIGPLAN Volume 50, Issue 8

Check for updates

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

26
Total Citations
View Citations
699
Total Downloads

Downloads (Last 12 months)22
Downloads (Last 6 weeks)3

Reflects downloads up to 14 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Li YSchwiebert L(2020)Memory-Optimized Wavefront Parallelism on GPUsInternational Journal of Parallel Programming10.1007/s10766-020-00658-yOnline publication date: 25-Mar-2020
https://doi.org/10.1007/s10766-020-00658-y
Li ZChen W(2019)An Algorithmic Model Building Scheme Based on Dynamic Programming AlgorithmsJournal of Physics: Conference Series10.1088/1742-6596/1345/5/0520801345:5(052080)Online publication date: 1-Nov-2019
https://doi.org/10.1088/1742-6596/1345/5/052080
Yuan LZhang YGuo PHuang SMohr BRaghavan P(2017)Tessellating stencilsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126920(1-13)Online publication date: 12-Nov-2017
https://dl.acm.org/doi/10.1145/3126908.3126920
Chowdhury RGanapathi PTschudi STithi JBachmeier CLeiserson CSolar-Lezama AKuszmaul BTang Y(2017)AutogenACM Transactions on Parallel Computing10.1145/31256324:1(1-30)Online publication date: 5-Oct-2017
https://dl.acm.org/doi/10.1145/3125632
Bilardi GPietracaprina APucci GScquizzato MSilvestri F(2016)Network-Oblivious AlgorithmsJournal of the ACM10.1145/281280463:1(1-36)Online publication date: 30-Mar-2016
https://dl.acm.org/doi/10.1145/2812804
Lakshminarasimhan MHall MWilliams SAntepara O(2024)BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673046(576-586)Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673046
Ding XGu YSun YAgrawal KPetrank E(2024)Parallel and (Nearly) Work-Efficient Dynamic ProgrammingProceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659958(219-232)Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3626183.3659958
Blelloch GGu YSun Y(2024)Teaching Parallel Algorithms Using the Binary-Forking Model2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00080(346-351)Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00080
Noroozi FDaneshmand MFiorini P(2023)Conventional, Heuristic and Learning-Based Robot Motion Planning: Reviewing Frameworks of Current Practical SignificanceMachines10.3390/machines1107072211:7(722)Online publication date: 7-Jul-2023
https://doi.org/10.3390/machines11070722
Zhang LWahib MChen PMeng JWang XEndo TMatsuoka SGallivan KNikolopoulos DBeivide RGallopoulos E(2023)Revisiting Temporal Blocking Stencil OptimizationsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593716(251-263)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593716
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents