research-article

Open access

Blaze-Tasks: A Framework for Computing Parallel Reductions over Tasks

Authors:

Peter Pirkelbauer,

Christina Peterson,

Damian DechevAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 4

Article No.: 66, Pages 1 - 25

https://doi.org/10.1145/3293448

Published: 08 January 2019 Publication History

All formats PDF

Abstract

Compared to threads, tasks are a more fine-grained alternative. The task parallel programming model offers benefits in terms of better performance portability and better load-balancing for problems that exhibit nonuniform workloads. A common scenario of task parallel programming is that a task is recursively decomposed into smaller sub-tasks. Depending on the problem domain, the number of created sub-tasks may be nonuniform, thereby creating potential for significant load imbalances in the system. Dynamic load-balancing mechanisms will distribute the tasks across available threads. The final result of a computation may be modeled as a reduction over the results of all sub-tasks.

This article describes a simple, yet effective prototype framework, Blaze-Tasks, for task scheduling and task reductions on shared memory architectures. The framework has been designed with lock-free techniques and generic programming principles in mind. Blaze-Tasks is implemented entirely in C++17 and is thus portable. To load-balance the computation, Blaze-Tasks uses task stealing. To manage contention on a task pool, the number of lock-free attempts to steal a task depends on the distance between thief and pool owner and the estimated number of tasks in a victim’s pool. This article evaluates the Blaze framework on Intel and IBM dual-socket systems using nine benchmarks and compares its performance with other task parallel frameworks. While Cilk outperforms Blaze on Intel on most benchmarks, the evaluation shows that Blaze is competitive with OpenMP and other library-based implementations. On IBM, the experiments show that Blaze outperforms other approaches on most benchmarks.

References

[1]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’98). 119--129.

Digital Library

[2]

Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell. 2012. Clarifying and compiling C/C++ concurrency: From C++11 to POWER. SIGPLAN Not. 47, 1 (Jan. 2012), 509--520.

Digital Library

[3]

Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (Sep. 1999), 720--748.

Digital Library

[4]

Hans-J. Boehm. 2002. Bounding space usage of conservative garbage collectors. In Proceeedings of the 2002 ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Digital Library

[5]

Hans-J. Boehm. 2016. Temporarily Discourage Memory_order_consume. Technical Report P0371R1. JTC1/SC22/WG21 C++ Standards Committee.

[6]

Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ concurrency memory model. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’08). ACM, New York, NY, 68--78.

Digital Library

[7]

François Broquedis, Thierry Gautier, and Vincent Danjean. 2012. LIBKOMP, an efficient openMP runtime system for both fork-join and data flow paradigms. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World (IWOMP’12). Springer-Verlag, Berlin, 102--115.

Digital Library

[8]

Trevor Alexander Brown. 2015. Reclaiming memory for lock-free data structures: There has to be a better way. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC’15). ACM, New York, NY, 261--270.

Digital Library

[9]

J. Ciesko, S. Mateo, X. Teruel, X. Martorell, E. Ayguade, J. Labarta, A. Duran, B. De Supinski, S. Olivier, K. Li, and A. Eichenberger. 2015. Towards task-parallel reductions in OpenMP. In Proceedings of the International Workshop on OpenMP. Springer, 189--201.

[10]

de Supinski, Bronis and Klemm, Michael (eds.). 2017. OpenMP Technical Report 6:Version 5.0 Preview 2. Technical Report. OpenMP Architecture Review Board.

[11]

Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. 2006. Lock-free dynamically resizable arrays. In Proceedings of the International Conference on Principles of Distributed Systems (OPODIS’06), Alexander A. Shvartsman (Ed.), Lecture Notes in Computer Science, Vol. 4305. Springer, 142--156.

Digital Library

[12]

James C. Dehnert and Alexander A. Stepanov. 2000. Fundamentals of generic programming. In Selected Papers from the International Seminar on Generic Programming. Springer-Verlag, London, UK, 1--11.

Digital Library

[13]

Noah Evans, Stephen L. Olivier, Richard Barrett, and George Stelle. 2017. Scheduling chapel tasks with qthreads on manycore: A tale of two schedulers. In Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS’17). ACM, New York, NY, Article 4, 8 pages.

Digital Library

[14]

Steven Feldman and Damian Dechev. 2015. A wait-free multi-producer multi-consumer ring buffer. SIGAPP Appl. Comput. Rev. 15, 3 (Oct. 2015), 59--71.

Digital Library

[15]

Matteo Frigo, Pablo Halpern, Charles E. Leiserson, and Stephen Lewin-Berlin. 2009. Reducers and other cilk++ hyperobjects. In Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA’09). ACM, New York, NY, 79--90.

Digital Library

[16]

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI’98). ACM, New York, NY, 212--223.

Digital Library

[17]

Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole. 2007. Performance of memory reclamation for lockless synchronization. J. Parallel Distrib. Comput. 67, 12 (Dec. 2007), 1270--1285.

Digital Library

[18]

Maurice Herlihy, Victor Luchangco, Paul Martin, and Mark Moir. 2005. Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst. 23, 2 (2005), 146--196.

Digital Library

[19]

Maurice Herlihy and Nir Shavit. 2012. The Art of Multiprocessor Programming (revised 1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA.

Digital Library

[20]

Intel Corp.2018. Reference for Intel Threading Building Blocks, Version 1.0. Retrieved on May 5, 2018 from http://threadingbuildingblocks.org/.

[21]

ISO/IEC 14882:2017(E) International Standard. 2017. Programming Language C++. JTC1/SC22/WG21 - The C++ Standards Committee.

[22]

Jaakko Järvi and John Freeman. 2010. C++ lambda expressions and closures. Sci. Comput. Program. 75, 9 (Sep. 2010), 762--772.

Digital Library

[23]

Jim Lambers. 2009. Lecture Notes on Adaptive Quadrature. Retrieved April 22, 2018 from http://www.math.usm.edu/lambers/mat460/fall09/lecture30.

[24]

Charles E. Leiserson. 2009. The cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference (DAC’09). ACM, New York, NY, 522--527.

Digital Library

[25]

Maged M. Michael. 2002. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the 21st Annual Symposium on Principles of Distributed Computing (PODC’02). ACM Press, New York, NY, 21--30.

Digital Library

[26]

Maged M. Michael. 2003. CAS-based lock-free algorithm for shared deques. In Proceedings of the European Conference on Parallel Processing (Euro-Par’03), Lecture Notes in Computer Science, Vol. 2790. 651--660.

[27]

Maged M. Michael and Michael L. Scott. 1996. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing. ACM, 267--275.

Digital Library

[28]

Seung-Jai Min, Costin Iancu, and Katherine Yelick. 2011. Hierarchical work stealing on manycore clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models.

[29]

Ananya Muddukrishna, Peter A. Jonsson, and Mats Brorsson. 2016. Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors. Sci. Program. 2015, Article 5 (Jan. 2016), 1 pages.

Digital Library

[30]

Stephen Olivier, Jun Huan, Jinze Liu, Jan Prins, James Dinan, P. Sadayappan, and Chau-Wen Tseng. 2007. UTS: An unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing (LCPC’06). Springer-Verlag, Berlin, 235--250. Retrieved from http://dl.acm.org/citation.cfm?id=1757112.1757137.

Digital Library

[31]

Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, Michael Spiegel, and Jan F. Prins. 2012. OpenMP task scheduling strategies for multicore NUMA systems. Int. J. High Perform. Comput. Appl. 26, 2 (May 2012), 110--124.

Digital Library

[32]

Peter Pirkelbauer and Nicholas Dzugan. 2015. The BLAZE Concurrent Library. Retrieved from http://iprogress.cis.uab.edu/blaze.

[33]

Peter Pirkelbauer, Reed Milewicz, and Juan Felipe Gonzalez. 2016. A portable lock-free bounded queue. In Proceedings of the 2016 16th International Conference on Algorithms and Architectures for Parallel Processing.

[34]

Peter Pirkelbauer, Amalee Wilson, Hadia Ahmed, and Reed Milewicz. 2017. Memory management for concurrent data structures on hardware transactional memory. In Proceedings of the 12th ACM SIGPLAN Workshop on Transactional Computing (Transact’17) and Workshop at the Symposium on Principles and Practice of Parallel Programming (PPoPP’17).

[35]

Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15, 4 (Nov. 1997), 391--411.

Digital Library

[36]

Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. SIGPLAN Not. 52, 8 (Jan. 2017), 249--265.

Digital Library

[37]

Tao B. Schardl, I-Ting Angelina Lee, and Charles E. Leiserson. 2018. Brief announcement: Open Cilk. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA'18). ACM, 351--353.

Digital Library

[38]

S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, P. Carns, A. Castello, D. Genet, T. Herault, S. Iwasaki, P. Jindal, L. V. Kale, S. Krishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, K. Taura, and P. Beckman. 2018. Argobots: A lightweight low-level threading and tasking framework. IEEE Trans. Parallel Distrib. Syst. 29, 3 (Mar. 2018), 512--526.

[39]

Jaroslav Sevcik and Peter Sewell. 2011. C/C++11 mappings to processors. Retrieved from www.cl.cam.ac.uk/pes20/cpp/cpp0xmappings.html.

[40]

Bjarne Stroustrup and Gabriel Dos Reis. 2005. Supporting SELL for high-performance computing. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Vol. 4339. Springer-Verlag, 458--465.

Digital Library

[41]

Hakan Sundell and Philippas Tsigas. 2005. Lock-free and practical doubly linked list-based deques using single-word compare-and-swap. In Proceedings of the International Conference on Principles of Distributed Systems (OPODIS’04), Lecture Notes in Computer Science, Vol. 3544. 240--255.

Digital Library

[42]

Peter Thoman, Kiril Dichev, Thomas Heller, Roman Iakymchuk, Xavier Aguilar, Khalid Hasanov, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Thomas Fahringer, Kostas Katrinis, Erwin Laure, and Dimitrios S. Nikolopoulos. 2018. A taxonomy of task-based parallel programming technologies for high-performance computing. J. Supercomput. 74, 4 (01 Apr. 2018), 1422--1434.

Digital Library

[43]

Peter Thoman, Peter Zangerl, and Thomas Fahringer. 2017. Task-parallel runtime system optimization using static compiler analysis. In Proceedings of the Computing Frontiers Conference (CF’17). ACM, New York, NY, 201--210.

Digital Library

[44]

J. Treibig, G. Hager, and G. Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 1st International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI’10).

Digital Library

[45]

K. B. Wheeler, R. C. Murphy, and D. Thain. 2008. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing. 1--8.

[46]

Anthony Williams. 2012. C++ Concurrency in Action: Practical Multithreading. Manning Publ., Shelter Island, NY.

[47]

Chaoran Yang and John Mellor-Crummey. 2016. A wait-free queue as fast as fetch-and-add. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’16). ACM, New York, NY, Article 16, 13 pages.

Digital Library

Cited By

Reitz LHardenbicker KWerner TFohry C(2023)Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clustersParallel Computing10.1016/j.parco.2023.103020116:COnline publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1016/j.parco.2023.103020
Reitz LHardenbicker KFohry C(2023)Comparison of Load Balancing Schemes for Asynchronous Many-Task RuntimesParallel Processing and Applied Mathematics10.1007/978-3-031-30445-3_2(14-26)Online publication date: 27-Apr-2023
https://doi.org/10.1007/978-3-031-30445-3_2
Posner JReitz LFohry C(2022)Task-Level Resilience: Checkpointing vs. SupervisionInternational Journal of Networking and Computing10.15803/ijnc.12.1_4712:1(47-72)Online publication date: 2022
https://doi.org/10.15803/ijnc.12.1_47
Show More Cited By

Index Terms

Blaze-Tasks: A Framework for Computing Parallel Reductions over Tasks
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent algorithms
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Self-organization
      2. Shared memory algorithms

Recommendations

Global-view abstractions for user-defined reductions and scans
PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Since APL, reductions and scans have been recognized as powerful programming concepts. Abstracting an accumulation loop (reduction) and an update loop (scan), the concepts have efficient parallel implementations based on the parallel prefix algorithm. ...
Automatic annotation of tasks in structured code
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

This paper describes the design and implementation of a suit of static analyses and code generation techniques to annotate programs with OpenMP pragmas for task parallelism. These techniques approximate the ranges covered by memory regions, bound ...
The tasks with effects model for safe concurrency
PPoPP '13

Today's widely-used concurrent programming models either provide weak safety guarantees, making it easy to write code with subtle errors, or are limited in the class of programs that they can express. We propose a new concurrent programming model based ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 15, Issue 4

December 2018

706 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3284745

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Copyright © 2019 ACM.

© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 January 2019

Accepted: 01 November 2018

Revised: 01 October 2018

Received: 01 June 2018

Published in TACO Volume 15, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
1,213
Total Downloads

Downloads (Last 12 months)187
Downloads (Last 6 weeks)22

Reflects downloads up to 14 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Reitz LHardenbicker KWerner TFohry C(2023)Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clustersParallel Computing10.1016/j.parco.2023.103020116:COnline publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1016/j.parco.2023.103020
Reitz LHardenbicker KFohry C(2023)Comparison of Load Balancing Schemes for Asynchronous Many-Task RuntimesParallel Processing and Applied Mathematics10.1007/978-3-031-30445-3_2(14-26)Online publication date: 27-Apr-2023
https://doi.org/10.1007/978-3-031-30445-3_2
Posner JReitz LFohry C(2022)Task-Level Resilience: Checkpointing vs. SupervisionInternational Journal of Networking and Computing10.15803/ijnc.12.1_4712:1(47-72)Online publication date: 2022
https://doi.org/10.15803/ijnc.12.1_47
Ciesko JOlivier S(2022)Characterizing the Performance of Task Reductions in OpenMP 5.X ImplementationsOpenMP in a Modern World: From Multi-device Support to Meta Programming10.1007/978-3-031-15922-0_3(35-49)Online publication date: 27-Sep-2022
https://dl.acm.org/doi/10.1007/978-3-031-15922-0_3

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents