Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Blaze-Tasks: A Framework for Computing Parallel Reductions over Tasks

Published: 08 January 2019 Publication History

Abstract

Compared to threads, tasks are a more fine-grained alternative. The task parallel programming model offers benefits in terms of better performance portability and better load-balancing for problems that exhibit nonuniform workloads. A common scenario of task parallel programming is that a task is recursively decomposed into smaller sub-tasks. Depending on the problem domain, the number of created sub-tasks may be nonuniform, thereby creating potential for significant load imbalances in the system. Dynamic load-balancing mechanisms will distribute the tasks across available threads. The final result of a computation may be modeled as a reduction over the results of all sub-tasks.
This article describes a simple, yet effective prototype framework, Blaze-Tasks, for task scheduling and task reductions on shared memory architectures. The framework has been designed with lock-free techniques and generic programming principles in mind. Blaze-Tasks is implemented entirely in C++17 and is thus portable. To load-balance the computation, Blaze-Tasks uses task stealing. To manage contention on a task pool, the number of lock-free attempts to steal a task depends on the distance between thief and pool owner and the estimated number of tasks in a victim’s pool. This article evaluates the Blaze framework on Intel and IBM dual-socket systems using nine benchmarks and compares its performance with other task parallel frameworks. While Cilk outperforms Blaze on Intel on most benchmarks, the evaluation shows that Blaze is competitive with OpenMP and other library-based implementations. On IBM, the experiments show that Blaze outperforms other approaches on most benchmarks.

References

[1]
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’98). 119--129.
[2]
Mark Batty, Kayvan Memarian, Scott Owens, Susmit Sarkar, and Peter Sewell. 2012. Clarifying and compiling C/C++ concurrency: From C++11 to POWER. SIGPLAN Not. 47, 1 (Jan. 2012), 509--520.
[3]
Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5 (Sep. 1999), 720--748.
[4]
Hans-J. Boehm. 2002. Bounding space usage of conservative garbage collectors. In Proceeedings of the 2002 ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.
[5]
Hans-J. Boehm. 2016. Temporarily Discourage Memory_order_consume. Technical Report P0371R1. JTC1/SC22/WG21 C++ Standards Committee.
[6]
Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ concurrency memory model. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’08). ACM, New York, NY, 68--78.
[7]
François Broquedis, Thierry Gautier, and Vincent Danjean. 2012. LIBKOMP, an efficient openMP runtime system for both fork-join and data flow paradigms. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World (IWOMP’12). Springer-Verlag, Berlin, 102--115.
[8]
Trevor Alexander Brown. 2015. Reclaiming memory for lock-free data structures: There has to be a better way. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC’15). ACM, New York, NY, 261--270.
[9]
J. Ciesko, S. Mateo, X. Teruel, X. Martorell, E. Ayguade, J. Labarta, A. Duran, B. De Supinski, S. Olivier, K. Li, and A. Eichenberger. 2015. Towards task-parallel reductions in OpenMP. In Proceedings of the International Workshop on OpenMP. Springer, 189--201.
[10]
de Supinski, Bronis and Klemm, Michael (eds.). 2017. OpenMP Technical Report 6:Version 5.0 Preview 2. Technical Report. OpenMP Architecture Review Board.
[11]
Damian Dechev, Peter Pirkelbauer, and Bjarne Stroustrup. 2006. Lock-free dynamically resizable arrays. In Proceedings of the International Conference on Principles of Distributed Systems (OPODIS’06), Alexander A. Shvartsman (Ed.), Lecture Notes in Computer Science, Vol. 4305. Springer, 142--156.
[12]
James C. Dehnert and Alexander A. Stepanov. 2000. Fundamentals of generic programming. In Selected Papers from the International Seminar on Generic Programming. Springer-Verlag, London, UK, 1--11.
[13]
Noah Evans, Stephen L. Olivier, Richard Barrett, and George Stelle. 2017. Scheduling chapel tasks with qthreads on manycore: A tale of two schedulers. In Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS’17). ACM, New York, NY, Article 4, 8 pages.
[14]
Steven Feldman and Damian Dechev. 2015. A wait-free multi-producer multi-consumer ring buffer. SIGAPP Appl. Comput. Rev. 15, 3 (Oct. 2015), 59--71.
[15]
Matteo Frigo, Pablo Halpern, Charles E. Leiserson, and Stephen Lewin-Berlin. 2009. Reducers and other cilk++ hyperobjects. In Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA’09). ACM, New York, NY, 79--90.
[16]
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI’98). ACM, New York, NY, 212--223.
[17]
Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, and Jonathan Walpole. 2007. Performance of memory reclamation for lockless synchronization. J. Parallel Distrib. Comput. 67, 12 (Dec. 2007), 1270--1285.
[18]
Maurice Herlihy, Victor Luchangco, Paul Martin, and Mark Moir. 2005. Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst. 23, 2 (2005), 146--196.
[19]
Maurice Herlihy and Nir Shavit. 2012. The Art of Multiprocessor Programming (revised 1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA.
[20]
Intel Corp.2018. Reference for Intel Threading Building Blocks, Version 1.0. Retrieved on May 5, 2018 from http://threadingbuildingblocks.org/.
[21]
ISO/IEC 14882:2017(E) International Standard. 2017. Programming Language C++. JTC1/SC22/WG21 - The C++ Standards Committee.
[22]
Jaakko Järvi and John Freeman. 2010. C++ lambda expressions and closures. Sci. Comput. Program. 75, 9 (Sep. 2010), 762--772.
[23]
Jim Lambers. 2009. Lecture Notes on Adaptive Quadrature. Retrieved April 22, 2018 from http://www.math.usm.edu/lambers/mat460/fall09/lecture30.
[24]
Charles E. Leiserson. 2009. The cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference (DAC’09). ACM, New York, NY, 522--527.
[25]
Maged M. Michael. 2002. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the 21st Annual Symposium on Principles of Distributed Computing (PODC’02). ACM Press, New York, NY, 21--30.
[26]
Maged M. Michael. 2003. CAS-based lock-free algorithm for shared deques. In Proceedings of the European Conference on Parallel Processing (Euro-Par’03), Lecture Notes in Computer Science, Vol. 2790. 651--660.
[27]
Maged M. Michael and Michael L. Scott. 1996. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing. ACM, 267--275.
[28]
Seung-Jai Min, Costin Iancu, and Katherine Yelick. 2011. Hierarchical work stealing on manycore clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models.
[29]
Ananya Muddukrishna, Peter A. Jonsson, and Mats Brorsson. 2016. Locality-aware task scheduling and data distribution for OpenMP programs on NUMA systems and manycore processors. Sci. Program. 2015, Article 5 (Jan. 2016), 1 pages.
[30]
Stephen Olivier, Jun Huan, Jinze Liu, Jan Prins, James Dinan, P. Sadayappan, and Chau-Wen Tseng. 2007. UTS: An unbalanced tree search benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing (LCPC’06). Springer-Verlag, Berlin, 235--250. Retrieved from http://dl.acm.org/citation.cfm?id=1757112.1757137.
[31]
Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, Michael Spiegel, and Jan F. Prins. 2012. OpenMP task scheduling strategies for multicore NUMA systems. Int. J. High Perform. Comput. Appl. 26, 2 (May 2012), 110--124.
[32]
Peter Pirkelbauer and Nicholas Dzugan. 2015. The BLAZE Concurrent Library. Retrieved from http://iprogress.cis.uab.edu/blaze.
[33]
Peter Pirkelbauer, Reed Milewicz, and Juan Felipe Gonzalez. 2016. A portable lock-free bounded queue. In Proceedings of the 2016 16th International Conference on Algorithms and Architectures for Parallel Processing.
[34]
Peter Pirkelbauer, Amalee Wilson, Hadia Ahmed, and Reed Milewicz. 2017. Memory management for concurrent data structures on hardware transactional memory. In Proceedings of the 12th ACM SIGPLAN Workshop on Transactional Computing (Transact’17) and Workshop at the Symposium on Principles and Practice of Parallel Programming (PPoPP’17).
[35]
Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15, 4 (Nov. 1997), 391--411.
[36]
Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. SIGPLAN Not. 52, 8 (Jan. 2017), 249--265.
[37]
Tao B. Schardl, I-Ting Angelina Lee, and Charles E. Leiserson. 2018. Brief announcement: Open Cilk. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA'18). ACM, 351--353.
[38]
S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, P. Carns, A. Castello, D. Genet, T. Herault, S. Iwasaki, P. Jindal, L. V. Kale, S. Krishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, K. Taura, and P. Beckman. 2018. Argobots: A lightweight low-level threading and tasking framework. IEEE Trans. Parallel Distrib. Syst. 29, 3 (Mar. 2018), 512--526.
[39]
Jaroslav Sevcik and Peter Sewell. 2011. C/C++11 mappings to processors. Retrieved from www.cl.cam.ac.uk/pes20/cpp/cpp0xmappings.html.
[40]
Bjarne Stroustrup and Gabriel Dos Reis. 2005. Supporting SELL for high-performance computing. In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Vol. 4339. Springer-Verlag, 458--465.
[41]
Hakan Sundell and Philippas Tsigas. 2005. Lock-free and practical doubly linked list-based deques using single-word compare-and-swap. In Proceedings of the International Conference on Principles of Distributed Systems (OPODIS’04), Lecture Notes in Computer Science, Vol. 3544. 240--255.
[42]
Peter Thoman, Kiril Dichev, Thomas Heller, Roman Iakymchuk, Xavier Aguilar, Khalid Hasanov, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Thomas Fahringer, Kostas Katrinis, Erwin Laure, and Dimitrios S. Nikolopoulos. 2018. A taxonomy of task-based parallel programming technologies for high-performance computing. J. Supercomput. 74, 4 (01 Apr. 2018), 1422--1434.
[43]
Peter Thoman, Peter Zangerl, and Thomas Fahringer. 2017. Task-parallel runtime system optimization using static compiler analysis. In Proceedings of the Computing Frontiers Conference (CF’17). ACM, New York, NY, 201--210.
[44]
J. Treibig, G. Hager, and G. Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 1st International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI’10).
[45]
K. B. Wheeler, R. C. Murphy, and D. Thain. 2008. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing. 1--8.
[46]
Anthony Williams. 2012. C++ Concurrency in Action: Practical Multithreading. Manning Publ., Shelter Island, NY.
[47]
Chaoran Yang and John Mellor-Crummey. 2016. A wait-free queue as fast as fetch-and-add. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’16). ACM, New York, NY, Article 16, 13 pages.

Cited By

View all
  • (2023)Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clustersParallel Computing10.1016/j.parco.2023.103020116:COnline publication date: 1-Jul-2023
  • (2023)Comparison of Load Balancing Schemes for Asynchronous Many-Task RuntimesParallel Processing and Applied Mathematics10.1007/978-3-031-30445-3_2(14-26)Online publication date: 27-Apr-2023
  • (2022)Task-Level Resilience: Checkpointing vs. SupervisionInternational Journal of Networking and Computing10.15803/ijnc.12.1_4712:1(47-72)Online publication date: 2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 4
December 2018
706 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3284745
Issue’s Table of Contents
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 January 2019
Accepted: 01 November 2018
Revised: 01 October 2018
Received: 01 June 2018
Published in TACO Volume 15, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Tasks
  2. lock-free algorithms
  3. reductions

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)187
  • Downloads (Last 6 weeks)22
Reflects downloads up to 14 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Lifeline-based load balancing schemes for Asynchronous Many-Task runtimes in clustersParallel Computing10.1016/j.parco.2023.103020116:COnline publication date: 1-Jul-2023
  • (2023)Comparison of Load Balancing Schemes for Asynchronous Many-Task RuntimesParallel Processing and Applied Mathematics10.1007/978-3-031-30445-3_2(14-26)Online publication date: 27-Apr-2023
  • (2022)Task-Level Resilience: Checkpointing vs. SupervisionInternational Journal of Networking and Computing10.15803/ijnc.12.1_4712:1(47-72)Online publication date: 2022
  • (2022)Characterizing the Performance of Task Reductions in OpenMP 5.X ImplementationsOpenMP in a Modern World: From Multi-device Support to Meta Programming10.1007/978-3-031-15922-0_3(35-49)Online publication date: 27-Sep-2022

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media