research-article

Open access

Generalized Task Parallelism

Authors:

Johannes Doerfert,

Clemens Hammacher,

Andreas Zeller,

Sebastian HackAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 1

Article No.: 8, Pages 1 - 25

https://doi.org/10.1145/2723164

Published: 02 April 2015 Publication History

Abstract

Existing approaches to automatic parallelization produce good results in specific domains. Yet, it is unclear how to integrate their individual strengths to match the demands and opportunities of complex software. This lack of integration has both practical reasons, as integrating those largely differing approaches into one compiler would impose an engineering hell, as well as theoretical reasons, as no joint cost model exists that would drive the choice between parallelization methods.

By reducing the problem of generating parallel code from a program dependence graph to integer linear programming, <i>generalized task parallelization</i> integrates central aspects of existing parallelization approaches into a single unified framework. Implemented on top of LLVM, the framework seamlessly integrates enabling technologies such as speculation, privatization, and the realization of reductions.

Evaluating our implementation on various C programs from different domains, we demonstrate the effectiveness and generality of generalized task parallelization. On a quad-core machine with hyperthreading we achieve speedups of up to 4.6 ×.

Supplementary Material

TACO1201-08 (taco1201-08.pdf)

Slide deck associated with this paper

Download
2.76 MB

References

[1]

Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08). ACM, New York, NY, 72--81.

Digital Library

[2]

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. 1995. Cilk: An efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (Aug. 1995), 207--216.

Digital Library

[3]

Michael Burke, Ron Cytron, Jeanne Ferrante, and Wilson Hsieh. 1989. Automatic generation of nested, fork-join parallelism. J. Supercomput. 3, 2 (1989), 71--88.

[4]

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks. 2014. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. SIGARCH Comput. Archit. News 42, 3 (June 2014), 217--228.

Digital Library

[5]

Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. 2012. HELIX: Automatic parallelization of irregular programs for chip multiprocessing. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 84--93.

Digital Library

[6]

Michael K. Chen and Kunle Olukotun. 2003. The jrpm system for dynamically parallelizing java programs. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03). ACM Press, New York, NY, 434--446.

Digital Library

[7]

Daniel Cordes, Peter Marwedel, and Arindam Mallik. 2010. Automatic parallelization of embedded software using hierarchical task graphs and integer linear programming. In Proceedings of the 8th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS’10). ACM, New York, NY, 267--276.

Digital Library

[8]

Matthew DeVuyst, Dean M. Tullsen, and Seon Wook Kim. 2011. Runtime parallelization of legacy code on a transactional memory system. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC’11). ACM, New York, NY, 127--136.

Digital Library

[9]

Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, and Eduard Ayguade. 2009. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing (ICPP’09). IEEE Computer Society, Washington, DC, 124--131.

Digital Library

[10]

Tobias Edler von Koch and Björn Franke. 2014. Variability of data dependences and control flow. In Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14). 180--189.

[11]

Yong hun Eom, Stephen Yang, James C. Jenista, and Brian Demsky. 2012. DOJ: Dynamically parallelizing object-oriented programs. SIGPLAN Not. 47, 8 (Feb. 2012), 85--96.

Digital Library

[12]

Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (July 1987), 319--349.

Digital Library

[13]

Samuel Z. Guyer and Calvin Lin. 2005. Error checking with client-driven pointer analysis. Sci. Comput. Prog. 58, 1--2 (Oct. 2005), 83--114.

Digital Library

[14]

Ben Hardekopf and Calvin Lin. 2011. Flow-sensitive pointer analysis for millions of lines of code. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE Computer Society, Washington, DC, 289--298.

Digital Library

[15]

Ben Hertzberg and Kunle Olukotun. 2011. Runtime automatic speculative parallelization. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’11). IEEE Computer Society, Washington, DC, 64--73.

Digital Library

[16]

Jialu Huang, Stephen R. Beard, Nick P. Johnson, Thomas B. Jablin, and David I. August. 2013. Automatically exploiting cross-invocation parallelism using runtime information. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE Computer Society, Washington, DC, USA, 1--11.

Digital Library

[17]

Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, and David I. August. 2010. Decoupled software pipelining creates parallelization opportunities. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’10). ACM, New York, NY, 121--130.

Digital Library

[18]

James Christopher Jenista, Yong hun Eom, and Brian Charles Demsky. 2011. OoOJava: Software out-of-order execution. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’11). ACM, New York, NY, 57--68.

Digital Library

[19]

Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August. 2012. Speculative separation for privatization and reductions. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). ACM, New York, NY, 359--370.

Digital Library

[20]

Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. 2007. Speculative thread decomposition through empirical optimization. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’07). ACM, New York, NY, 205--214.

Digital Library

[21]

Thomas Karcher and Victor Pankratius. 2011. Run-Time automatic performance tuning for multicore applications. In Euro-Par 2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Lecture Notes in Computer Science, Vol. 6852. Springer, Berlin, 3--14.

Digital Library

[22]

Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, CA.

Digital Library

[23]

Hanjun Kim, Nick P. Johnson, Jae W. Lee, Scott A. Mahlke, and David I. August. 2012. Automatic speculative DOALL for clusters. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, NY, 94--103.

Digital Library

[24]

Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. 2007. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’07). ACM, New York, NY, 211--222.

Digital Library

[25]

Chris Lattner, Andrew Lenharth, and Vikram Adve. 2007. Making context-sensitive points-to analysis with heap cloning practical for the real world. SIGPLAN Not. 42, 6 (June 2007), 278--289.

Digital Library

[26]

Mojtaba Mehrara, Jeff Hao, Po-Chun Hsu, and Scott Mahlke. 2009. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. SIGPLAN Not. 44, 6 (June 2009), 166--176.

Digital Library

[27]

Samuel P. Midkiff. 2012. Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Morgan & Claypool Publishers.

Digital Library

[28]

Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. 2005. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 38). IEEE Computer Society, Washington, DC, 105--118.

Digital Library

[29]

Louis-Noël Pouchet. 2012. PolyBench/C: The Polyhedral Benchmark Suite. Retrieved from http://www.cs.ucla.edu/&sim;pouchet/software/polybench/.

[30]

Arun Raman, Ayal Zaks, Jae W. Lee, and David I. August. 2012. Parcae: A system for flexible parallel execution. SIGPLAN Not. 47, 6 (June 2012), 133--144.

Digital Library

[31]

Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. 2008. Parallel-stage decoupled software pipelining. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’08). ACM, New York, NY, 114--123.

Digital Library

[32]

Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. 2004. Decoupled software pipelining with the synchronization array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT’04). IEEE Computer Society, Washington, DC, 177--188.

Digital Library

[33]

Lawrence Rauchwerger and David Padua. 1995. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (PLDI’95). ACM, New York, NY, 218--232.

Digital Library

[34]

Radu Rugina and Martin Rinard. 1999. Automatic parallelization of divide and conquer algorithms. SIGPLAN Not. 34, 8 (May 1999), 72--83.

Digital Library

[35]

Silvius Rus, Lawrence Rauchwerger, and Jay Hoeflinger. 2003. Hybrid Analysis: Static and dynamic memory reference analysis. Int. J. Parallel Program. 31, 4 (Aug. 2003), 251--283.

Digital Library

[36]

Vivek Sarkar. 1991. Automatic partitioning of a program dependence graph into parallel tasks. IBM J. Res. Dev. 35, 5--6 (Sept. 1991), 779--804.

Digital Library

[37]

Kevin Streit, Clemens Hammacher, Andreas Zeller, and Sebastian Hack. 2013. Sambamba: Runtime adaptive parallel execution. In Proceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems (ADAPT’13). ACM, New York, NY, Article 7, 6 pages.

Digital Library

[38]

Sid Ahmed Ali Touati, Julien Worms, and Sébastien Briais. 2013. The Speedup-Test: A statistical methodology for programme speedup analysis and computation. Concurr. Comp.: Pract. Exper. 25, 10 (2013), 1410--1426.

[39]

Peng Tu and David A. Padua. 1994. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag, London, UK, 500--521. http://dl.acm.org/citation.cfm&quest;id=645671.665384

Digital Library

[40]

Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August. 2007. Speculative decoupled software pipelining. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT’07). IEEE Computer Society, Washington, DC, 49--59.

Digital Library

[41]

Hans Vandierendonck, Sean Rul, and Koen De Bosschere. 2010. The paralax infrastructure: Automatic parallelization with a helping hand. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). ACM, New York, NY, 389--400.

Digital Library

[42]

Hongtao Zhong, M. Mehrara, S. Lieberman, and S. Mahlke. 2008. Uncovering hidden loop level parallelism in sequential applications. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). 290--301.

Cited By

Lee WLiu YLiu PMa SChoi HZhang XGupta RKandemir MJimborean AMoseley T(2019)White-box program tuningProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314889(122-135)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314889
Mohammadi MYuki TCheshmi KDavis EHall MDehnavi MNandy POlschanowsky CVenkat AStrout MMcKinley KFisher K(2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314646
Lee WLiu YLiu PMa SChoi HZhang XGupta R(2019)White-Box Program Tuning2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2019.8661177(122-135)Online publication date: Feb-2019
https://doi.org/10.1109/CGO.2019.8661177
Show More Cited By

Index Terms

Generalized Task Parallelism
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Retargetable compilers
      2. Source code generation
    2. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

Exploiting task and data parallelism on a multicomputer
PPOPP '93: Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming

For many applications, achieving good performance on a private memory parallel computer requires exploiting data parallelism as well as task parallelism. Depending on the size of the input data set and the number of nodes (i.e., processors), different ...
A comparison of automatic parallelization tools/compilers on the SGI origin 2000
SC '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing

Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is time consuming and costly, porting codes would ideally be automated by using some parallelization ...
Automatic Extraction of Task-Level Parallelism for Heterogeneous MPSoCs
ICPP '13: Proceedings of the 2013 42nd International Conference on Parallel Processing

Heterogeneous multi-core platforms are increasingly attractive for embedded applications due to their adaptability and efficiency. This proliferation of heterogeneity demands new approaches for extracting thread level parallelism from sequential ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 1

April 2015

201 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2744295

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2015

Accepted: 01 January 2015

Revised: 01 December 2014

Received: 01 June 2014

Published in TACO Volume 12, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
816
Total Downloads

Downloads (Last 12 months)88
Downloads (Last 6 weeks)15

Reflects downloads up to 16 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Lee WLiu YLiu PMa SChoi HZhang XGupta RKandemir MJimborean AMoseley T(2019)White-box program tuningProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314889(122-135)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314889
Mohammadi MYuki TCheshmi KDavis EHall MDehnavi MNandy POlschanowsky CVenkat AStrout MMcKinley KFisher K(2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314646
Lee WLiu YLiu PMa SChoi HZhang XGupta R(2019)White-Box Program Tuning2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2019.8661177(122-135)Online publication date: Feb-2019
https://doi.org/10.1109/CGO.2019.8661177
Jannesari A(2018)Advances in Engineering Software for Multicore SystemsDependability Engineering10.5772/intechopen.72784Online publication date: 6-Jun-2018
https://doi.org/10.5772/intechopen.72784
Hammacher CStreit KZeller AHack SZaks AHermenegildo M(2016)Thread-level speculation with kernel supportProceedings of the 25th International Conference on Compiler Construction10.1145/2892208.2892221(1-11)Online publication date: 17-Mar-2016
https://dl.acm.org/doi/10.1145/2892208.2892221
Huda ZAtre RJannesari AWolf F(2016)Automatic Parallel Pattern Detection in the Algorithm Structure Design Space2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2016.60(43-52)Online publication date: May-2016
https://doi.org/10.1109/IPDPS.2016.60

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents