
DOI: 10.1145/1854273.1854298
Research Article · PACT Conference Proceedings

Reducing task creation and termination overhead in explicitly parallel programs

Published: 11 September 2010

Abstract

There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .Net Task Parallel Library, and OpenMP 3.0, and current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). It is desirable for programmers to express all the parallelism intrinsic to their algorithm for forward scalability and portability, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total overhead a program incurs through excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall and next statements. Our approach includes elimination of redundant task creation and termination operations, as well as strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP, and a quad-socket 32-way Power7 SMP. The results showed maximum speedups of 66.7×, 11.25× and 23.1× on the three platforms respectively, and geometric-mean performance improvements of 4.6×, 2.1× and 6.4× relative to unoptimized parallel codes. The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained parallel benchmarks studied here, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations introduced in this paper.
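The strength reduction the abstract describes, replacing repeated task termination (finish) between phases with a lightweight barrier (next), can be sketched in plain Java. This is an illustrative analogue of the idea, not the paper's HJ implementation: the class name `FinishToNext` and all constants are invented, and `CyclicBarrier` stands in for HJ's next. A naive translation would create and join a fresh set of tasks for every phase; the transformed form keeps one persistent team of workers and separates phases with a barrier.

```java
import java.util.concurrent.CyclicBarrier;

// Hypothetical sketch: one persistent worker team synchronized by a barrier
// between phases, instead of per-phase task creation + finish.
public class FinishToNext {
    static final int WORKERS = 4, PHASES = 3;
    static final int[] data = new int[WORKERS];

    public static void main(String[] args) throws Exception {
        // The shared barrier plays the role of "next": workers synchronize
        // between phases without being torn down and respawned.
        CyclicBarrier next = new CyclicBarrier(WORKERS);
        Thread[] team = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            team[w] = new Thread(() -> {
                try {
                    for (int p = 0; p < PHASES; p++) {
                        data[id] += 1;   // this worker's share of the phase
                        next.await();    // barrier instead of termination + re-creation
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            team[w].start();
        }
        // A single outer join plays the role of the one remaining "finish".
        for (Thread t : team) t.join();
        int sum = 0;
        for (int v : data) sum += v;
        System.out.println(sum); // prints 12 (4 workers x 3 phases)
    }
}
```

The payoff mirrors the paper's motivation: PHASES barrier waits on an existing team are far cheaper than PHASES rounds of thread creation and join.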




Published In

PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
September 2010, 596 pages
ISBN: 9781450301787
DOI: 10.1145/1854273

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. barriers
    2. ideal parallelism
    3. optimization
    4. redundant tasks
    5. useful parallelism


Conference

PACT '10
Sponsors: IFIP WG 10.3, IEEE CS TCPP, SIGARCH, IEEE CS TCAA

Acceptance Rates

Overall acceptance rate: 121 of 471 submissions (26%)


    Cited By

    View all
    • (2024)When Is Parallelism Fearless and Zero-Cost with Rust?Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3626183.3659966(27-40)Online publication date: 17-Jun-2024
    • (2019)Analysis and Optimization of Task Granularity on the Java Virtual MachineACM Transactions on Programming Languages and Systems10.1145/333849741:3(1-47)Online publication date: 16-Jul-2019
    • (2019)PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307350(2-16)Online publication date: 16-Feb-2019
    • (2018)A Preliminary Study of Compiler Transformations for Graph Applications on the Emu SystemProceedings of the Workshop on Memory Centric High Performance Computing10.1145/3286475.3286481(37-44)Online publication date: 11-Nov-2018
    • (2018)Scheduling Parallel Computations by Work StealingInternational Journal of Parallel Programming10.1007/s10766-016-0484-846:2(173-197)Online publication date: 1-Apr-2018
    • (2017)Lifting Barriers Using Parallel Polyhedral Regions2017 IEEE 24th International Conference on High Performance Computing (HiPC)10.1109/HiPC.2017.00046(338-347)Online publication date: Dec-2017
    • (2016)Speculatively Exploiting Cross-Invocation ParallelismProceedings of the 2016 International Conference on Parallel Architectures and Compilation10.1145/2967938.2967959(207-221)Online publication date: 11-Sep-2016
    • (2015)Revisiting loop transformations with x10 clocksProceedings of the ACM SIGPLAN Workshop on X1010.1145/2771774.2771778(1-6)Online publication date: 14-Jun-2015
    • (2014)Compiler techniques for massively scalable implicit task parallelismProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC.2014.30(299-310)Online publication date: 16-Nov-2014
    • (2014)Improving the Performance of X10 Programs by Clock RemovalCompiler Construction10.1007/978-3-642-54807-9_7(113-132)Online publication date: 2014
    • Show More Cited By
