Article

Automatically exploiting cross-invocation parallelism using runtime information

Authors:

David I. August,

Stephen R. Beard,

Nick P. Johnson,

Thomas B. JablinAuthors Info & Claims

CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Pages 1 - 11

https://doi.org/10.1109/CGO.2013.6495001

Published: 23 February 2013 Publication History

Abstract

Automatic parallelization is a promising approach to producing scalable multi-threaded programs for multicore architectures. Many existing automatic techniques only parallelize iterations within a loop invocation and synchronize threads at the end of each loop invocation. When parallel code contains many loop invocations, synchronization can easily become a performance bottleneck. Some automatic techniques address this problem by exploiting cross-invocation parallelism. These techniques use static analysis to partition iterations among threads to avoid crossthread dependences. However, this partitioning is not always achievable at compile-time, because program input determines dependence patterns at run-time. By contrast, this paper proposes DOMORE, the first automatic parallelization technique that uses runtime information to exploit additional cross-invocation parallelism. Instead of partitioning iterations statically, DOMORE dynamically detects crossthread dependences and synchronizes only when necessary. DOMORE consists of a compiler and a runtime library. At compile time, DOMORE automatically parallelizes loops and inserts a custom runtime engine into programs. At run-time, the engine observes dependences and synchronizes iterations only when necessary. For six programs, DOMORE achieves a geomean loop speedup of 2.1x over parallel execution without cross-invocation parallelization and of 3.2 x over sequential execution on eight cores.

References

[1]

R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., 2002.

Digital Library

[2]

M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Compilerassisted dynamic scheduling for effective parallelization of loop nests on multicore processors. In PPOPP, 2009.

Digital Library

[3]

E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: safe multithreaded programming for C/C++. In OOPSLA, 2009.

Digital Library

[4]

C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT, 2008.

Digital Library

[5]

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP, 1995.

Digital Library

[6]

R. Cytron. DOACROSS: Beyond vectorization for multiprocessors. In ICPP, 1986.

[7]

F. H. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test: Speculative parallelization of partially parallel loops. In IPDPS, 2002.

Digital Library

[8]

C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior oriented parallelization. In PLDI, 2007.

Digital Library

[9]

R. Ferrer, A. Duran, X. Martorell, and E. Ayguadé. Unrolling loops containing task parallelism. In LCPC, 2009.

Digital Library

[10]

R. Gupta. The fuzzy barrier: a mechanism for high speed synchronization of processors. In ASPLOS, 1989.

Digital Library

[11]

L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In ISCA, 2004.

Digital Library

[12]

H. Han and C.-W. Tseng. Improving compiler and run-time support for irregular reductions using local writes. In LCPC, 1999.

Digital Library

[13]

M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993.

Digital Library

[14]

T. B. Jablin, Y. Zhang, J. A. Jablin, J. Huang, H. Kim, and D. I. August. Liberty Queues for EPIC Architectures. In EPIC, 2010.

[15]

T. Knight. An architecture for mostly functional languages. In LFP, 1986.

Digital Library

[16]

C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

Digital Library

[17]

LLVM Test Suite Guide. http://llvm.org/docs/TestingGuide.html.

[18]

J. F. Martínez and J. Torrellas. Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In ASPLOS, 2002.

Digital Library

[19]

M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a lowcost software transactional memory. In PLDI, 2009.

Digital Library

[20]

S. Moon, B. So, M. W. Hall, and B. R. Murphy. A case for combining compile-time and run-time parallelization. In LCR, 1998.

Digital Library

[21]

V. Nagarajan and R. Gupta. Speculative optimizations for parallel programs on multicores. In LCPC, 2009.

Digital Library

[22]

R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. Minebench: A benchmark suite for data mining workloads. 2006.

[23]

A. Nicolau, G. Li, A. V. Veidenbaum, and A. Kejariwal. Synchronization optimizations for efficient execution on multicores. In ICS, 2009.

Digital Library

[24]

NAS Parallel Benchmarks 3. http://www.nas.nasa.gov/Resources/Software/npb.html.

[25]

M. F. P. O'Boyle, L. Kervella, and F. Bodin. Synchronization minimization in a SPMD execution model. J. Parallel Distrib. Comput., 29, September 1995.

Digital Library

[26]

G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO, 2005.

Digital Library

[27]

R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In SC, 1993.

Digital Library

[28]

P. Prabhu, S. Ghosh, Y. Zhang, N. P. Johnson, and D. I. August. Commutative set: A language extension for implicit parallel programming. In PLDI, 2011.

Digital Library

[29]

R. Rajwar and J. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In MICRO, 2001.

Digital Library

[30]

L. Rauchwerger, N. M. Amato, and D. A. Padua. A scalable method for run-time loop parallelization. International Journal of Parallel Programming (IJPP), 26:537-576, 1995.

Digital Library

[31]

L. Rauchwerger and D. A. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE TPDS, 1999.

Digital Library

[32]

J. Saltz, R. Mirchandaney, and R. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40, 1991.

Digital Library

[33]

N. Shavit and D. Touitou. Software transactional memory. In PODC, 1995.

Digital Library

[34]

Standard Performance Evaluation Corporation. http://www.spec.org.

[35]

C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In PPoPP, 1995.

Digital Library

[36]

M. Weiser. Program slicing. In ICSE, 1981.

Digital Library

[37]

M. J. Wolfe. Optimizing Compilers for Supercomputers. PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, October 1982.

Digital Library

Cited By

Mohammadi MYuki TCheshmi KDavis EHall MDehnavi MNandy POlschanowsky CVenkat AStrout MMcKinley KFisher K(2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314646
Ain QAhmed SZafar AMehmood MWaheed AChen KZhao WMa Y(2018)Analysis of hotspot methods in JVM for best-effort run-time parallelizationProceedings of the 9th International Conference on E-Education, E-Business, E-Management and E-Learning10.1145/3183586.3183607(60-65)Online publication date: 11-Jan-2018
https://dl.acm.org/doi/10.1145/3183586.3183607
Atre RJannesari AWolf FScheideler CHajiaghayi M(2017)Brief AnnouncementProceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3087556.3087592(363-365)Online publication date: 24-Jul-2017
https://dl.acm.org/doi/10.1145/3087556.3087592
Show More Cited By

Index Terms

Automatically exploiting cross-invocation parallelism using runtime information
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language features
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Recommendations

Speculatively Exploiting Cross-Invocation Parallelism
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For ...
Runtime loop optimizations for locality and parallelism
Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

February 2013

366 pages

ISBN:9781467355247

Sponsors

Publisher

IEEE Computer Society

United States

Publication History

Published: 23 February 2013

Check for updates

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
53
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 21 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Mohammadi MYuki TCheshmi KDavis EHall MDehnavi MNandy POlschanowsky CVenkat AStrout MMcKinley KFisher K(2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314646
Ain QAhmed SZafar AMehmood MWaheed AChen KZhao WMa Y(2018)Analysis of hotspot methods in JVM for best-effort run-time parallelizationProceedings of the 9th International Conference on E-Education, E-Business, E-Management and E-Learning10.1145/3183586.3183607(60-65)Online publication date: 11-Jan-2018
https://dl.acm.org/doi/10.1145/3183586.3183607
Atre RJannesari AWolf FScheideler CHajiaghayi M(2017)Brief AnnouncementProceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3087556.3087592(363-365)Online publication date: 24-Jul-2017
https://dl.acm.org/doi/10.1145/3087556.3087592
Mahjoub SVojoudi H(2016)The UTFLAThe Journal of Supercomputing10.1007/s11227-016-1725-872:6(2221-2234)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1007/s11227-016-1725-8
Streit KDoerfert JHammacher CZeller AHack S(2015)Generalized Task ParallelismACM Transactions on Architecture and Code Optimization10.1145/272316412:1(1-25)Online publication date: 2-Apr-2015
https://dl.acm.org/doi/10.1145/2723164
Huda ZJannesari AWolf F(2015)Using Template Matching to Infer Parallel Design PatternsACM Transactions on Architecture and Code Optimization10.1145/268890511:4(1-21)Online publication date: 9-Jan-2015
https://dl.acm.org/doi/10.1145/2688905
Ravishankar MEisenlohr JPouchet LRamanujam JRountev ASadayappan P(2014)Automatic parallelization of a class of irregular loops for distributed memory systemsACM Transactions on Parallel Computing10.1145/26602511:1(1-37)Online publication date: 3-Oct-2014
https://dl.acm.org/doi/10.1145/2660251

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents