Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1109/CGO.2013.6495001acmconferencesArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Automatically exploiting cross-invocation parallelism using runtime information

Published: 23 February 2013 Publication History

Abstract

Automatic parallelization is a promising approach to producing scalable multi-threaded programs for multicore architectures. Many existing automatic techniques only parallelize iterations within a loop invocation and synchronize threads at the end of each loop invocation. When parallel code contains many loop invocations, synchronization can easily become a performance bottleneck. Some automatic techniques address this problem by exploiting cross-invocation parallelism. These techniques use static analysis to partition iterations among threads to avoid crossthread dependences. However, this partitioning is not always achievable at compile-time, because program input determines dependence patterns at run-time. By contrast, this paper proposes DOMORE, the first automatic parallelization technique that uses runtime information to exploit additional cross-invocation parallelism. Instead of partitioning iterations statically, DOMORE dynamically detects crossthread dependences and synchronizes only when necessary. DOMORE consists of a compiler and a runtime library. At compile time, DOMORE automatically parallelizes loops and inserts a custom runtime engine into programs. At run-time, the engine observes dependences and synchronizes iterations only when necessary. For six programs, DOMORE achieves a geomean loop speedup of 2.1x over parallel execution without cross-invocation parallelization and of 3.2 x over sequential execution on eight cores.

References

[1]
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., 2002.
[2]
M. M. Baskaran, N. Vydyanathan, U. K. R. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Compilerassisted dynamic scheduling for effective parallelization of loop nests on multicore processors. In PPOPP, 2009.
[3]
E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: safe multithreaded programming for C/C++. In OOPSLA, 2009.
[4]
C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT, 2008.
[5]
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP, 1995.
[6]
R. Cytron. DOACROSS: Beyond vectorization for multiprocessors. In ICPP, 1986.
[7]
F. H. Dang, H. Yu, and L. Rauchwerger. The R-LRPD test: Speculative parallelization of partially parallel loops. In IPDPS, 2002.
[8]
C. Ding, X. Shen, K. Kelsey, C. Tice, R. Huang, and C. Zhang. Software behavior oriented parallelization. In PLDI, 2007.
[9]
R. Ferrer, A. Duran, X. Martorell, and E. Ayguadé. Unrolling loops containing task parallelism. In LCPC, 2009.
[10]
R. Gupta. The fuzzy barrier: a mechanism for high speed synchronization of processors. In ASPLOS, 1989.
[11]
L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In ISCA, 2004.
[12]
H. Han and C.-W. Tseng. Improving compiler and run-time support for irregular reductions using local writes. In LCPC, 1999.
[13]
M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993.
[14]
T. B. Jablin, Y. Zhang, J. A. Jablin, J. Huang, H. Kim, and D. I. August. Liberty Queues for EPIC Architectures. In EPIC, 2010.
[15]
T. Knight. An architecture for mostly functional languages. In LFP, 1986.
[16]
C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
[17]
LLVM Test Suite Guide. http://llvm.org/docs/TestingGuide.html.
[18]
J. F. Martínez and J. Torrellas. Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In ASPLOS, 2002.
[19]
M. Mehrara, J. Hao, P.-C. Hsu, and S. Mahlke. Parallelizing sequential applications on commodity hardware using a lowcost software transactional memory. In PLDI, 2009.
[20]
S. Moon, B. So, M. W. Hall, and B. R. Murphy. A case for combining compile-time and run-time parallelization. In LCR, 1998.
[21]
V. Nagarajan and R. Gupta. Speculative optimizations for parallel programs on multicores. In LCPC, 2009.
[22]
R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. Minebench: A benchmark suite for data mining workloads. 2006.
[23]
A. Nicolau, G. Li, A. V. Veidenbaum, and A. Kejariwal. Synchronization optimizations for efficient execution on multicores. In ICS, 2009.
[24]
NAS Parallel Benchmarks 3. http://www.nas.nasa.gov/Resources/Software/npb.html.
[25]
M. F. P. O'Boyle, L. Kervella, and F. Bodin. Synchronization minimization in a SPMD execution model. J. Parallel Distrib. Comput., 29, September 1995.
[26]
G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO, 2005.
[27]
R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In SC, 1993.
[28]
P. Prabhu, S. Ghosh, Y. Zhang, N. P. Johnson, and D. I. August. Commutative set: A language extension for implicit parallel programming. In PLDI, 2011.
[29]
R. Rajwar and J. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In MICRO, 2001.
[30]
L. Rauchwerger, N. M. Amato, and D. A. Padua. A scalable method for run-time loop parallelization. International Journal of Parallel Programming (IJPP), 26:537-576, 1995.
[31]
L. Rauchwerger and D. A. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE TPDS, 1999.
[32]
J. Saltz, R. Mirchandaney, and R. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40, 1991.
[33]
N. Shavit and D. Touitou. Software transactional memory. In PODC, 1995.
[34]
Standard Performance Evaluation Corporation. http://www.spec.org.
[35]
C.-W. Tseng. Compiler optimizations for eliminating barrier synchronization. In PPoPP, 1995.
[36]
M. Weiser. Program slicing. In ICSE, 1981.
[37]
M. J. Wolfe. Optimizing Compilers for Supercomputers. PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, October 1982.

Cited By

View all
  • (2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
  • (2018)Analysis of hotspot methods in JVM for best-effort run-time parallelizationProceedings of the 9th International Conference on E-Education, E-Business, E-Management and E-Learning10.1145/3183586.3183607(60-65)Online publication date: 11-Jan-2018
  • (2017)Brief AnnouncementProceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3087556.3087592(363-365)Online publication date: 24-Jul-2017
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
February 2013
366 pages
ISBN:9781467355247

Sponsors

Publisher

IEEE Computer Society

United States

Publication History

Published: 23 February 2013

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Sparse computation data dependence simplification for efficient compiler-generated inspectorsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314646(594-609)Online publication date: 8-Jun-2019
  • (2018)Analysis of hotspot methods in JVM for best-effort run-time parallelizationProceedings of the 9th International Conference on E-Education, E-Business, E-Management and E-Learning10.1145/3183586.3183607(60-65)Online publication date: 11-Jan-2018
  • (2017)Brief AnnouncementProceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3087556.3087592(363-365)Online publication date: 24-Jul-2017
  • (2016)The UTFLAThe Journal of Supercomputing10.1007/s11227-016-1725-872:6(2221-2234)Online publication date: 1-Jun-2016
  • (2015)Generalized Task ParallelismACM Transactions on Architecture and Code Optimization10.1145/272316412:1(1-25)Online publication date: 2-Apr-2015
  • (2015)Using Template Matching to Infer Parallel Design PatternsACM Transactions on Architecture and Code Optimization10.1145/268890511:4(1-21)Online publication date: 9-Jan-2015
  • (2014)Automatic parallelization of a class of irregular loops for distributed memory systemsACM Transactions on Parallel Computing10.1145/26602511:1(1-37)Online publication date: 3-Oct-2014

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media