DOI: 10.1145/2967938.2967959
research-article
Public Access

Speculatively Exploiting Cross-Invocation Parallelism

Published: 11 September 2016

Abstract

Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism to overcome this problem. Using static analysis, they partition iterations among threads to avoid cross-thread dependences. However, this approach may fail if dependence pattern information is not available at compile time. To address this limitation, this work proposes SpecCross--the first automatic parallelization technique to exploit cross-invocation parallelism using speculation. With speculation, iterations from different loop invocations can execute concurrently, and the program synchronizes only on misspeculation. This allows SpecCross to adapt to dependence patterns that only manifest on particular inputs at runtime. Evaluation on eight programs shows that SpecCross achieves a geomean speedup of 3.43x over parallel execution without cross-invocation parallelization.
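
The technique the abstract summarizes targets a common program shape: an outer loop that repeatedly invokes an inner parallel loop, with a global barrier after every invocation. The C/OpenMP sketch below is illustrative only, not the paper's SpecCross implementation; the array names, loop bounds, and stencil body are invented for this example (compile with -fopenmp, otherwise the pragmas are simply ignored). It marks where the per-invocation barriers fall and why cross-thread dependences arise only at chunk boundaries, which is what makes speculatively removing the barriers, and synchronizing only on misspeculation, profitable.

    /* Illustrative sketch of the barrier-synchronized pattern that
     * cross-invocation parallelization targets (hypothetical example,
     * not the paper's SpecCross runtime). */
    #include <stdio.h>

    #define STEPS 100   /* number of loop invocations (outer time steps) */
    #define N     1024  /* iterations per invocation */

    static double a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        for (int t = 0; t < STEPS; t++) {
            /* Invocation 1: with static scheduling, a[(i + 1) % N] crosses
             * into a neighboring thread's chunk only at chunk boundaries,
             * so most iterations do not depend on work done by other
             * threads in the previous invocation. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 0.5 * (a[i] + a[(i + 1) % N]);
            /* Implicit barrier: every thread waits here, once per step. */

            /* Invocation 2: each thread reads only the b[i] values it
             * produced above, yet it still waits at both barriers.
             * SpecCross would elide these barriers and recover only on an
             * actual cross-thread dependence violation (misspeculation). */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = b[i];
            /* Implicit barrier again, before the next invocation. */
        }

        printf("a[0] = %f\n", a[0]);
        return 0;
    }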

Cited By

  • (2020) SCAF: A Speculation-Aware Collaborative Dependence Analysis Framework. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 638-654. DOI: 10.1145/3385412.3386028. Online publication date: 11 June 2020.


Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN: 9781450341219
DOI: 10.1145/2967938

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. automatic parallelization
  2. barrier speculation
  3. code generation

Qualifiers

  • Research-article

Conference

PACT '16
Sponsors:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%
Overall Acceptance Rate: 121 of 471 submissions, 26%
