research-article

Look-ahead SLP: auto-vectorization in the presence of commutative operations

Authors:

Vasileios Porpodas,

Rodrigo C. O. Rocha,

Luís F. W. GóesAuthors Info & Claims

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 163 - 174

https://doi.org/10.1145/3168807

Published: 24 February 2018 Publication History

Abstract

Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.

References

[1]

[n. d.]. DragonEgg: Using LLVM as a GCC backend. http://dragonegg.llvm.org. ([n. d.]).

[2]

John R Allen and Ken Kennedy. 1982. PFC: A program to convert Fortran to parallel form. Technical Report 82-6. Department of Mathematical Sciences, Rice University.

[3]

John Randy Allen and Ken Kennedy. 1987. Automatic Translation of Fortran Programs to Vector Form. Tranactions on Programming Languages and Systems (TOPLAS) 9, 4 (1987).

Digital Library

[4]

Olaf Bachmann, Paul S Wang, and Eugene V Zima. 1994. Chains of recurrences: Amethod to expedite the evaluation of closed-form functions. In Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC).

Digital Library

[5]

R. Barik, Jisheng Zhao, and V. Sarkar. 2010. Efficient Selection of Vector Instructions Using Dynamic Programming. In Proceedings of the International Symposium on Microarchitecture (MICRO).

Digital Library

[6]

Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE computational science and engineering 5, 1 (1998), 46-55.

Digital Library

[7]

James Davies, Christopher Huson, Thomas Macke, Bruce Leasure, and Michael Wolfe. 1986. The KAP/S-1- An advanced source-to-source vectorizer for the S-1 Mark IIa supercomputer. In Proceedings of the International Conference on Parallel Processing.

[8]

Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. 2004. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[9]

Free Software Foundation. 2015. GCC: GNU Compiler Collection. http://gcc.gnu.org. (2015).

[10]

Intel Corporation. 2007. IA-32 Architectures Optimization Reference Manual. (2007).

[11]

Christoforos E Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, R. Thomas, N. Treuhaft, and K. Yelick. 1997. Scalable processors in the billion-transistor era: IRAM. Computer 30, 9 (1997).

Digital Library

[12]

D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. 1981. Dependence Graphs and Compiler Optimizations. In Proceedings of the Symposium on Principles of Programming Languages.

Digital Library

[13]

S. Larsen and S. Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[14]

C. Lattner and V. Adve. 2004. LLVM: A compilation framework for lifelong program analysis transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).

Digital Library

[15]

Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (2008).

Digital Library

[16]

J. Liu, Y. Zhang, O. Jang, W. Ding, and M. Kandemir. 2012. A compiler framework for extracting superword level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[17]

Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of Interleaved Data for SIMD. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[18]

Dorit Nuzman and Ayal Zaks. 2008. Outer-loop vectorization: revisited for short SIMD architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT).

Digital Library

[19]

Wilfried Oed. 1992. Cray Y-MP C90: System features and early benchmark results. Parallel Comput. 18, 8 (1992).

Digital Library

[20]

Y. Park, S. Seo, H. Park, H.K. Cho, and S. Mahlke. 2012. SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

Digital Library

[21]

Vasileios Porpodas. 2017. SuperGraph-SLP Auto-Vectorization. In 2017 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 330-342.

[22]

Vasileios Porpodas and Timothy M Jones. 2015. Throttling automatic vectorization: When less is more. In 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 432-444.

Digital Library

[23]

Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP Automatic Vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).

[24]

Gang Ren, Peng Wu, and David Padua. 2006. Optimizing Data Permutations for SIMD Devices. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).

Digital Library

[25]

I. Rosen, D. Nuzman, and A. Zaks. 2007. Loop-aware SLP in GCC. In GCC Developers Summit.

[26]

Richard M Russell. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (1978).

Digital Library

[27]

J. Shin, M. Hall, and J. Chame. 2005. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).

Digital Library

[28]

SPEC. 2014. Standard Performance Evaluation Corp Benchmarks. http://www.spec.org. (2014).

[29]

Michael Wolfe. 1988. Vector optimization vs. vectorization. In Supercomputing. Springer.

Digital Library

[30]

Michael Joseph Wolfe. 1995. High Performance Compilers for Parallel Computing. Addison-Wesley.

Digital Library

[31]

Hao Zhou and Jingling Xue. 2016. A compiler approach for exploiting partial SIMD parallelism. ACM Transactions on Architecture and Code Optimization (TACO) (2016).

Digital Library

[32]

Hao Zhou and Jingling Xue. 2016. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 59-69.

Digital Library

Cited By

Tayeb HPaillat LBramas B(2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 15-Dec-2023
https://dl.acm.org/doi/10.1145/3631709
Feng JHe YTao QMa H(2022)An SLP Vectorization Method Based on Equivalent Extended TransformationWireless Communications & Mobile Computing10.1155/2022/18325222022Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/1832522
Rocha RPetoumenos PFranke BBhatotia PO'Boyle MLee J(2022)Loop rolling for code size reductionProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741256(217-229)Online publication date: 2-Apr-2022
https://dl.acm.org/doi/10.1109/CGO53902.2022.9741256
Show More Cited By

Index Terms

Look-ahead SLP: auto-vectorization in the presence of commutative operations
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

All you need is superword-level parallelism: systematic control-flow vectorization with SLP
PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Superword-level parallelism (SLP) vectorization is a proven technique for vectorizing straight-line code. It works by replacing independent, isomorphic instructions with equivalent vector instructions. Larsen and Amarasinghe originally proposed using ...
Vectorization-aware loop unrolling with seed forwarding
CC 2020: Proceedings of the 29th International Conference on Compiler Construction

Loop unrolling is a widely adopted loop transformation, commonly used for enabling subsequent optimizations. Straight-line-code vectorization (SLP) is an optimization that benefits from unrolling. SLP converts isomorphic instruction sequences into ...
VW-SLP: auto-vectorization with adaptive vector width
PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Auto-vectorization techniques allow the compiler to automatically generate SIMD vector code out of scalar code. SLP is a commonly-used algorithm for converting straight-line code into vector code, which complements the loop-based traditional ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs:
Jens Knoop
Vienna University of Technology, Austria
,
Markus Schordan
Lawrence Livermore National Laboratory, USA
,
Program Chairs:
Teresa Johnson
Google, USA
,
Michael O'Boyle
University of Edinburgh, UK

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CGO '18

Sponsor:

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

13
Total Citations
View Citations
408
Total Downloads

Downloads (Last 12 months)35
Downloads (Last 6 weeks)3

Reflects downloads up to 21 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Tayeb HPaillat LBramas B(2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 15-Dec-2023
https://dl.acm.org/doi/10.1145/3631709
Feng JHe YTao QMa H(2022)An SLP Vectorization Method Based on Equivalent Extended TransformationWireless Communications & Mobile Computing10.1155/2022/18325222022Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/1832522
Rocha RPetoumenos PFranke BBhatotia PO'Boyle MLee J(2022)Loop rolling for code size reductionProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741256(217-229)Online publication date: 2-Apr-2022
https://dl.acm.org/doi/10.1109/CGO53902.2022.9741256
Rocha RPetoumenos PWang ZCole MHazelwood KLeather HHenkel JLiu X(2021)HyFM: function merging for freeProceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3461648.3463852(110-121)Online publication date: 22-Jun-2021
https://dl.acm.org/doi/10.1145/3461648.3463852
Porpodas VRatnalikar P(2021)PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized CodeLanguages and Compilers for Parallel Computing10.1007/978-3-030-72789-5_2(15-31)Online publication date: 26-Mar-2021
https://doi.org/10.1007/978-3-030-72789-5_2
Rocha RPetoumenos PWang ZCole MLeather HDonaldson ATorlak E(2020)Effective function merging in the SSA formProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3385412.3386030(854-868)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3385412.3386030
Rocha RPorpodas VPetoumenos PGóes LWang ZCole MLeather HPouchet LJimborean A(2020)Vectorization-aware loop unrolling with seed forwardingProceedings of the 29th International Conference on Compiler Construction10.1145/3377555.3377890(1-13)Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3377555.3377890
Porpodas VRocha RBrevnov EGóes LMattson TKandemir MJimborean AMoseley T(2019)Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elementsProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314897(206-216)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314897
Rocha RPetoumenos PWang ZCole MLeather HKandemir MJimborean AMoseley T(2019)Function merging by sequence alignmentProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314892(149-163)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314892
Porpodas VRocha RBrevnov EGoes LMattson T(2019)Super-Node SLP: Optimized Vectorization for Code Sequences Containing Operators and Their Inverse Elements2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2019.8661192(206-216)Online publication date: Feb-2019
https://doi.org/10.1109/CGO.2019.8661192
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents