Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3168807acmconferencesArticle/Chapter ViewAbstractPublication PagescgoConference Proceedingsconference-collections
research-article

Look-ahead SLP: auto-vectorization in the presence of commutative operations

Published: 24 February 2018 Publication History

Abstract

Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.

References

[1]
[n. d.]. DragonEgg: Using LLVM as a GCC backend. http://dragonegg.llvm.org. ([n. d.]).
[2]
John R Allen and Ken Kennedy. 1982. PFC: A program to convert Fortran to parallel form. Technical Report 82-6. Department of Mathematical Sciences, Rice University.
[3]
John Randy Allen and Ken Kennedy. 1987. Automatic Translation of Fortran Programs to Vector Form. Tranactions on Programming Languages and Systems (TOPLAS) 9, 4 (1987).
[4]
Olaf Bachmann, Paul S Wang, and Eugene V Zima. 1994. Chains of recurrences: Amethod to expedite the evaluation of closed-form functions. In Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC).
[5]
R. Barik, Jisheng Zhao, and V. Sarkar. 2010. Efficient Selection of Vector Instructions Using Dynamic Programming. In Proceedings of the International Symposium on Microarchitecture (MICRO).
[6]
Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE computational science and engineering 5, 1 (1998), 46-55.
[7]
James Davies, Christopher Huson, Thomas Macke, Bruce Leasure, and Michael Wolfe. 1986. The KAP/S-1- An advanced source-to-source vectorizer for the S-1 Mark IIa supercomputer. In Proceedings of the International Conference on Parallel Processing.
[8]
Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. 2004. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
[9]
Free Software Foundation. 2015. GCC: GNU Compiler Collection. http://gcc.gnu.org. (2015).
[10]
Intel Corporation. 2007. IA-32 Architectures Optimization Reference Manual. (2007).
[11]
Christoforos E Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, R. Thomas, N. Treuhaft, and K. Yelick. 1997. Scalable processors in the billion-transistor era: IRAM. Computer 30, 9 (1997).
[12]
D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. 1981. Dependence Graphs and Compiler Optimizations. In Proceedings of the Symposium on Principles of Programming Languages.
[13]
S. Larsen and S. Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
[14]
C. Lattner and V. Adve. 2004. LLVM: A compilation framework for lifelong program analysis transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).
[15]
Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (2008).
[16]
J. Liu, Y. Zhang, O. Jang, W. Ding, and M. Kandemir. 2012. A compiler framework for extracting superword level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
[17]
Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of Interleaved Data for SIMD. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
[18]
Dorit Nuzman and Ayal Zaks. 2008. Outer-loop vectorization: revisited for short SIMD architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT).
[19]
Wilfried Oed. 1992. Cray Y-MP C90: System features and early benchmark results. Parallel Comput. 18, 8 (1992).
[20]
Y. Park, S. Seo, H. Park, H.K. Cho, and S. Mahlke. 2012. SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[21]
Vasileios Porpodas. 2017. SuperGraph-SLP Auto-Vectorization. In 2017 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 330-342.
[22]
Vasileios Porpodas and Timothy M Jones. 2015. Throttling automatic vectorization: When less is more. In 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 432-444.
[23]
Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP Automatic Vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).
[24]
Gang Ren, Peng Wu, and David Padua. 2006. Optimizing Data Permutations for SIMD Devices. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI).
[25]
I. Rosen, D. Nuzman, and A. Zaks. 2007. Loop-aware SLP in GCC. In GCC Developers Summit.
[26]
Richard M Russell. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (1978).
[27]
J. Shin, M. Hall, and J. Chame. 2005. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization (CGO).
[28]
SPEC. 2014. Standard Performance Evaluation Corp Benchmarks. http://www.spec.org. (2014).
[29]
Michael Wolfe. 1988. Vector optimization vs. vectorization. In Supercomputing. Springer.
[30]
Michael Joseph Wolfe. 1995. High Performance Compilers for Parallel Computing. Addison-Wesley.
[31]
Hao Zhou and Jingling Xue. 2016. A compiler approach for exploiting partial SIMD parallelism. ACM Transactions on Architecture and Code Optimization (TACO) (2016).
[32]
Hao Zhou and Jingling Xue. 2016. Exploiting mixed SIMD parallelism by reducing data reorganization overhead. In Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 59-69.

Cited By

View all
  • (2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 15-Dec-2023
  • (2022)An SLP Vectorization Method Based on Equivalent Extended TransformationWireless Communications & Mobile Computing10.1155/2022/18325222022Online publication date: 1-Jan-2022
  • (2022)Loop rolling for code size reductionProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741256(217-229)Online publication date: 2-Apr-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization
February 2018
377 pages
ISBN:9781450356176
DOI:10.1145/3179541
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Auto-Vectorization
  2. SIMD
  3. SLP

Qualifiers

  • Research-article

Conference

CGO '18
Sponsor:

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)35
  • Downloads (Last 6 weeks)3
Reflects downloads up to 21 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 15-Dec-2023
  • (2022)An SLP Vectorization Method Based on Equivalent Extended TransformationWireless Communications & Mobile Computing10.1155/2022/18325222022Online publication date: 1-Jan-2022
  • (2022)Loop rolling for code size reductionProceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO53902.2022.9741256(217-229)Online publication date: 2-Apr-2022
  • (2021)HyFM: function merging for freeProceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3461648.3463852(110-121)Online publication date: 22-Jun-2021
  • (2021)PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized CodeLanguages and Compilers for Parallel Computing10.1007/978-3-030-72789-5_2(15-31)Online publication date: 26-Mar-2021
  • (2020)Effective function merging in the SSA formProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3385412.3386030(854-868)Online publication date: 11-Jun-2020
  • (2020)Vectorization-aware loop unrolling with seed forwardingProceedings of the 29th International Conference on Compiler Construction10.1145/3377555.3377890(1-13)Online publication date: 22-Feb-2020
  • (2019)Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elementsProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314897(206-216)Online publication date: 16-Feb-2019
  • (2019)Function merging by sequence alignmentProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314892(149-163)Online publication date: 16-Feb-2019
  • (2019)Super-Node SLP: Optimized Vectorization for Code Sequences Containing Operators and Their Inverse Elements2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO.2019.8661192(206-216)Online publication date: Feb-2019
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media