research-article

Open access

Automatic Vectorization of Interleaved Data Revisited

Authors:

Andrew Anderson,

David GreggAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 4

Article No.: 50, Pages 1 - 25

https://doi.org/10.1145/2838735

Published: 08 December 2015 Publication History

Abstract

Automatically exploiting short vector instructions sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-2 interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations, which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-2 and non--power-of-2 interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.

Supplementary Material

TACO1204-50 (taco1204-50.pdf)

Slide deck associated with this paper

Download
199.17 KB

References

[1]

Alfred V. Aho, Mahadevan Ganapathi, and Steven W. K. Tjiang. 1989. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems (TOPLAS) 11, 4 (1989), 491--516.

Digital Library

[2]

A. Balachandran, Dhananjay M. Dhamdhere, and S. Biswas. 1990. Efficient retargetable code generation using bottom-up tree pattern matching. Computer Languages 15, 3 (1990), 127--140.

Digital Library

[3]

Rajkishore Barik, Jisheng Zhao, and Vivek Sarkar. 2010. Efficient selection of vector instructions using dynamic programming. In MICRO. 201--212.

Digital Library

[4]

Albert Cohen, Sylvain Girbal, and Olivier Temam. 2004. A polyhedral approach to ease the composition of program transformations. In Euro-Par. 292--303.

[5]

Alexandre E. Eichenberger, Peng Wu, and Kevin O’Brien. 2004. Vectorization for SIMD architectures with alignment constraints. In PLDI. 82--93.

Digital Library

[6]

Liza Fireman, Erez Petrank, and Ayal Zaks. 2007. New algorithms for SIMD alignment. In Compiler Construction. Springer, 1--15.

Digital Library

[7]

Franz Franchetti and Markus Puschel. 1993. A SIMD vectorizing compiler for digital signal processing algorithms. In Proceedings of the 1993 IEEE Vehicle Navigation and Information Systems Conference. 7.

[8]

Franz Franchetti and Markus Püschel. 2008. Generating SIMD vectorized permutations. In CC. 116--131.

Digital Library

[9]

Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. 1992. BURG: Fast optimal instruction selection and tree parsing. ACM SIGPLAN Notices 27, 4 (1992), 68--76.

Digital Library

[10]

Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In CGO. 141--150.

Digital Library

[11]

Alexei Kudriavtsev and Peter Kogge. 2005. Generation of permutations for SIMD processors. SIGPLAN Not. 40, 7 (June 2005), 147--156.

Digital Library

[12]

Samuel Larsen and Saman P. Amarasinghe. 2000. Exploiting superword level parallelism with multimedia instruction sets. In PLDI. 145--156.

Digital Library

[13]

Chuck L. Lawson, Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. 1979. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software (TOMS) 5, 3 (1979), 308--323.

Digital Library

[14]

Ruby B. Lee. 1995. Accelerating multimedia with enhanced microprocessors. IEEE Micro 15, 2 (1995), 22--32.

Digital Library

[15]

Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut T. Kandemir. 2012. A compiler framework for extracting superword level parallelism. In PLDI. 347--358.

Digital Library

[16]

Saeed Maleki, Yaoqing Gao, María Jesús Garzarán, Tommy Wong, and David A. Padua. 2011. An Evaluation of vectorizing compilers. In PACT. 372--382.

Digital Library

[17]

Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. In PLDI. 132--143.

Digital Library

[18]

Gabriele Paoloni. 2010. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. Intel Corporation. Retrieved from http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf.

[19]

Yongjun Park, Sangwon Seo, Hyunchul Park, Hyoun Kyu Cho, and Scott Mahlke. 2012. SIMD defragmenter: Efficient ILP realization on data-parallel architectures. In ACM SIGARCH Computer Architecture News, Vol. 40. ACM, 363--374.

Digital Library

[20]

Gang Ren, Peng Wu, and David Padua. 2006. Optimizing data permutations for SIMD devices. In ACM SIGPLAN Notices, Vol. 41. ACM, 118--131.

Digital Library

[21]

Kenneth Rosen. 2011. Discrete Mathematics and Its Applications (7th ed.). McGraw-Hill.

[22]

Thomas Schaub, Simon Moll, Ralf Karrenberg, and Sebastian Hack. 2015. The impact of the SIMD width on control-flow and memory divergence. ACM Transactions on Architecture and Code Optimization (TACO) 11, 4 (2015), 54.

Digital Library

[23]

Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In IEEE PACT. 45--55.

Digital Library

[24]

Jaewook Shin, Mary Hall, and Jacqueline Chame. 2005. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 165--175.

Digital Library

[25]

Deependra Talla, Lizy Kurian John, and Doug Burger. 2003. Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers 52, 8 (2003), 1015--1031.

Digital Library

[26]

Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel^® Xeon Phi. Springer, 167--188.

[27]

R. Clint Whaley, Antoine Petitet, and Jack J Dongarra. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1 (2001), 3--35.

[28]

Michael Joseph Wolfe. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley.

Cited By

Tayeb HPaillat LBramas B(2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 9-Nov-2023
https://dl.acm.org/doi/10.1145/3631709
Behroozi APark SMahlke SEgger BSmith A(2022)Loner: utilizing the CPU vector datapath to process scalar integer dataProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517767(205-217)Online publication date: 19-Mar-2022
https://dl.acm.org/doi/10.1145/3497776.3517767
Feng JHe YTao Q(2021)Evaluation of Compilers’ Capability of Automatic Vectorization Based on Source Code AnalysisScientific Programming10.1155/2021/32646242021(1-15)Online publication date: 30-Nov-2021
https://doi.org/10.1155/2021/3264624
Show More Cited By

Index Terms

Automatic Vectorization of Interleaved Data Revisited
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Outer-loop vectorization: revisited for short SIMD architectures
PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multi-media and ...
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 PLDI Conference

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data ...
Auto-vectorization of interleaved data for SIMD
PLDI '06: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation

Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 4

January 2016

848 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2836331

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2015

Accepted: 01 October 2015

Revised: 01 September 2015

Received: 01 April 2015

Published in TACO Volume 12, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Irish Software Engineering Research Centre (www.lero.ie)
Science Foundation Ireland

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
909
Total Downloads

Downloads (Last 12 months)198
Downloads (Last 6 weeks)43

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Tayeb HPaillat LBramas B(2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 9-Nov-2023
https://dl.acm.org/doi/10.1145/3631709
Behroozi APark SMahlke SEgger BSmith A(2022)Loner: utilizing the CPU vector datapath to process scalar integer dataProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517767(205-217)Online publication date: 19-Mar-2022
https://dl.acm.org/doi/10.1145/3497776.3517767
Feng JHe YTao Q(2021)Evaluation of Compilers’ Capability of Automatic Vectorization Based on Source Code AnalysisScientific Programming10.1155/2021/32646242021(1-15)Online publication date: 30-Nov-2021
https://doi.org/10.1155/2021/3264624
Alves RKaxiras SBlack-Schaffer D(2021)Early Address PredictionACM Transactions on Architecture and Code Optimization10.1145/345888318:3(1-22)Online publication date: 8-Jun-2021
https://dl.acm.org/doi/10.1145/3458883
Thuerck DWeber NBifulco R(2021)Flynn’s ReconciliationACM Transactions on Architecture and Code Optimization10.1145/345835718:3(1-26)Online publication date: 8-Jun-2021
https://dl.acm.org/doi/10.1145/3458357
Carvalho DSeznec A(2021)Understanding Cache CompressionACM Transactions on Architecture and Code Optimization10.1145/345720718:3(1-27)Online publication date: 8-Jun-2021
https://dl.acm.org/doi/10.1145/3457207
Heirman WEyerman SBois KHur I(2021)Automatic Sublining for Efficient Sparse Memory AccessesACM Transactions on Architecture and Code Optimization10.1145/345214118:3(1-23)Online publication date: 10-May-2021
https://dl.acm.org/doi/10.1145/3452141
Tripathy DAbdolrashidi ABhuyan LZhou LWong D(2021)PAVERACM Transactions on Architecture and Code Optimization10.1145/345116418:3(1-26)Online publication date: 8-Jun-2021
https://dl.acm.org/doi/10.1145/3451164
Porpodas VRatnalikar P(2021)PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized CodeLanguages and Compilers for Parallel Computing10.1007/978-3-030-72789-5_2(15-31)Online publication date: 26-Mar-2021
https://doi.org/10.1007/978-3-030-72789-5_2
Porpodas VRocha RBrevnov EGóes LMattson TKandemir MJimborean AMoseley T(2019)Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elementsProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314897(206-216)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314897
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents