Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Open access

Automatic Vectorization of Interleaved Data Revisited

Published: 08 December 2015 Publication History

Abstract

Automatically exploiting short vector instructions sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-2 interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations, which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-2 and non--power-of-2 interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.

Supplementary Material

TACO1204-50 (taco1204-50.pdf)
Slide deck associated with this paper

References

[1]
Alfred V. Aho, Mahadevan Ganapathi, and Steven W. K. Tjiang. 1989. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems (TOPLAS) 11, 4 (1989), 491--516.
[2]
A. Balachandran, Dhananjay M. Dhamdhere, and S. Biswas. 1990. Efficient retargetable code generation using bottom-up tree pattern matching. Computer Languages 15, 3 (1990), 127--140.
[3]
Rajkishore Barik, Jisheng Zhao, and Vivek Sarkar. 2010. Efficient selection of vector instructions using dynamic programming. In MICRO. 201--212.
[4]
Albert Cohen, Sylvain Girbal, and Olivier Temam. 2004. A polyhedral approach to ease the composition of program transformations. In Euro-Par. 292--303.
[5]
Alexandre E. Eichenberger, Peng Wu, and Kevin O’Brien. 2004. Vectorization for SIMD architectures with alignment constraints. In PLDI. 82--93.
[6]
Liza Fireman, Erez Petrank, and Ayal Zaks. 2007. New algorithms for SIMD alignment. In Compiler Construction. Springer, 1--15.
[7]
Franz Franchetti and Markus Puschel. 1993. A SIMD vectorizing compiler for digital signal processing algorithms. In Proceedings of the 1993 IEEE Vehicle Navigation and Information Systems Conference. 7.
[8]
Franz Franchetti and Markus Püschel. 2008. Generating SIMD vectorized permutations. In CC. 116--131.
[9]
Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. 1992. BURG: Fast optimal instruction selection and tree parsing. ACM SIGPLAN Notices 27, 4 (1992), 68--76.
[10]
Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In CGO. 141--150.
[11]
Alexei Kudriavtsev and Peter Kogge. 2005. Generation of permutations for SIMD processors. SIGPLAN Not. 40, 7 (June 2005), 147--156.
[12]
Samuel Larsen and Saman P. Amarasinghe. 2000. Exploiting superword level parallelism with multimedia instruction sets. In PLDI. 145--156.
[13]
Chuck L. Lawson, Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. 1979. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software (TOMS) 5, 3 (1979), 308--323.
[14]
Ruby B. Lee. 1995. Accelerating multimedia with enhanced microprocessors. IEEE Micro 15, 2 (1995), 22--32.
[15]
Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut T. Kandemir. 2012. A compiler framework for extracting superword level parallelism. In PLDI. 347--358.
[16]
Saeed Maleki, Yaoqing Gao, María Jesús Garzarán, Tommy Wong, and David A. Padua. 2011. An Evaluation of vectorizing compilers. In PACT. 372--382.
[17]
Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. In PLDI. 132--143.
[18]
Gabriele Paoloni. 2010. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. Intel Corporation. Retrieved from http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf.
[19]
Yongjun Park, Sangwon Seo, Hyunchul Park, Hyoun Kyu Cho, and Scott Mahlke. 2012. SIMD defragmenter: Efficient ILP realization on data-parallel architectures. In ACM SIGARCH Computer Architecture News, Vol. 40. ACM, 363--374.
[20]
Gang Ren, Peng Wu, and David Padua. 2006. Optimizing data permutations for SIMD devices. In ACM SIGPLAN Notices, Vol. 41. ACM, 118--131.
[21]
Kenneth Rosen. 2011. Discrete Mathematics and Its Applications (7th ed.). McGraw-Hill.
[22]
Thomas Schaub, Simon Moll, Ralf Karrenberg, and Sebastian Hack. 2015. The impact of the SIMD width on control-flow and memory divergence. ACM Transactions on Architecture and Code Optimization (TACO) 11, 4 (2015), 54.
[23]
Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In IEEE PACT. 45--55.
[24]
Jaewook Shin, Mary Hall, and Jacqueline Chame. 2005. Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 165--175.
[25]
Deependra Talla, Lizy Kurian John, and Doug Burger. 2003. Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers 52, 8 (2003), 1015--1031.
[26]
Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel® Xeon Phi. Springer, 167--188.
[27]
R. Clint Whaley, Antoine Petitet, and Jack J Dongarra. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1 (2001), 3--35.
[28]
Michael Joseph Wolfe. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley.

Cited By

View all
  • (2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 9-Nov-2023
  • (2022)Loner: utilizing the CPU vector datapath to process scalar integer dataProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517767(205-217)Online publication date: 19-Mar-2022
  • (2021)Evaluation of Compilers’ Capability of Automatic Vectorization Based on Source Code AnalysisScientific Programming10.1155/2021/32646242021(1-15)Online publication date: 30-Nov-2021
  • Show More Cited By

Index Terms

  1. Automatic Vectorization of Interleaved Data Revisited

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 4
    January 2016
    848 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2836331
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 December 2015
    Accepted: 01 October 2015
    Revised: 01 September 2015
    Received: 01 April 2015
    Published in TACO Volume 12, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. SIMD
    2. Vectorization
    3. data permutation
    4. interleaving

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Irish Software Engineering Research Centre (www.lero.ie)
    • Science Foundation Ireland

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)198
    • Downloads (Last 6 weeks)43
    Reflects downloads up to 20 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph TransformationsACM Transactions on Architecture and Code Optimization10.1145/363170921:1(1-25)Online publication date: 9-Nov-2023
    • (2022)Loner: utilizing the CPU vector datapath to process scalar integer dataProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517767(205-217)Online publication date: 19-Mar-2022
    • (2021)Evaluation of Compilers’ Capability of Automatic Vectorization Based on Source Code AnalysisScientific Programming10.1155/2021/32646242021(1-15)Online publication date: 30-Nov-2021
    • (2021)Early Address PredictionACM Transactions on Architecture and Code Optimization10.1145/345888318:3(1-22)Online publication date: 8-Jun-2021
    • (2021)Flynn’s ReconciliationACM Transactions on Architecture and Code Optimization10.1145/345835718:3(1-26)Online publication date: 8-Jun-2021
    • (2021)Understanding Cache CompressionACM Transactions on Architecture and Code Optimization10.1145/345720718:3(1-27)Online publication date: 8-Jun-2021
    • (2021)Automatic Sublining for Efficient Sparse Memory AccessesACM Transactions on Architecture and Code Optimization10.1145/345214118:3(1-23)Online publication date: 10-May-2021
    • (2021)PAVERACM Transactions on Architecture and Code Optimization10.1145/345116418:3(1-26)Online publication date: 8-Jun-2021
    • (2021)PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized CodeLanguages and Compilers for Parallel Computing10.1007/978-3-030-72789-5_2(15-31)Online publication date: 26-Mar-2021
    • (2019)Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elementsProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization10.5555/3314872.3314897(206-216)Online publication date: 16-Feb-2019
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media