research-article

Open access

A Compiler Approach for Exploiting Partial SIMD Parallelism

Authors:

Jingling Xue

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 1

Article No.: 11, Pages 1 - 26

https://doi.org/10.1145/2886101

Published: 28 March 2016

Abstract

Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently, where the degree of partial SIMD parallelism is smaller than the SIMD datapath width. We present a simple yet effective SLP compiler technique called Paver (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening vector instructions used while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C/C++/Fortran applications with partial SIMD parallelism, Paver achieves significantly better kernel and whole-program speedups than LLVM on both Intel’s AVX and ARM’s NEON.
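
To make the notion of partial SIMD parallelism concrete, the following is a minimal sketch in plain C. It is not Paver's algorithm or its LLVM implementation. It assumes a hypothetical 4-lane datapath (`LANES`) and a kernel with only three isomorphic statements per iteration (`ACTIVE`), and it emulates masked loads and stores with scalar loops; all function and macro names are illustrative. The inactive lane is never read from or written to memory and is filled with a benign value, mirroring the abstract's requirement that no new memory errors or numeric exceptions be introduced.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* assumed SIMD width in float lanes (hypothetical) */
#define ACTIVE 3  /* degree of partial parallelism: x, y, z components */

/* Scalar reference: three isomorphic statements per iteration. */
void scale_add_scalar(float *out, const float *a, const float *b,
                      float s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[3 * i + 0] = a[3 * i + 0] * s + b[3 * i + 0];
        out[3 * i + 1] = a[3 * i + 1] * s + b[3 * i + 1];
        out[3 * i + 2] = a[3 * i + 2] * s + b[3 * i + 2];
    }
}

/* Emulated masked load: reads only the ACTIVE lanes from memory.
 * The inactive lane gets 0.0f so later arithmetic on it is harmless. */
static void masked_load(float v[LANES], const float *p) {
    for (int l = 0; l < LANES; l++)
        v[l] = (l < ACTIVE) ? p[l] : 0.0f;
}

/* Emulated masked store: writes only the ACTIVE lanes back to memory. */
static void masked_store(float *p, const float v[LANES]) {
    for (int l = 0; l < ACTIVE; l++)
        p[l] = v[l];
}

/* Partially vectorized form: one 4-lane multiply-add per iteration,
 * of which only 3 lanes carry useful work. */
void scale_add_partial(float *out, const float *a, const float *b,
                       float s, size_t n) {
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        float va[LANES], vb[LANES], vr[LANES];
        masked_load(va, a + 3 * i);
        masked_load(vb, b + 3 * i);
        for (int l = 0; l < LANES; l++)  /* stands in for one vector op */
            vr[l] = va[l] * s + vb[l];
        masked_store(out + 3 * i, vr);
    }
}
```

For `n` triples, `scale_add_partial` is intended to produce the same results as `scale_add_scalar` while touching exactly `3 * n` floats in each array, so an array that ends right after the last triple is never overrun.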



Index Terms

A Compiler Approach for Exploiting Partial SIMD Parallelism
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation


Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 1

April 2016

347 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2899032

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2016

Accepted: 01 January 2016

Revised: 01 November 2015

Received: 01 August 2015

Published in TACO Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Australian Research Council

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 37
Total Downloads: 1,222
Downloads (Last 12 months): 185
Downloads (Last 6 weeks): 14

Reflects downloads up to 24 Nov 2024

Citations

Cited By

Zhu F, Qi X, Zhang P, Fang J, Tang T, Che Y, Yu K, Xie J, Huang C, Ren J (2024). Optimizing Stencil Computation on Multi-core DSPs. Proceedings of the 53rd International Conference on Parallel Processing, 679-690. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673062
Zhou H, Han Q, Shi H, Zhang Y, Yao J, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Boost Linear Algebra Computation Performance via Efficient VNNI Utilization. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 149-163. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620666.3651333
Guan X, Zhou H, Bao G, Li H, Zhu L, Yao J, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 170-181. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444841
Zhang Z, Ou Y, Liu Y, Wang C, Zhou Y, Wang X, Zhang Y, Ouyang Y, Shan J, Wang Y, Xue J, Cui H, Feng X, Aamodt T, Jerger N, Swift M (2023). Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 483-497. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3582016.3582046
Yao J, Zhou H, Zhang Y, Li Y, Feng C, Chen S, Chen J, Wang Y, Hu Q (2023). High Performance and Power Efficient Accelerator for Cloud Inference. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1003-1016. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070941
Liu B, Laird A, Tsang W, Mahjour B, Dehnavi M, Kloeckner A, Moreira J (2022). Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-Vectorization. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 439-450. Online publication date: 8-Oct-2022.
https://dl.acm.org/doi/10.1145/3559009.3569663
Behroozi A, Park S, Mahlke S, Egger B, Smith A (2022). Loner: utilizing the CPU vector datapath to process scalar integer data. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, 205-217. Online publication date: 19-Mar-2022.
https://dl.acm.org/doi/10.1145/3497776.3517767
Yuan L, Cao H, Zhang Y, Li K, Lu P, Yue Y, de Supinski B, Hall M, Gamblin T (2021). Temporal vectorization for stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. Online publication date: 14-Nov-2021.
https://dl.acm.org/doi/10.1145/3458817.3476149
Porpodas V, Ratnalikar P (2021). PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. Languages and Compilers for Parallel Computing, 15-31. Online publication date: 26-Mar-2021.
https://doi.org/10.1007/978-3-030-72789-5_2
de Fine Licht J, Kwasniewski G, Hoefler T, Neuendorffer S, Shannon L (2020). Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis. Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 244-254. Online publication date: 23-Feb-2020.
https://dl.acm.org/doi/10.1145/3373087.3375296
