research-article

Open access

Revec: program rejuvenation through revectorization

Authors:

Charith Mendis,

Saman Amarasinghe

CC 2019: Proceedings of the 28th International Conference on Compiler Construction

Pages 29 - 41

https://doi.org/10.1145/3302516.3307357

Published: 16 February 2019

Abstract

Modern microprocessors are equipped with Single Instruction Multiple Data (SIMD), or vector, instructions, which expose data-level parallelism at a fine granularity. Programmers exploit this parallelism by using low-level vector intrinsics in their code. However, once a program is written using the vector intrinsics of a specific instruction set, the code becomes non-portable. Modern compilers are unable to analyze and retarget the code to newer vector instruction sets. Hence, programmers have to manually rewrite the same code using vector intrinsics of a newer generation to exploit the higher data widths and capabilities of new instruction sets. This process is tedious and error-prone, and it requires maintaining multiple code bases. We propose Revec, a compiler optimization pass that revectorizes already vectorized code by retargeting it to use vector instructions of newer generations. The transformation is transparent, happening at the compiler intermediate representation level, and enables performance portability of hand-vectorized code.
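To make the idea of retargeting to a wider vector width concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not Revec's algorithm: vectors are modeled as Python lists, the names `packed_add` and `add_arrays` are hypothetical, and the lane counts simply mimic 32-bit integer lanes in 128-bit (SSE-style) and 256-bit (AVX2-style) registers. It shows that the same element-wise computation yields identical results with half as many vector operations once the width doubles.

```python
# Toy model of revectorization; an illustration, not Revec's actual pass.
# A "vector register" is a Python list of 32-bit lanes.

SSE_LANES = 4    # 128-bit register holding 32-bit lanes
AVX2_LANES = 8   # 256-bit register holding 32-bit lanes
MASK32 = 0xFFFFFFFF


def packed_add(a, b):
    """Lane-wise wrapping 32-bit add, standing in for one vector instruction."""
    assert len(a) == len(b)
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_arrays(a, b, lanes):
    """Add two equal-length arrays `lanes` elements at a time.

    Returns the result and the number of vector operations issued.
    Assumes len(a) is a multiple of `lanes` (no scalar tail handling).
    """
    assert len(a) == len(b) and len(a) % lanes == 0
    out, ops = [], 0
    for i in range(0, len(a), lanes):
        out.extend(packed_add(a[i:i + lanes], b[i:i + lanes]))
        ops += 1
    return out, ops


a = list(range(16))
b = [10 * x for x in range(16)]
narrow, narrow_ops = add_arrays(a, b, SSE_LANES)   # hand-written narrow version
wide, wide_ops = add_arrays(a, b, AVX2_LANES)      # retargeted wide version
```

In this model the narrow and wide versions compute the same output, while the wide version issues half as many vector operations. Real retargeting must also handle operations that do not widen lane by lane, such as shuffles, which this sketch ignores.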

Revec can achieve performance improvements in real-world performance-critical kernels. In particular, Revec achieves geometric mean speedups of 1.160× and 1.430× on fast integer unpacking kernels, and speedups of 1.145× and 1.195× on hand-vectorized x265 media codec kernels, when retargeting their SSE-series implementations to use AVX2 and AVX-512 vector instructions, respectively. We also extensively test Revec’s impact on 216 intrinsic-rich implementations of image processing and stencil kernels relative to hand-retargeting.
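For readers unfamiliar with how a geometric mean speedup summarizes per-kernel results, the following is a small sketch. The function name `geometric_mean` and the input values are assumptions made for illustration; the values are not measurements from the paper.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive speedup ratios (baseline time / new time)."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    # Averaging in log space avoids overflow from multiplying many ratios.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-kernel speedups, chosen only to make the arithmetic easy.
example = [1.0, 2.0, 4.0]
summary = geometric_mean(example)  # cube root of 1 * 2 * 4 = 2.0
```

The geometric mean is the usual choice for ratios because a 2× speedup on one kernel and a 2× slowdown on another cancel to 1×, which an arithmetic mean would not report.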

Index Terms

Revec: program rejuvenation through revectorization
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

CC 2019: Proceedings of the 28th International Conference on Compiler Construction

February 2019

204 pages

ISBN: 9781450362771

DOI: 10.1145/3302516

General Chair: J. Nelson Amaral (University of Alberta, Canada)

Program Chair: Milind Kulkarni (Purdue University, USA)

Copyright © 2019 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Defense Advanced Research Projects Agency

Conference

CC '19

CC '19: 28th International Conference on Compiler Construction

February 16 - 17, 2019

Washington, DC, USA
