Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1109/MICRO.2010.38acmconferencesArticle/Chapter ViewAbstractPublication PagesmicroConference Proceedingsconference-collections
Article

Efficient Selection of Vector Instructions Using Dynamic Programming

Published: 04 December 2010 Publication History

Abstract

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of anauto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel{\em compile-time efficient dynamic programming-based} vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) {\em scalar packing} explores opportunities of packing multiple scalar variables into short vectors, (2)judicious use of {\em shuffle} and {\em horizontal} vector operations, when possible, and (3) {\em algebraic reassociation} expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show performance improvement of up to 57.71\% on an Intel Xeon processor, compared tonon-vectorized execution, with a modest increase in compile-time in the range from 0.87\% to 9.992\%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases and IFC does not. Finally, a comparison of our approach with an implementation of the Super word Level Parallelization (SLP) algorithm from~\cite{larsen00}, shows that our approach yields a performance improvement of up to 13.78\% relative to SLP.

References

[1]
Jikes RVM. http://jikesrvm.org/.
[2]
GCC compiler. http://gcc.gnu.org/, 2004.
[3]
A. V. Aho and S. C. Johnson. Optimal code generation for expression trees. In STOC '75: Proceedings of seventh annual ACM symposium on Theory of computing, pages 207-217, New York, NY, USA, 1975. ACM.
[4]
Intel Advanced Vector Extensions Programming Reference., 2008. Available at http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVX-Programming-Reference-31943302.pdf/.
[5]
Aart J. C. Bik. Applying Multimedia Extensions for Maximum Performance. Intel Press., 2004.
[6]
John Bruno and Ravi Sethi. Code generation for a one-register machine. J. ACM, 23(3):502-510, 1976.
[7]
Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, and Hunter Scales. Altivec extension to powerpc accelerates media processing. IEEE Micro, 20(2):85-95, 2000.
[8]
Alexandre E. Eichenberger, Peng Wu, and Kevin O'Brien. Vectorization for SIMD architectures with alignment constraints. SIGPLAN Not., 39(6):82-93, 2004.
[9]
D. H. Bailey et al. The NAS parallel benchmarks, 1994.
[10]
M.G. Burke et. al. The Jalapeño Dynamic Optimizing Compiler for Java. In ACM Java Grande Conference, June 1999.
[11]
Free Software Foundation. Auto-vectorization in GCC. GCC, 2004.
[12]
Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference, 1997. Available at http://developer.intel.com/design/PentiumII/manuals/.
[13]
The Java Grande Forum benchmark suite. http://www.epcc.ed.ac.uk/javagrande/javag.html.
[14]
Ken Kennedy and John R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[15]
David Ryan Koes and Seth Copen Goldstein. Near-optimal instruction selection on dags. In CGO '08, pages 45-54, New York, NY, USA, 2008.
[16]
S. Kral. FFT Specific Compilation on IBM Blue Gene. PhD thesis, Vienna University of Technology, June 2006.
[17]
S. Kral, F. Franchetti, J. Lorenz, and C. W. Ueberhuber. Fft compiler techniques. In Proceedings of International Conference on Compiler Construction (CC) 2004, pages 217-231. LNCS.
[18]
Andreas Krall and Sylvain Lelait. Compilation techniques for multimedia processors. Int. J. Parallel Program., 28(4):347-361, 2000.
[19]
Alexei Kudriavtsev and Peter Kogge. Generation of permutations for SIMD processors. In LCTES '05, pages 147-156, New York, NY, USA, 2005. ACM.
[20]
Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In PLDI '00: Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, pages 145-156, New York, NY, USA, 2000. ACM Press.
[21]
Samuel Larsen, Rodric Rabbah, and Saman Amarasinghe. Exploiting vector parallelism in software pipelined loops. In MICRO 38, pages 119- 129, Washington, DC, USA, 2005. IEEE Computer Society.
[22]
Samuel Larsen, Emmett Witchel, and Saman Amarasinghe. Increasing and detecting memory address congruence. In In International Conference on Parallel Architectures and Compilation Techniques, pages 18-29, 2002.
[23]
Bachega et. al. Leonardo. A high-performance SIMD floating point unit for Bluegene/L: Architecture, compilation, and algorithm design. In PACT '04, pages 85-96, Washington, DC, USA, 2004. IEEE Computer Society.
[24]
Rainer Leupers. Code selection for media processors with SIMD instructions, 2000.
[25]
Alex Peleg and Uri Weiser. MMX technology extension to the intel architecture. IEEE Micro, 16(4):42-50, 1996.
[26]
Ivan Pryanishnikov, Andreas Krall, and Nigel Horspool. Compiler optimizations for processors with SIMD instructions. Software--Practice and Experience, 37(1):93-113, 2007.
[27]
Vivek Sarkar, Mauricio J. Serrano, and Barbara B. Simons. Register-sensitive selection, duplication, and sequencing of instructions. In ICS '01: Proceedings of the 15th international conference on Supercomputing, pages 277-288, New York, NY, USA, 2001.
[28]
R. Vallée-Rai et al. SOOT - a Java Optimization Framework. In Proceedings of CASCON 1999, pages 125-135, 1999.

Cited By

View all
  • (2018)goSLP: globally optimized superword level parallelism frameworkProceedings of the ACM on Programming Languages10.1145/32764802:OOPSLA(1-28)Online publication date: 24-Oct-2018
  • (2018)Improving SIMD Parallelism via Dynamic Binary TranslationACM Transactions on Embedded Computing Systems10.1145/317345617:3(1-27)Online publication date: 12-Feb-2018
  • (2018)Look-ahead SLP: auto-vectorization in the presence of commutative operationsProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168807(163-174)Online publication date: 24-Feb-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO '43: Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
December 2010
542 pages
ISBN:9780769542997

Sponsors

Publisher

IEEE Computer Society

United States

Publication History

Published: 04 December 2010

Check for updates

Author Tags

  1. Dynamic Optimization
  2. Dynamic Programming
  3. Instruction Selection
  4. Vectorization

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2018)goSLP: globally optimized superword level parallelism frameworkProceedings of the ACM on Programming Languages10.1145/32764802:OOPSLA(1-28)Online publication date: 24-Oct-2018
  • (2018)Improving SIMD Parallelism via Dynamic Binary TranslationACM Transactions on Embedded Computing Systems10.1145/317345617:3(1-27)Online publication date: 12-Feb-2018
  • (2018)Look-ahead SLP: auto-vectorization in the presence of commutative operationsProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168807(163-174)Online publication date: 24-Feb-2018
  • (2018)Loop-Oriented Pointer Analysis for Automatic SIMD VectorizationACM Transactions on Embedded Computing Systems10.1145/316836417:2(1-31)Online publication date: 30-Jan-2018
  • (2018)Automated Compiler Optimization of Multiple Vector Loads/StoresInternational Journal of Parallel Programming10.1007/s10766-016-0485-746:2(471-503)Online publication date: 1-Apr-2018
  • (2017)Improving the effectiveness of searching for isomorphic chains in superword level parallelismProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3123939.3124554(718-729)Online publication date: 14-Oct-2017
  • (2016)FlexVec: auto-vectorization for irregular loopsACM SIGPLAN Notices10.1145/2980983.290811151:6(697-710)Online publication date: 2-Jun-2016
  • (2016)FlexVec: auto-vectorization for irregular loopsProceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/2908080.2908111(697-710)Online publication date: 2-Jun-2016
  • (2016)Loop-oriented array- and field-sensitive pointer analysis for automatic SIMD vectorizationProceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems10.1145/2907950.2907957(41-51)Online publication date: 13-Jun-2016
  • (2016)Vectorization in PyPy's Tracing Just-In-Time CompilerProceedings of the 19th International Workshop on Software and Compilers for Embedded Systems10.1145/2906363.2906384(67-76)Online publication date: 23-May-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media