Article

Exploiting Vector Parallelism in Software Pipelined Loops

Authors:

Saman AmarasingheAuthors Info & Claims

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

Pages 119 - 129

https://doi.org/10.1109/MICRO.2005.20

Published: 12 November 2005 Publication History

Publisher Site Get Access

Abstract

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vector parallelism in software pipelined loops. The proposed methodology. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where vectorization decisions are more amenable to cost analysis. This is unique in that traditional vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction and data level parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38x, with average of 1.11x

References

[1]

{1} Codeplay VectorC. http://www.codeplay.com.

[2]

{2} IBM XL C/C++ and Fortran compilers. http://www-306.ibm.com/software/awdtools/xlcpp/.

[3]

{3} Trimaran Research Infrastructure. http://www.trimaran.org.

[4]

{4} VAST-C/AltiVec. http://www.crescentbaysoftware.com.

[5]

{5} A. Aletà, J. M. Codina, J. Sánchez, and A. González. Graph-Partitioning Based Instruction Scheduling for Clustered Processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 150-159, Austin, TX, December 2001.

Digital Library

[6]

{6} R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, California, 2001.

Digital Library

[7]

{7} A. J. Bik. The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance. Intel Press, Hillsboro, OR, 2004.

Digital Library

[8]

{8} J. M. Codina, J. Sánchez, and A. González. A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. In Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques , pages 175-184, Barcelona, Spain, September 2001.

Digital Library

[9]

{9} A. Darte. On the complexity of loop fusion. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, pages 149-157, Newport Beach, CA, October 1999.

Digital Library

[10]

{10} D. J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.

[11]

{11} K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales. AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro, 20(2):85-95, March 2000.

Digital Library

[12]

{12} A. E. Eichenberger and E. S. Davidson. Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180-191, Ann Arbor, MI, November 1995.

Digital Library

[13]

{13} A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the SIGPLAN '04 Conference on Programming Language Design and Implementation, pages 82-93, Washington, DC, June 2004.

Digital Library

[14]

{14} J. Fridman and Z. Greenfield. The TigerSHARC DSP Architecture. IEEE Micro, 20(1):66-76, January 2000.

Digital Library

[15]

{15} R. A. Huff. Lifetime-Sensitive Modulo Scheduling. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258-267, Albuquerque, NM, June 1993.

Digital Library

[16]

{16} B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49:291-307, February 1970.

[17]

{17} A. Krall and S. Lelait. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming , 28(4):347-361, August 2000.

[18]

{18} A. Kudriavtsev and P. Kogge. Generation of Permutations for SIMD Processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 147-156, Chicago, IL, June 2005.

Digital Library

[19]

{19} M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988.

Digital Library

[20]

{20} S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145-156, Vancouver, BC, June 2000.

Digital Library

[21]

{21} S. Larsen, E. Witchel, and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques, pages 18-29, Charlottesville, VA, September 2002.

Digital Library

[22]

{22} D. M. Lavery and W. mei W. Hwu. Modulo Scheduling of Loops in Control-Intensive Non-Numeric Programs. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 126-137, Paris, France, December 1996.

Digital Library

[23]

{23} R. Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.

Digital Library

[24]

{24} C. McNairy and D. Soltis. Itanium 2 Processor Microarchitecture. IEEE Micro, 23(2):44-55, March 2003.

Digital Library

[25]

{25} D. Naishlos. Autovectorization in GCC. In Proceedings of the 2004 GCC Developers Summit, pages 105-118, 2004.

[26]

{26} D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP Architecture. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2-11, San Jose, CA, October 2003.

Digital Library

[27]

{27} E. Nystrom and A. E. Eichenberger. Effective Cluster Assignment for Modulo Scheduling. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 103-114, Dallas, TX, December 1998.

Digital Library

[28]

{28} I. Pryanishnikov, A. Krall, and N. Horspool. Pointer Alignment Analysis for Processors with SIMD Instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50-57, San Diego, CA, December 2003.

[29]

{29} S. K. Raman, V. Pentkovski, and J. Keshava. Implementing Streaming SIMD Extensions on the Pentium III Processor. IEEE Micro, 20(4):47-57, July 2000.

Digital Library

[30]

{30} B. Rau, M. Lee, P. Tirumalai, and M. Schlansker. Register Allocation for Software Pipelined Loops. In Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San Francisco, CA, June 1992.

Digital Library

[31]

{31} B. R. Rau. Iterative Modulo Scheduling. Technical Report HPL-94-115, Hewlett Packard Company, November 1995.

[32]

{32} B. R. Rau, M. S. Schlansker, and P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In Proceedings of the 25th Annual International Symposium on Microarchitecture , pages 158-169, Portland, OR, December 1992.

Digital Library

[33]

{33} J. Shin, M. Hall, and J. Chame. Superword-Level Parallelism in the Presence of Control Flow. In Proceedings of the International Symposium on Code Generation and Optimization , pages 165-175, San Jose, CA, March 2005.

Digital Library

[34]

{34} J. Shin, J. Chame, and M. Hall. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architecture. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques , pages 45-55, Charlottesville, VA, September 2002.

Digital Library

[35]

{35} N. Sreraman and R. Govindarajan. A Vectorizing Compiler for Multimedia Extensions. International Journal of Parallel Programming, 28(4):363-400, August 2000.

[36]

{36} R. E. Tarjan. Depth First Search and Linear Graph Algorithms. SIAM Journal of Computing, 1(2):146-160, June 1972.

Digital Library

[37]

{37} M. Tremblay, M. O'Connor, V. Narayanan, and L. He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10-20, August 1996.

Digital Library

[38]

{38} R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31-37, December 1994.

Digital Library

[39]

{39} M. J. Wolfe. High Performance Compilers for Parallel Computing . Addison-Wesley, Redwood City, California, 1996.

Digital Library

[40]

{40} P. Wu, A. E. Eichenberger, and A. Wang. Efficient SIMD Code Generation for Runtime Alignment and Length Conversion. In Proceedings of the International Symposium on Code Generation and Optimization, pages 153-164, San Jose, CA, March 2005.

Digital Library

[41]

{41} P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An Integrated Simdization Framework Using Virtual Vectors. In Proceedings of the 19th ACM International Conference on Supercomputing, pages 169-178, Cambridge, MA, June 2005.

Digital Library

[42]

{42} J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Modulo Scheduling with Integrated Register Spilling for Clustered VLIW Architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 160- 169, Austin, TX, December 2001.

Digital Library

[43]

{43} Y. Zhao and K. Kennedy. Scalarization on Short Vector Machines. In IEEE International Symposium on Performance Analysis of Systems and Software, pages 187-196, Austin, TX, March 2005.

Digital Library

Cited By

Li HHan JHan D(2020)Leveraging SIMD parallelism for accelerating network applicationsProceedings of the 4th Asia-Pacific Workshop on Networking10.1145/3411029.3411033(23-29)Online publication date: 3-Aug-2020
https://dl.acm.org/doi/10.1145/3411029.3411033
Kim SHan H(2012)Efficient SIMD code generation for irregular kernelsACM SIGPLAN Notices10.1145/2370036.214582447:8(55-64)Online publication date: 25-Feb-2012
https://dl.acm.org/doi/10.1145/2370036.2145824
Kim SHan HRamanujam JSadayappan P(2012)Efficient SIMD code generation for irregular kernelsProceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming10.1145/2145816.2145824(55-64)Online publication date: 25-Feb-2012
https://dl.acm.org/doi/10.1145/2145816.2145824
Show More Cited By

Index Terms

Exploiting Vector Parallelism in Software Pipelined Loops

Recommendations

Register allocation for software pipelined multi-dimensional loops
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation

Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop iterations to explore instruction-level parallelism from the entire n-dimensional iteration space. This paper ...
Register allocation for software pipelined multi-dimensional loops
PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation

Software pipelining of a multi-dimensional loop is an important optimization that overlaps the execution of successive outermost loop iterations to explore instruction-level parallelism from the entire n-dimensional iteration space. This paper ...
Register allocation for software pipelined multidimensional loops

This article investigates register allocation for software pipelined multidimensional loops where the execution of successive iterations from an n-dimensional loop is overlapped. For single loop software pipelining, the lifetimes of a loop variable in ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

November 2005

350 pages

ISBN:0769524400

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Publication History

Published: 12 November 2005

Check for updates

Qualifiers

Article

Conference

Micro-38

Sponsor:

SIGMICRO

Micro-38: The 38th Annual IEEE/ACM International Symposium on Microarchitecture

November 12 - 16, 2005

Barcelona, Spain

Acceptance Rates

MICRO 38 Paper Acceptance Rate 29 of 147 submissions, 20%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
23
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Li HHan JHan D(2020)Leveraging SIMD parallelism for accelerating network applicationsProceedings of the 4th Asia-Pacific Workshop on Networking10.1145/3411029.3411033(23-29)Online publication date: 3-Aug-2020
https://dl.acm.org/doi/10.1145/3411029.3411033
Kim SHan H(2012)Efficient SIMD code generation for irregular kernelsACM SIGPLAN Notices10.1145/2370036.214582447:8(55-64)Online publication date: 25-Feb-2012
https://dl.acm.org/doi/10.1145/2370036.2145824
Kim SHan HRamanujam JSadayappan P(2012)Efficient SIMD code generation for irregular kernelsProceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming10.1145/2145816.2145824(55-64)Online publication date: 25-Feb-2012
https://dl.acm.org/doi/10.1145/2145816.2145824
Barik RZhao JSarkar V(2010)Efficient Selection of Vector Instructions Using Dynamic ProgrammingProceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2010.38(201-212)Online publication date: 4-Dec-2010
https://dl.acm.org/doi/10.1109/MICRO.2010.38
Manley RGregg D(2009)Mapping streaming languages to general purpose processors through vectorizationProceedings of the 22nd international conference on Languages and Compilers for Parallel Computing10.1007/978-3-642-13374-9_7(95-110)Online publication date: 8-Oct-2009
https://dl.acm.org/doi/10.1007/978-3-642-13374-9_7
Hampton MAsanovic KSoffa MDuesterwald E(2008)Compiling for vector-thread architecturesProceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization10.1145/1356058.1356085(205-215)Online publication date: 6-Apr-2008
https://dl.acm.org/doi/10.1145/1356058.1356085
Wang MWu GWang Z(2007)Instruction selection for subword level parallelism optimizations for application specific instruction processorsProceedings of the 5th international conference on Parallel and Distributed Processing and Applications10.5555/2395970.2396061(946-957)Online publication date: 29-Aug-2007
https://dl.acm.org/doi/10.5555/2395970.2396061
Tanaka HTakeuchi YSakanushi KImai MTagawa HOta YMatsumoto N(2007)Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision DiagramIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1093/ietfec/e90-a.12.2800E90-A:12(2800-2809)Online publication date: 1-Dec-2007
https://dl.acm.org/doi/10.1093/ietfec/e90-a.12.2800
Hampton MAsanović KEgan GMuraoka Y(2006)Implementing virtual memory in a vector processor with software restart markersProceedings of the 20th annual international conference on Supercomputing10.1145/1183401.1183422(135-144)Online publication date: 28-Jun-2006
https://dl.acm.org/doi/10.1145/1183401.1183422

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents