Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1109/MICRO.2005.20acmconferencesArticle/Chapter ViewAbstractPublication PagesmicroConference Proceedingsconference-collections
Article

Exploiting Vector Parallelism in Software Pipelined Loops

Published: 12 November 2005 Publication History

Abstract

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vector parallelism in software pipelined loops. The proposed methodology. Our approach results in better resource utilization and allows for software pipelining with shorter initiation intervals. The proposed optimization is applied in the compiler backend, where vectorization decisions are more amenable to cost analysis. This is unique in that traditional vectorization optimizations are usually carried out at the statement level. Although our technique most naturally complements statically scheduled machines, we believe it is applicable to any architecture that tightly integrates support for instruction and data level parallelism. We evaluate our methodology using nine SPEC FP benchmarks. In comparison to software pipelining, our approach achieves a maximum speedup of 1.38x, with average of 1.11x

References

[1]
{1} Codeplay VectorC. http://www.codeplay.com.
[2]
{2} IBM XL C/C++ and Fortran compilers. http://www-306.ibm.com/software/awdtools/xlcpp/.
[3]
{3} Trimaran Research Infrastructure. http://www.trimaran.org.
[4]
{4} VAST-C/AltiVec. http://www.crescentbaysoftware.com.
[5]
{5} A. Aletà, J. M. Codina, J. Sánchez, and A. González. Graph-Partitioning Based Instruction Scheduling for Clustered Processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 150-159, Austin, TX, December 2001.
[6]
{6} R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco, California, 2001.
[7]
{7} A. J. Bik. The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance. Intel Press, Hillsboro, OR, 2004.
[8]
{8} J. M. Codina, J. Sánchez, and A. González. A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors. In Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques , pages 175-184, Barcelona, Spain, September 2001.
[9]
{9} A. Darte. On the complexity of loop fusion. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, pages 149-157, Newport Beach, CA, October 1999.
[10]
{10} D. J. DeVries. A Vectorizing SUIF Compiler: Implementation and Performance. Master's thesis, University of Toronto, June 1997.
[11]
{11} K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales. AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro, 20(2):85-95, March 2000.
[12]
{12} A. E. Eichenberger and E. S. Davidson. Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180-191, Ann Arbor, MI, November 1995.
[13]
{13} A. E. Eichenberger, P. Wu, and K. O'Brien. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the SIGPLAN '04 Conference on Programming Language Design and Implementation, pages 82-93, Washington, DC, June 2004.
[14]
{14} J. Fridman and Z. Greenfield. The TigerSHARC DSP Architecture. IEEE Micro, 20(1):66-76, January 2000.
[15]
{15} R. A. Huff. Lifetime-Sensitive Modulo Scheduling. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 258-267, Albuquerque, NM, June 1993.
[16]
{16} B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49:291-307, February 1970.
[17]
{17} A. Krall and S. Lelait. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming , 28(4):347-361, August 2000.
[18]
{18} A. Kudriavtsev and P. Kogge. Generation of Permutations for SIMD Processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pages 147-156, Chicago, IL, June 2005.
[19]
{19} M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318-328, Atlanta, GA, June 1988.
[20]
{20} S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145-156, Vancouver, BC, June 2000.
[21]
{21} S. Larsen, E. Witchel, and S. Amarasinghe. Increasing and Detecting Memory Address Congruence. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques, pages 18-29, Charlottesville, VA, September 2002.
[22]
{22} D. M. Lavery and W. mei W. Hwu. Modulo Scheduling of Loops in Control-Intensive Non-Numeric Programs. In Proceedings of the 29th Annual International Symposium on Microarchitecture, pages 126-137, Paris, France, December 1996.
[23]
{23} R. Lee. Subword Parallelism with MAX-2. IEEE Micro, 16(4):51-59, August 1996.
[24]
{24} C. McNairy and D. Soltis. Itanium 2 Processor Microarchitecture. IEEE Micro, 23(2):44-55, March 2003.
[25]
{25} D. Naishlos. Autovectorization in GCC. In Proceedings of the 2004 GCC Developers Summit, pages 105-118, 2004.
[26]
{26} D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP Architecture. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 2-11, San Jose, CA, October 2003.
[27]
{27} E. Nystrom and A. E. Eichenberger. Effective Cluster Assignment for Modulo Scheduling. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 103-114, Dallas, TX, December 1998.
[28]
{28} I. Pryanishnikov, A. Krall, and N. Horspool. Pointer Alignment Analysis for Processors with SIMD Instructions. In Proceedings of the 5th Workshop on Media and Streaming Processors, pages 50-57, San Diego, CA, December 2003.
[29]
{29} S. K. Raman, V. Pentkovski, and J. Keshava. Implementing Streaming SIMD Extensions on the Pentium III Processor. IEEE Micro, 20(4):47-57, July 2000.
[30]
{30} B. Rau, M. Lee, P. Tirumalai, and M. Schlansker. Register Allocation for Software Pipelined Loops. In Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San Francisco, CA, June 1992.
[31]
{31} B. R. Rau. Iterative Modulo Scheduling. Technical Report HPL-94-115, Hewlett Packard Company, November 1995.
[32]
{32} B. R. Rau, M. S. Schlansker, and P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops. In Proceedings of the 25th Annual International Symposium on Microarchitecture , pages 158-169, Portland, OR, December 1992.
[33]
{33} J. Shin, M. Hall, and J. Chame. Superword-Level Parallelism in the Presence of Control Flow. In Proceedings of the International Symposium on Code Generation and Optimization , pages 165-175, San Jose, CA, March 2005.
[34]
{34} J. Shin, J. Chame, and M. Hall. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architecture. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques , pages 45-55, Charlottesville, VA, September 2002.
[35]
{35} N. Sreraman and R. Govindarajan. A Vectorizing Compiler for Multimedia Extensions. International Journal of Parallel Programming, 28(4):363-400, August 2000.
[36]
{36} R. E. Tarjan. Depth First Search and Linear Graph Algorithms. SIAM Journal of Computing, 1(2):146-160, June 1972.
[37]
{37} M. Tremblay, M. O'Connor, V. Narayanan, and L. He. VIS Speeds New Media Processing. IEEE Micro, 16(4):10-20, August 1996.
[38]
{38} R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. ACM SIGPLAN Notices, 29(12):31-37, December 1994.
[39]
{39} M. J. Wolfe. High Performance Compilers for Parallel Computing . Addison-Wesley, Redwood City, California, 1996.
[40]
{40} P. Wu, A. E. Eichenberger, and A. Wang. Efficient SIMD Code Generation for Runtime Alignment and Length Conversion. In Proceedings of the International Symposium on Code Generation and Optimization, pages 153-164, San Jose, CA, March 2005.
[41]
{41} P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An Integrated Simdization Framework Using Virtual Vectors. In Proceedings of the 19th ACM International Conference on Supercomputing, pages 169-178, Cambridge, MA, June 2005.
[42]
{42} J. Zalamea, J. Llosa, E. Ayguadé, and M. Valero. Modulo Scheduling with Integrated Register Spilling for Clustered VLIW Architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 160- 169, Austin, TX, December 2001.
[43]
{43} Y. Zhao and K. Kennedy. Scalarization on Short Vector Machines. In IEEE International Symposium on Performance Analysis of Systems and Software, pages 187-196, Austin, TX, March 2005.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
November 2005
350 pages
ISBN:0769524400

Sponsors

Publisher

IEEE Computer Society

United States

Publication History

Published: 12 November 2005

Check for updates

Qualifiers

  • Article

Conference

Micro-38
Sponsor:

Acceptance Rates

MICRO 38 Paper Acceptance Rate 29 of 147 submissions, 20%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2020)Leveraging SIMD parallelism for accelerating network applicationsProceedings of the 4th Asia-Pacific Workshop on Networking10.1145/3411029.3411033(23-29)Online publication date: 3-Aug-2020
  • (2012)Efficient SIMD code generation for irregular kernelsACM SIGPLAN Notices10.1145/2370036.214582447:8(55-64)Online publication date: 25-Feb-2012
  • (2012)Efficient SIMD code generation for irregular kernelsProceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming10.1145/2145816.2145824(55-64)Online publication date: 25-Feb-2012
  • (2010)Efficient Selection of Vector Instructions Using Dynamic ProgrammingProceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2010.38(201-212)Online publication date: 4-Dec-2010
  • (2009)Mapping streaming languages to general purpose processors through vectorizationProceedings of the 22nd international conference on Languages and Compilers for Parallel Computing10.1007/978-3-642-13374-9_7(95-110)Online publication date: 8-Oct-2009
  • (2008)Compiling for vector-thread architecturesProceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization10.1145/1356058.1356085(205-215)Online publication date: 6-Apr-2008
  • (2007)Instruction selection for subword level parallelism optimizations for application specific instruction processorsProceedings of the 5th international conference on Parallel and Distributed Processing and Applications10.5555/2395970.2396061(946-957)Online publication date: 29-Aug-2007
  • (2007)Generation of Pack Instruction Sequence for Media Processors Using Multi-Valued Decision DiagramIEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences10.1093/ietfec/e90-a.12.2800E90-A:12(2800-2809)Online publication date: 1-Dec-2007
  • (2006)Implementing virtual memory in a vector processor with software restart markersProceedings of the 20th annual international conference on Supercomputing10.1145/1183401.1183422(135-144)Online publication date: 28-Jun-2006

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media