
SIMD defragmenter: efficient ILP realization on data-parallel architectures

Published: 03 March 2012

Abstract

Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large levels of data-level parallelism (DLP). However, vectorization is often much less effective for media applications due to low trip count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle due to insufficient DLP. To attack this problem, this paper proposes a new vectorization pass called SIMD Defragmenter to uncover hidden DLP that lurks below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead that can easily exceed the benefits gained through SIMD execution. The SIMD Defragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. By SIMDizing in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.
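
The packing argument in the abstract can be illustrated with a toy Python sketch. This is not the paper's algorithm or cost model: SIMD lanes are modeled as plain Python lists, and the names `CHAIN`, `run_scalar`, `run_simd_per_instruction`, and `run_simd_per_subgraph` are hypothetical, as is the assumption that each pack or unpack costs one move per lane. The sketch only shows why keeping values in vector form across a whole subgraph needs fewer pack/unpack moves than SIMDizing one instruction at a time.

```python
# Toy illustration only: lanes are modeled as Python lists, and the move
# counters are hypothetical, not measurements from the paper.

OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

# One "subgraph": a dependent chain of (opcode, constant operand) steps.
# Independent copies of this chain on different inputs represent ILP.
CHAIN = [("mul", 3), ("add", 7), ("sub", 2)]


def run_scalar(inputs):
    """Reference result: run the chain once per input, no SIMD."""
    results = []
    for x in inputs:
        for opcode, const in CHAIN:
            x = OPS[opcode](x, const)
        results.append(x)
    return results


def run_simd_per_instruction(inputs):
    """Pack before and unpack after every single vector instruction."""
    scalars = list(inputs)
    moves = 0
    for opcode, const in CHAIN:
        vec = list(scalars)                         # pack: one move per lane
        moves += len(vec)
        vec = [OPS[opcode](v, const) for v in vec]  # one vector operation
        scalars = list(vec)                         # unpack: one move per lane
        moves += len(scalars)
    return scalars, moves


def run_simd_per_subgraph(inputs):
    """Pack once, stay in vector form for the whole chain, unpack once."""
    vec = list(inputs)                              # pack
    moves = len(vec)
    for opcode, const in CHAIN:
        vec = [OPS[opcode](v, const) for v in vec]
    moves += len(vec)                               # unpack
    return list(vec), moves


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(run_scalar(data))
    print(run_simd_per_instruction(data))
    print(run_simd_per_subgraph(data))
```

Under these assumptions, both SIMD variants produce the same values as the scalar reference, but the per-instruction variant pays for a pack and an unpack at every step of the chain, while the per-subgraph variant pays for them only once at the boundaries.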

Published In

ACM SIGPLAN Notices, Volume 47, Issue 4
ASPLOS '12
April 2012
453 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2248487
  • ACM Conferences
    ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
    March 2012
    476 pages
    ISBN:9781450307598
    DOI:10.1145/2150976
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. SIMD architecture
  2. compiler
  3. optimization

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 19
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 09 Nov 2024

Cited By

  • (2023) Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 483-497. DOI: 10.1145/3582016.3582046. Online publication date: 25-Mar-2023
  • (2022) An SLP Vectorization Method Based on Equivalent Extended Transformation. Wireless Communications & Mobile Computing, 2022. DOI: 10.1155/2022/1832522. Online publication date: 1-Jan-2022
  • (2016) A Compiler Approach for Exploiting Partial SIMD Parallelism. ACM Transactions on Architecture and Code Optimization, 13:1, pages 1-26. DOI: 10.1145/2886101. Online publication date: 28-Mar-2016
  • (2016) Exploiting mixed SIMD parallelism by reducing data reorganization overhead. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 59-69. DOI: 10.1145/2854038.2854054. Online publication date: 29-Feb-2016
  • (2022) Compiler-Assisted Compaction/Restoration of SIMD Instructions. IEEE Transactions on Parallel and Distributed Systems, 33:4, pages 779-791. DOI: 10.1109/TPDS.2021.3091015. Online publication date: 1-Apr-2022
  • (2021) A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics, 10:23, article 2960. DOI: 10.3390/electronics10232960. Online publication date: 28-Nov-2021
  • (2021) PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. Languages and Compilers for Parallel Computing, pages 15-31. DOI: 10.1007/978-3-030-72789-5_2. Online publication date: 26-Mar-2021
  • (2020) Vectorization-aware loop unrolling with seed forwarding. Proceedings of the 29th International Conference on Compiler Construction, pages 1-13. DOI: 10.1145/3377555.3377890. Online publication date: 22-Feb-2020
  • (2020) Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 717-728. DOI: 10.1109/HPCA47549.2020.00064. Online publication date: Feb-2020
  • (2019) Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elements. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, pages 206-216. DOI: 10.5555/3314872.3314897. Online publication date: 16-Feb-2019
