research-article

SIMD defragmenter: efficient ILP realization on data-parallel architectures

Authors:

Scott Mahlke

ACM SIGPLAN Notices, Volume 47, Issue 4

Pages 363 - 374

https://doi.org/10.1145/2248487.2151014

Published: 03 March 2012

Abstract

Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting high levels of data-level parallelism (DLP). However, vectorization is often much less effective for media applications due to low-trip-count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle due to insufficient DLP. To attack this problem, this paper proposes a new vectorization pass called SIMD Defragmenter to uncover hidden DLP that lurks below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead, which can easily exceed the benefits gained through SIMD execution. The SIMD Defragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. By SIMDizing in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.
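The packing/unpacking argument in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's algorithm: the `Op` type, the `compatible` check, and the `overhead` cost model are all hypothetical names and simplifications introduced here. It assumes that every scalar value entering a vectorized group costs one pack and every value leaving it costs one unpack, and it compares a deliberately naive baseline (each instruction vectorized on its own) against vectorizing a whole chain of instructions as one subgraph.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """One scalar operation in a lane's dataflow chain (hypothetical toy IR)."""
    dst: str
    opcode: str
    srcs: tuple


def compatible(lanes):
    """Toy compatibility test: every lane has the same opcode sequence."""
    first = [op.opcode for op in lanes[0]]
    return all([op.opcode for op in lane] == first for lane in lanes)


def overhead(ops, groups, live_out):
    """Count (packs, unpacks) for a partition of `ops` into vectorized groups.

    Assumed cost model: a source operand not produced inside its group needs
    one pack; a result used outside its group (or live-out) needs one unpack.
    """
    packs = unpacks = 0
    for group in groups:
        produced = {ops[i].dst for i in group}
        for i in group:
            packs += sum(1 for s in ops[i].srcs if s not in produced)
            d = ops[i].dst
            used_outside = d in live_out or any(
                d in ops[j].srcs for j in range(len(ops)) if j not in group
            )
            unpacks += int(used_outside)
    return packs, unpacks


# One lane's chain; the other lanes are assumed to be identical in shape.
chain = [
    Op("t0", "mul", ("x", "y")),
    Op("t1", "add", ("t0", "z")),
    Op("t2", "shl", ("t1", "k")),
]
live_out = {"t2"}

per_instruction = [[0], [1], [2]]  # naive baseline: one group per instruction
whole_subgraph = [[0, 1, 2]]       # the entire chain vectorized in bulk

print(compatible([chain, chain]))
print(overhead(chain, per_instruction, live_out))
print(overhead(chain, whole_subgraph, live_out))
```

Under these assumptions the per-instruction grouping needs 6 packs and 3 unpacks, while the whole-subgraph grouping needs 4 packs and 1 unpack, because intermediate values `t0` and `t1` never leave the vector registers. Real overheads depend on the target's permutation and lane-access instructions, which this model ignores.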

Index Terms

SIMD defragmenter: efficient ILP realization on data-parallel architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Information & Contributors

Information

Published In

ACM SIGPLAN Notices Volume 47, Issue 4

ASPLOS '12

April 2012

453 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2248487

ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
March 2012
476 pages
ISBN:9781450307598
DOI:10.1145/2150976
General Chair:
Tim Harris
Microsoft Research
,
Program Chair:
Michael L. Scott
University of Rochester

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 March 2012

Published in SIGPLAN Volume 47, Issue 4

Citations

Cited By

Zhang Z, Ou Y, Liu Y, Wang C, Zhou Y, Wang X, Zhang Y, Ouyang Y, Shan J, Wang Y, Xue J, Cui H, Feng X, Aamodt T, Jerger N, Swift M (2023). Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (483-497). doi:10.1145/3582016.3582046. Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3582016.3582046
Feng J, He Y, Tao Q, Ma H (2022). An SLP Vectorization Method Based on Equivalent Extended Transformation. Wireless Communications & Mobile Computing 2022. doi:10.1155/2022/1832522. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/1832522
Zhou H, Xue J (2016). A Compiler Approach for Exploiting Partial SIMD Parallelism. ACM Transactions on Architecture and Code Optimization 13:1 (1-26). doi:10.1145/2886101. Online publication date: 28-Mar-2016
https://doi.org/10.1145/2886101
Zhou H, Xue J, Franke B, Wu Y, Rastello F (2016). Exploiting mixed SIMD parallelism by reducing data reorganization overhead. Proceedings of the 2016 International Symposium on Code Generation and Optimization (59-69). doi:10.1145/2854038.2854054. Online publication date: 29-Feb-2016
https://dl.acm.org/doi/10.1145/2854038.2854054
Cebrian J, Balem T, Barredo A, Casas M, Moreto M, Ros A, Jimborean A (2022). Compiler-Assisted Compaction/Restoration of SIMD Instructions. IEEE Transactions on Parallel and Distributed Systems 33:4 (779-791). doi:10.1109/TPDS.2021.3091015. Online publication date: 1-Apr-2022
https://doi.org/10.1109/TPDS.2021.3091015
Son Y, Kang S, Um H, Lee S, Ham J, Kim D, Park Y (2021). A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics 10:23 (2960). doi:10.3390/electronics10232960. Online publication date: 28-Nov-2021
https://doi.org/10.3390/electronics10232960
Porpodas V, Ratnalikar P (2021). PostSLP: Cross-Region Vectorization of Fully or Partially Vectorized Code. Languages and Compilers for Parallel Computing (15-31). doi:10.1007/978-3-030-72789-5_2. Online publication date: 26-Mar-2021
https://doi.org/10.1007/978-3-030-72789-5_2
Rocha R, Porpodas V, Petoumenos P, Góes L, Wang Z, Cole M, Leather H, Pouchet L, Jimborean A (2020). Vectorization-aware loop unrolling with seed forwarding. Proceedings of the 29th International Conference on Compiler Construction (1-13). doi:10.1145/3377555.3377890. Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3377555.3377890
Barredo A, Cebrian J, Moreto M, Casas M, Valero M (2020). Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (717-728). doi:10.1109/HPCA47549.2020.00064. Online publication date: Feb-2020
https://doi.org/10.1109/HPCA47549.2020.00064
Porpodas V, Rocha R, Brevnov E, Góes L, Mattson T, Kandemir M, Jimborean A, Moseley T (2019). Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elements. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (206-216). doi:10.5555/3314872.3314897. Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.5555/3314872.3314897
