tutorial

Architectural Specialization for Inter-Iteration Loop Dependence Patterns

Authors:

Shreesha Srinath,

Christopher Batten

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 583 - 595

https://doi.org/10.1109/MICRO.2014.31

Published: 13 December 2014

Abstract

Hardware specialization is an increasingly common technique to enable improved performance and energy efficiency in spite of the diminished benefits of technology scaling. This paper proposes a new approach called explicit loop specialization (XLOOPS) based on the idea of elegantly encoding inter-iteration loop dependence patterns in the instruction set. XLOOPS supports a variety of inter-iteration data- and control-dependence patterns for both single and nested loops. The XLOOPS hardware/software abstraction requires only lightweight changes to a general-purpose compiler to generate XLOOPS binaries and enables executing these binaries on: (1) traditional microarchitectures with minimal performance impact, (2) specialized microarchitectures to improve performance and/or energy efficiency, and (3) adaptive microarchitectures that can seamlessly migrate loops between traditional and specialized execution to dynamically trade off performance vs. energy efficiency. We evaluate XLOOPS using a vertically integrated research methodology and show compelling performance and energy efficiency improvements compared to both simple and complex general-purpose processors.
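
The abstract does not enumerate the dependence patterns it refers to, so the following C sketch is only an illustration of what "inter-iteration data- and control-dependence patterns" can look like in ordinary loops. The four function names and the grouping into these four cases are assumptions made for this example; they are not the XLOOPS instruction encoding or the authors' taxonomy.

```c
#include <assert.h>
#include <stddef.h>

/* No inter-iteration dependence: each iteration reads and writes only its
 * own element, so iterations could run in any order or concurrently. */
void scale_elements(const int *src, int *dst, size_t n, int k) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Data dependence carried through a scalar (typically a register):
 * iteration i needs the value of `sum` produced by iteration i-1. */
int running_sum(const int *src, int *dst, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += src[i];
        dst[i] = sum;
    }
    return sum;
}

/* Data dependence carried through memory: iteration i reads a[i-1],
 * which the previous iteration has just written. */
void smooth_in_place(int *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = (a[i] + a[i - 1]) / 2;
}

/* Control dependence: whether iteration i+1 executes at all depends on
 * the comparison performed in iteration i. Returns n if key is absent. */
size_t find_first(const int *src, size_t n, int key) {
    size_t i;
    for (i = 0; i < n; i++)
        if (src[i] == key)
            break;
    return i;
}
```

A conventional compiler must discover or conservatively assume these relationships through dependence analysis; the approach described in the abstract is instead to state the pattern explicitly in the instruction set so that hardware can exploit it.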

Index Terms

Architectural Specialization for Inter-Iteration Loop Dependence Patterns
1. Hardware
  1. Integrated circuits
  2. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific instruction set processors
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        1. Control structures

Information & Contributors

Information

Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 2014

697 pages

ISBN:9781479969982

General Chair: Krisztian Flautner (ARM)

Program Chairs: Thomas F. Wenisch (University of Michigan), Emre Ozer (ARM)

Publications Chair: Michael Ferdman (Stony Brook University)

Publisher

IEEE Computer Society

United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

MICRO-47

Sponsor:

SIGMICRO

MICRO-47: The 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 13 - 17, 2014

Cambridge, United Kingdom

Acceptance Rates

MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 17
Total Downloads: 124
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Vilim M, Rucker A, Olukotun K, Martínez J, Duato J, John L (2021) Aurochs. Proceedings of the 48th Annual International Symposium on Computer Architecture, 10.1109/ISCA52012.2021.00039 (402-415). Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00039
Nematollahi N, Sadrosadati M, Falahati H, Barkhordar M, Drumond M, Sarbazi-Azad H, Falsafi B (2020) Efficient Nearest-Neighbor Data Sharing in GPUs. ACM Transactions on Architecture and Code Optimization, 10.1145/3429981, 18:1 (1-26). Online publication date: 30-Dec-2020
https://dl.acm.org/doi/10.1145/3429981
Margerm S, Sharifian A, Guha A, Shriraman A, Pokam G, Oskin M, Inoue K (2018) TAPAS. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 10.1109/MICRO.2018.00028 (245-257). Online publication date: 20-Oct-2018
https://dl.acm.org/doi/10.1109/MICRO.2018.00028
Voitsechov D, Port O, Etsion Y, Oskin M, Inoue K (2018) Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 10.1109/MICRO.2018.00013 (42-54). Online publication date: 20-Oct-2018
https://dl.acm.org/doi/10.1109/MICRO.2018.00013
Nowatzki T, Gangadhar V, Ardalani N, Sankaralingam K (2017) Stream-Dataflow Acceleration. ACM SIGARCH Computer Architecture News, 10.1145/3140659.3080255, 45:2 (416-429). Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3140659.3080255
Kim J, Jiang S, Torng C, Wang M, Srinath S, Ilbeyi B, Al-Hawaj K, Batten C, Hunter H, Moreno J, Emer J, Sanchez D (2017) Using intra-core loop-task accelerators to improve the productivity and performance of task-based parallel programs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 10.1145/3123939.3136952 (759-773). Online publication date: 14-Oct-2017
https://dl.acm.org/doi/10.1145/3123939.3136952
Nowatzki T, Gangadhar V, Ardalani N, Sankaralingam K (2017) Stream-Dataflow Acceleration. Proceedings of the 44th Annual International Symposium on Computer Architecture, 10.1145/3079856.3080255 (416-429). Online publication date: 24-Jun-2017
https://dl.acm.org/doi/10.1145/3079856.3080255
Stanic M, Palomar O, Hayes T, Ratkovic I, Cristal A, Unsal O, Valero M (2017) An Integrated Vector-Scalar Design on an In-Order ARM Core. ACM Transactions on Architecture and Code Optimization, 10.1145/3075618, 14:2 (1-26). Online publication date: 26-May-2017
https://dl.acm.org/doi/10.1145/3075618
Jain A, Laurenzano M, Tang L, Mars J, Hsu W, Yang C, Lipasti M, Lee H (2016) Continuous shape shifting. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 10.5555/3195638.3195666 (1-12). Online publication date: 15-Oct-2016
https://dl.acm.org/doi/10.5555/3195638.3195666
Nowatzki T, Sankaralingam K (2016) Analyzing Behavior Specialized Acceleration. ACM SIGARCH Computer Architecture News, 10.1145/2980024.2872412, 44:2 (697-711). Online publication date: 25-Mar-2016
https://dl.acm.org/doi/10.1145/2980024.2872412
