Long term parking (LTP): criticality-aware resource allocation in OOO processors

Authors:

Andreas Sembrant,

Trevor Carlson,

Erik Hagersten,

David Black-Schaffer,

Pierre Michaud

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

Pages 334 - 346

https://doi.org/10.1145/2830772.2830815

Published: 05 December 2015

Abstract

Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. In-order allocation wastes them in two ways: it assigns resources to instructions that are not yet ready to execute, and it eagerly assigns resources to instructions that are not on the application's critical path.

This work explores allocating pipeline resources only when they are needed to expose MLP, thereby enabling a processor design with significantly smaller structures without sacrificing performance. First, we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains from delaying their allocation. We then use this information to "park" such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP.

LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64 → 32) and register file (128 → 96) sizes while retaining MLP performance and improving energy efficiency.
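The parking idea can be illustrated with a toy cycle-level model. The sketch below is not the paper's design or its simulator: the `Instr` record, the `park` hint, the `run` function, the unbounded issue width, the unbounded LTP, and the 100-cycle load latency are all assumptions made for illustration. It only shows how keeping non-urgent instructions in a FIFO, rather than in the IQ, can let a small IQ reach more independent long-latency loads.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple = ()      # names of producer instructions
    latency: int = 1      # cycles from issue to result
    park: bool = False    # hypothetical hint: may wait outside the IQ


def run(program, iq_size, use_ltp):
    """Return the cycle in which the last result becomes available.

    Toy assumption: every consumer of a parked instruction is parked too;
    otherwise a full IQ could block the wake-up and the loop would not end.
    """
    pending = deque(program)  # fetched, not yet dispatched (program order)
    ltp = deque()             # parked instructions, woken strictly in FIFO order
    iq = []                   # instructions currently holding an IQ entry
    done_at = {}              # name -> cycle in which its result is available
    cycle = 0

    def ready(instr):
        return all(done_at.get(s, float("inf")) <= cycle for s in instr.srcs)

    while pending or ltp or iq:
        cycle += 1
        # 1. Wake: move ready instructions from the LTP head into the IQ.
        while ltp and len(iq) < iq_size and ready(ltp[0]):
            iq.append(ltp.popleft())
        # 2. Dispatch in program order; stall when the IQ is full.
        while pending:
            if use_ltp and pending[0].park:
                ltp.append(pending.popleft())
            elif len(iq) < iq_size:
                iq.append(pending.popleft())
            else:
                break
        # 3. Issue every ready IQ entry (unbounded width) and free its slot.
        for instr in [i for i in iq if ready(i)]:
            done_at[instr.name] = cycle + instr.latency
            iq.remove(instr)
    return max(done_at.values())


# Four independent long-latency loads, each followed by one dependent add.
program = []
for n in range(4):
    program.append(Instr(f"ld{n}", latency=100))
    program.append(Instr(f"add{n}", srcs=(f"ld{n}",), park=True))

print(run(program, iq_size=2, use_ltp=False))  # 204
print(run(program, iq_size=2, use_ltp=True))   # 103
print(run(program, iq_size=4, use_ltp=False))  # 104
```

In this toy run, the 2-entry IQ without parking fills up with adds that are waiting on outstanding loads, so only two loads overlap at a time. With the adds parked, all four loads issue within the first two cycles, and the 2-entry IQ finishes about as quickly as a 4-entry IQ without parking. The numbers come from the toy model only and say nothing about the results reported in the paper.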

Information

Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

December 2015

787 pages

ISBN: 9781450340342

DOI: 10.1145/2830772

General Chair:
Milos Prvulovic
Georgia Tech

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IEEE Computer Society TC-uARCH
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MICRO-48

Sponsor:

SIGMICRO

MICRO-48: The 48th Annual IEEE/ACM International Symposium on Microarchitecture

December 5 - 9, 2015

Waikiki, Hawaii

Acceptance Rates

MICRO-48 paper acceptance rate: 61 of 283 submissions, 22%

Overall acceptance rate: 484 of 2,242 submissions, 22%

Cited By

Luo Z, Son S, Ratnasamy S, Shenker S, Gavrilovska A, Terry D (2024). Harvesting memory-bound CPU stall cycles in software with MSH. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pp. 57-75. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.5555/3691938.3691942
Mori K, Kosugi S, Yoshida H, Shimada H, Ando H (2024). Localizing the Tag Comparisons in the Wakeup Logic to Reduce Energy Consumption of the Issue Queue. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 493-506. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00044
Matsuo R, Koizumi T, Irie H, Sakai S, Shioya R (2024). TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA. IEEE Computer Architecture Letters 23:2, pp. 175-178. Online publication date: Jul-2024.
https://doi.org/10.1109/LCA.2023.3289317
Zhan H, Wang C, Wang X, Yang C, Liu X, Cheng X (2024). Multi: Reduce Energy Overhead of Criticality-Aware Dynamic Instruction Scheduling for Energy Efficiency. 2024 IEEE 42nd International Conference on Computer Design (ICCD), pp. 60-67. Online publication date: 18-Nov-2024.
https://doi.org/10.1109/ICCD63220.2024.00020
Koizumi T, Shioya R, Sugita S, Amano T, Degawa Y, Kadomoto J, Irie H, Sakai S (2023). Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1-16. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614272
Chen D, Zhang T, Huang Y, Zhu J, Liu Y, Gou P, Feng C, Li B, Wei S, Liu L, Solihin Y, Heinrich M (2023). Orinoco: Ordered Issue and Unordered Commit with Non-Collapsible Queues. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589046
Lee Y, Lee J, Ro W (2023). Performance Analysis of Criticality-Aware Out-of-Order Cores for Exploiting MLP. 2023 International Technical Conference on Circuits/Systems, Computers, and Communications (ITC-CSCC), pp. 1-4. Online publication date: 25-Jun-2023.
https://doi.org/10.1109/ITC-CSCC58803.2023.10212794
Mehta S (2023). Speculative Register Reclamation. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1182-1194. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10071122
Diavastos A, Carlson T (2022). Efficient Instruction Scheduling Using Real-time Load Delay Tracking. ACM Transactions on Computer Systems 40:1-4, pp. 1-21. Online publication date: 24-Nov-2022.
https://dl.acm.org/doi/10.1145/3548681
Kumar R, Alipour M, Black-Schaffer D (2022). Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores. ACM Transactions on Architecture and Code Optimization 19:2, pp. 1-28. Online publication date: 7-Mar-2022.
https://dl.acm.org/doi/10.1145/3506704