Message from the General Chair
Message from the Program Chair
Committees
Reviewers
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation
- Dan Ernst
- Nam Sung Kim
- Shidhartha Das
- Sanjay Pant
- Rajeev Rao
- Toan Pham
- Conrad Ziesler
- David Blaauw
- Todd Austin
- Krisztian Flautner
- Trevor Mudge
With increasing clock frequencies and silicon integration, power aware computing has become a critical concern in the design of embedded processors and systems-on-chip. One of the more effective and widely used methods for power-aware computing is dynamic ...
VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power
Energy-efficient processor design is becoming more and more important with technology scaling and with high performance requirements. Supply-voltage scaling is an efficient way to reduce energy by lowering the operating voltage and the clock frequency of ...
A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor
Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transient faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make ...
TLC: Transmission Line Caches
It is widely accepted that the disproportionate scaling of transistor and conventional on-chip interconnect performance presents a major barrier to future high performance systems. Previous research has focused on wire-centric designs that use parallelism, ...
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Wire delays continue to grow as the dominant component of latency for large caches. A recent work proposed an adaptive, non-uniform cache architecture (NUCA) to manage large, on-chip caches. By exploiting the variation in access time across widely-spaced ...
Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches
High-performance caches statically pull up the bit-lines in all cache subarrays to optimize cache access latency. Unfortunately, such an architecture results in a significant waste of energy in nanoscale CMOS implementations due to high leakage and bitline ...
Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during ...
Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
With power dissipation becoming an increasingly vexing problem across many classes of computer systems, measuring power dissipation of real, running systems has become crucial for hardware and software system research and design. Live power measurements are ...
Power-driven Design of Router Microarchitectures in On-chip Networks
As demand for bandwidth increases in systems-on-a-chip and chip multiprocessors, networks are fast replacing buses and dedicated wires as the pervasive interconnect fabric for on-chip communication. The tight delay requirements faced by on-chip networks ...
Optimum Power/Performance Pipeline Depth
The impact of pipeline length on both the power and performance of a microprocessor is explored both theoretically and by simulation. A theory is presented for a wide range of power/performance metrics, BIPS^m/W. The theory shows that the more important ...
Processor Acceleration Through Automated Instruction Set Customization
Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors), and ...
The Reconfigurable Streaming Vector Processor (RSVP™)
The need to process multimedia data places large computational demands on portable/embedded devices. These multimedia functions share common characteristics: they are computationally intensive and data-streaming, performing the same operation(s) on many data ...
Scaling and Characterizing Database Workloads: Bridging the Gap between Research and Practice
On-line Transaction Processing (OLTP) workloads are crucial benchmarks for the design and analysis of server processors. Typical cached configurations used by researchers to simulate OLTP workloads are orders of magnitude smaller than the fully scaled ...
Generational Cache Management of Code Traces in Dynamic Optimization Systems
A dynamic optimizer is a runtime software system that groups a program's instruction sequences into traces, optimizes those traces, stores the optimized traces in a software-based code cache, and then executes the optimized code in the code cache. To ...
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System
Traditional software controlled data cache prefetching is often ineffective due to the lack of runtime cache miss and miss address information. To overcome this limitation, we implement runtime data cache prefetching in the dynamic optimization system ADORE ...
IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems
IA-32 Execution Layer (IA-32 EL) is a new technology that executes IA-32 applications on Intel® Itanium® processor family systems. Currently, support for IA-32 applications on Itanium-based platforms is achieved using hardware circuitry on the Itanium ...
LLVA: A Low-level Virtual Instruction Set Architecture
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction ...
Comparing Program Phase Detection Techniques
Detecting program phase changes accurately is an important aspect of dynamically adaptable systems. Three dynamic program phase detection techniques are compared - using instruction working sets, basic block vectors (BBV), and conditional branch counts. ...
Using Interaction Costs for Microarchitectural Bottleneck Analysis
Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other. This parallelism makes it difficult to both (a) assign a cost to an event (e.g., to one of two overlapping cache misses) and (b) assign ...
Fast Path-Based Neural Branch Prediction
Microarchitectural prediction based on neural learning has received increasing attention in recent years. However, neural prediction remains impractical because its superior accuracy over conventional predictors is not enough to offset the cost imposed by ...
Hardware Support for Control Transfers in Code Caches
Many dynamic optimization and/or binary translation systems hold optimized/translated superblocks in a code cache. Conventional code caching systems suffer from overheads when control is transferred from one cached superblock to another, especially via ...
Exploiting Value Locality in Physical Register Files
The physical register file is an important component of a dynamically-scheduled processor. Increasing the amount of parallelism places increasing demands on the physical register file, calling for alternative file organization and management ...
Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Ensuring back-to-back execution of dependent instructions in a conventional out-of-order processor requires scheduling logic that wakes up and selects instructions at the same rate as they are executed. To sustain high performance, integer ALU instructions ...
WaveScalar
Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge. Ever increasing wire-delay relative to switching ...
Index Terms
- Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture