Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

The Cell Processor Architecture

This talk will present the Cell processor, jointly developed by the STI (Sony-Toshiba-IBM) partnership. Cell is a non-homogeneous chip multiprocessor intended for general-purpose applications but with a particular emphasis on multimedia performance. The ...

Article

How to Fake 1000 Registers

Pages 7–18, https://doi.org/10.1109/MICRO.2005.21

Large numbers of logical registers can improve performance by allowing fast access to multiple subroutine contexts (register windows) and multiple thread contexts (multithreading). Support for both of these together requires a multiplicative number of ...

Article

Reducing Instruction Fetch Cost by Packing Instructions into Register Windows

Pages 19–29, https://doi.org/10.1109/MICRO.2005.27

Instruction packing is a combination compiler/architectural approach that allows for decreased code size, reduced power consumption and improved performance. The packing is obtained by placing frequently occurring instructions into an Instruction ...

Article

Efficient Use of Invisible Registers in Thumb Code

Pages 30–42, https://doi.org/10.1109/MICRO.2005.19

The ARM processor is a dual-width ISA processor that provides a 16-bit Thumb instruction set in addition to the 32-bit ARM instruction set. The compromises made in designing the Thumb instruction set lead to significantly increased instruction counts. ...

Article

Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution

Pages 43–54, https://doi.org/10.1109/MICRO.2005.38

Predicated execution has been used to reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and additional data dependencies due to predicated execution sometimes offset the ...

Article

A Criticality Analysis of Clustering in Superscalar Processors

Pages 55–66, https://doi.org/10.1109/MICRO.2005.6

Clustered machines partition hardware resources to circumvent the cycle time penalties incurred by large, monolithic structures. This partitioning introduces a long inter-cluster forwarding latency and the potential for load imbalance, both of which ...

Article

Incremental Commit Groups for Non-Atomic Trace Processing

Pages 67–80, https://doi.org/10.1109/MICRO.2005.23

We introduce techniques to support efficient non-atomic execution of very long traces on a new binary translation based, x86-64 compatible VLIW microprocessor. Incrementally committed long traces significantly reduce wasted computations on exception ...

Article

Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities

Pages 81–92, https://doi.org/10.1109/MICRO.2005.26

We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limits of ...

Article

Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor

Pages 93–104, https://doi.org/10.1109/MICRO.2005.18

Data prefetching via helper threading has been extensively investigated on Simultaneous Multi-Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most ...

Article

Automatic Thread Extraction with Decoupled Software Pipelining

Pages 105–118, https://doi.org/10.1109/MICRO.2005.13

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have ...

Article

Exploiting Vector Parallelism in Software Pipelined Loops

Pages 119–129, https://doi.org/10.1109/MICRO.2005.20

An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, ...

Article

Continuous Path and Edge Profiling

Pages 130–140, https://doi.org/10.1109/MICRO.2005.16

Microarchitectures increasingly rely on dynamic optimization to improve performance in ways that are difficult or impossible for ahead-of-time compilers. Dynamic optimizers in turn require continuous, portable, low cost, and accurate control-flow ...

Article

Improving Region Selection in Dynamic Optimization Systems

Pages 141–154, https://doi.org/10.1109/MICRO.2005.22

The performance of a dynamic optimization system depends heavily on the code it selects to optimize. Many current systems follow the design of HP Dynamo and select a single interprocedural path, or trace, as the unit of code optimization and code ...

Article

The Future Evolution of High-Performance Microprocessors

Norm Jouppi

Page 155, https://doi.org/10.1109/MICRO.2005.34

The evolution of high-performance microprocessors has reached several significant inflection points. First, the marginal utility of additional single-core complexity is now rapidly diminishing due to a number of factors. The increase in instructions per ...

Article

Scalable Store-Load Forwarding via Store Queue Index Prediction

Pages 159–170, https://doi.org/10.1109/MICRO.2005.29

Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ ...

Article

Address-Indexed Memory Disambiguation and Store-to-Load Forwarding

Pages 171–182, https://doi.org/10.1109/MICRO.2005.10

This paper describes a scalable, low-complexity alternative to the conventional load/store queue (LSQ) for superscalar processors that execute load and store instructions speculatively and out-of-order prior to resolving their dependences. Whereas the ...

Article

Store Memory-Level Parallelism Optimizations for Commercial Applications

Pages 183–196, https://doi.org/10.1109/MICRO.2005.31

This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. ...

Article

A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

Pages 197–208, https://doi.org/10.1109/MICRO.2005.8

We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field ...

Article

μComplexity: Estimating Processor Design Effort

Pages 209–218, https://doi.org/10.1109/MICRO.2005.37

Microprocessor design complexity is growing rapidly. As a result, current development costs for top of the line processors are staggering, and are doubling every 4 years. As we design ever larger and more complex processors, it is becoming increasingly ...

Article

Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Pages 219–232, https://doi.org/10.1109/MICRO.2005.17

Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various ...

Article

Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns

Pages 233–244, https://doi.org/10.1109/MICRO.2005.11

While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel technique, address-value delta (AVD) ...

Article

Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors

Pages 245–256, https://doi.org/10.1109/MICRO.2005.15

Checkpointed Early Resource Recycling (Cherry) is a recently-proposed micro-architectural technique that aims at improving critical resource utilization by performing aggressive resource recycling decoupled from instruction retirement, using a ...

Article

ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing

Pages 257–270, https://doi.org/10.1109/MICRO.2005.28

As more data value speculation mechanisms are being proposed to speed up processors, there is growing pressure on the critical processor structures that must buffer the state of the speculative instructions. A scalable solution is to checkpoint the ...

Article

A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance

Pages 271–282, https://doi.org/10.1109/MICRO.2005.7

Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time interrupts, or static-compiler techniques. However, ...

Article

Thermal Management of On-Chip Caches Through Power Density Minimization

Pages 283–293, https://doi.org/10.1109/MICRO.2005.36

Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. However, these techniques mostly ignore the effects of temperature on the power consumption. In this paper, first we show that these power ...

Article

Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines

Pages 294–304, https://doi.org/10.1109/MICRO.2005.14

Power density is a growing problem in high-performance processors in which small, high-activity resources overheat. Two categories of techniques, temporal and spatial, can address power density in a processor. Temporal solutions slow computation and ...

Acceptance Rates

MICRO 38 paper acceptance rate: 29 of 147 submissions, 20%

Overall acceptance rate: 484 of 2,242 submissions, 22%

Year	Submitted	Accepted	Rate
MICRO-48	283	61	22%
MICRO-47	279	53	19%
MICRO-46	239	39	16%
MICRO 41	210	40	19%
MICRO 40	166	35	21%
MICRO 39	174	42	24%
MICRO 38	147	29	20%
MICRO 37	158	29	18%
MICRO 36	134	35	26%
MICRO 33	110	31	28%
MICRO 32	131	27	21%
MICRO 31	108	28	26%
MICRO 30	103	35	34%
Overall	2,242	484	22%
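
The figures in the table can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming each rate is simply accepted divided by submitted, rounded to a whole percent, and that the Overall row is the sum of the listed years; the names ROWS and rate_percent are illustrative and not part of the source page.

```python
# Rows copied from the acceptance-rate table: (venue, submitted, accepted).
ROWS = [
    ("MICRO-48", 283, 61),
    ("MICRO-47", 279, 53),
    ("MICRO-46", 239, 39),
    ("MICRO 41", 210, 40),
    ("MICRO 40", 166, 35),
    ("MICRO 39", 174, 42),
    ("MICRO 38", 147, 29),
    ("MICRO 37", 158, 29),
    ("MICRO 36", 134, 35),
    ("MICRO 33", 110, 31),
    ("MICRO 32", 131, 27),
    ("MICRO 31", 108, 28),
    ("MICRO 30", 103, 35),
]


def rate_percent(submitted, accepted):
    """Acceptance rate as a whole-number percentage (assumed rounding rule)."""
    return round(100 * accepted / submitted)


# Assumption: the Overall row aggregates exactly the years listed above.
total_submitted = sum(s for _, s, _ in ROWS)
total_accepted = sum(a for _, _, a in ROWS)
overall = rate_percent(total_submitted, total_accepted)

for venue, s, a in ROWS:
    print(f"{venue}\t{s}\t{a}\t{rate_percent(s, a)}%")
print(f"Overall\t{total_submitted:,}\t{total_accepted}\t{overall}%")
```

Under these assumptions the recomputed totals and per-year rates agree with the table as printed.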

MICRO

Sections

38th Annual IEEE/ACM International Symposium on Microarchitecture - Title Page

38th Annual IEEE/ACM International Symposium on Microarchitecture - Copyright

Message from the General Chairs

Message from the Program Co-Chairs

The Cell Processor Architecture
