No abstract available.
The Cell Processor Architecture
This talk will present the Cell processor, jointly developed by the STI (Sony-Toshiba-IBM) partnership. Cell is a non-homogeneous chip multiprocessor intended for general-purpose applications but with a particular emphasis on multimedia performance. The ...
How to Fake 1000 Registers
Large numbers of logical registers can improve performance by allowing fast access to multiple subroutine contexts (register windows) and multiple thread contexts (multithreading). Support for both of these together requires a multiplicative number of ...
Reducing Instruction Fetch Cost by Packing Instructions into Register Windows
Instruction packing is a combination compiler/architectural approach that allows for decreased code size, reduced power consumption and improved performance. The packing is obtained by placing frequently occurring instructions into an Instruction ...
Efficient Use of Invisible Registers in Thumb Code
The ARM processor is a dual-width ISA processor that provides a 16-bit Thumb instruction set in addition to the 32-bit ARM instruction set. The compromises made in designing the Thumb instruction set lead to significantly increased instruction counts. ...
Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution
Predicated execution has been used to reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and additional data dependencies due to predicated execution sometimes offset the ...
A Criticality Analysis of Clustering in Superscalar Processors
Clustered machines partition hardware resources to circumvent the cycle time penalties incurred by large, monolithic structures. This partitioning introduces a long inter-cluster forwarding latency and the potential for load imbalance, both of which ...
Incremental Commit Groups for Non-Atomic Trace Processing
We introduce techniques to support efficient non-atomic execution of very long traces on a new binary translation based, x86-64 compatible VLIW microprocessor. Incrementally committed long traces significantly reduce wasted computations on exception ...
Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities
We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limits of ...
Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor
Data prefetching via helper threading has been extensively investigated on Simultaneous Multi-Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most ...
Automatic Thread Extraction with Decoupled Software Pipelining
Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have ...
Exploiting Vector Parallelism in Software Pipelined Loops
An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, ...
Continuous Path and Edge Profiling
Microarchitectures increasingly rely on dynamic optimization to improve performance in ways that are difficult or impossible for ahead-of-time compilers. Dynamic optimizers in turn require continuous, portable, low cost, and accurate control-flow ...
Improving Region Selection in Dynamic Optimization Systems
The performance of a dynamic optimization system depends heavily on the code it selects to optimize. Many current systems follow the design of HP Dynamo and select a single interprocedural path, or trace, as the unit of code optimization and code ...
The Future Evolution of High-Performance Microprocessors
The evolution of high-performance microprocessors has reached several significant inflection points. First, the marginal utility of additional single-core complexity is now rapidly diminishing due to a number of factors. The increase in instructions per ...
Scalable Store-Load Forwarding via Store Queue Index Prediction
Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ ...
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding
This paper describes a scalable, low-complexity alternative to the conventional load/store queue (LSQ) for superscalar processors that execute load and store instructions speculatively and out-of-order prior to resolving their dependences. Whereas the ...
Store Memory-Level Parallelism Optimizations for Commercial Applications
This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. ...
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field ...
uComplexity: Estimating Processor Design Effort
Microprocessor design complexity is growing rapidly. As a result, current development costs for top-of-the-line processors are staggering, and are doubling every 4 years. As we design ever larger and more complex processors, it is becoming increasingly ...
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various ...
Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns
While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel technique, address-value delta (AVD) ...
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors
Checkpointed Early Resource Recycling (Cherry) is a recently-proposed micro-architectural technique that aims at improving critical resource utilization by performing aggressive resource recycling decoupled from instruction retirement, using a ...
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
As more data value speculation mechanisms are being proposed to speed up processors, there is growing pressure on the critical processor structures that must buffer the state of the speculative instructions. A scalable solution is to checkpoint the ...
A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance
- Qiang Wu,
- Margaret Martonosi,
- Douglas W. Clark,
- V. J. Reddi,
- Dan Connors,
- Youfeng Wu,
- Jin Lee,
- David Brooks
Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time interrupts, or static-compiler techniques. However, ...
Thermal Management of On-Chip Caches Through Power Density Minimization
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. However, these techniques mostly ignore the effects of temperature on the power consumption. In this paper, first we show that these power ...
Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines
Power density is a growing problem in high-performance processors in which small, high-activity resources overheat. Two categories of techniques, temporal and spatial, can address power density in a processor. Temporal solutions slow computation and ...