- Sponsor: SIGARCH
Execution-based prediction using speculative slices
A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be ...
Speculative precomputation: long-range prefetching of delinquent loads
- Jamison D. Collins,
- Hong Wang,
- Dean M. Tullsen,
- Christopher Hughes,
- Yong-Fong Lee,
- Dan Lavery,
- John P. Shen
This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future ...
Dynamically allocating processor resources between nearby and distant ILP
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP ...
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors
Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures ...
Data prefetching by dependence graph precomputation
Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies. But current applications with irregular ...
Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?
Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and ...
Focusing processor policies via critical-path prediction
Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction was equally costly. Instruction cost can be naturally expressed through the critical path: if we could ...
Automated design of finite state machine predictors for customized processors
Customized processors use compiler analysis and design automation techniques to take a generalized architectural model and create a specific instance of it which is optimized to a given application or set of applications. These processors offer the ...
Better exploration of region-level value locality with integrated computation reuse and value prediction
Computation-reuse and value-prediction are two recent techniques for improving microprocessor performance by exploiting value localities. They both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a ...
CryptoManiac: a fast flexible architecture for secure communication
The growth of the Internet as a vehicle for secure communication and electronic commerce has brought cryptographic processing performance to the forefront of high throughput system design. This trend will be further underscored with the widespread ...
QoS provisioning in clusters: an investigation of Router and NIC design
Design of high performance cluster networks (routers) with Quality-of-Service (QoS) guarantees is becoming increasingly important to support a variety of multimedia applications, many of which have real-time constraints. Most commercial routers, which ...
Locality vs. criticality
Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a ...
Dead-block prediction & dead-block correlating prefetchers
Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an L1 data ...
Code layout optimizations for transaction processing workloads
- Alex Ramirez,
- Luiz André Barroso,
- Kourosh Gharachorloo,
- Robert Cohn,
- Josep Larriba-Pey,
- P. Geoffrey Lowney,
- Mateo Valero
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for ...
Exploring and exploiting wire-level pipelining in emerging technologies
Pipelining is a technique that has long since been considered fundamental by computer architects. However, the world of nanoelectronics is pushing the idea of pipelining to new and lower levels — particularly the device level. How this affects circuits ...
NanoFabrics: spatial computing using molecular electronics
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A ...
A simple method for extracting models for protocol code
The use of model checking for validation requires that models of the underlying system be created. Creating such models is both difficult and error prone and as a result, verification is rarely used despite its advantages. In this paper, we present a ...
Removing architectural bottlenecks to the scalability of speculative parallelization
Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far ...
Power and energy reduction via pipeline balancing
Minimizing power dissipation is an important design requirement for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power. The technique, known as ...
Energy-effective issue logic
The issue logic of a dynamically-scheduled superscalar processor is a complex mechanism devoted to starting the execution of multiple instructions every cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed by ...
Cache decay: exploiting generational behavior to reduce cache leakage power
Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also ...
Variability in the execution of multimedia applications and implications for architecture
Multimedia applications are an increasingly important workload for general-purpose processors. This paper analyzes frame-level execution time variability for several multimedia applications on general-purpose architectures. There are two reasons for ...
Measuring experimental error in microprocessor simulation
We measure the experimental error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation-based studies. We describe the methodology that we used to validate ...
Rapid profiling via stratified sampling
Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. A profiling scheme that achieves these goals is proposed. Conceptually, the ...
Index Terms
- Proceedings of the 28th annual international symposium on Computer architecture