- Sponsor:
- SIGARCH
The MIT Alewife machine: architecture and performance
- Anant Agarwal,
- Ricardo Bianchini,
- David Chaiken,
- Kirk L. Johnson,
- David Kranz,
- John Kubiatowicz,
- Beng-Hong Lim,
- Kenneth Mackenzie,
- Donald Yeung
Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, ...
The EM-X parallel computer: architecture and basic performance
Latency tolerance is essential in achieving high performance on parallel computers for remote function calls and fine-grained remote memory accesses. EM-X supports interprocessor communication on an execution pipeline with small and simple packets. It ...
The SPLASH-2 programs: characterization and methodological considerations
The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the ...
Efficient strategies for software-only protocols in shared-memory multiprocessors
The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important ...
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache ...
Boosting the performance of hybrid snooping cache protocols
Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. In this paper, we analyze ...
S-connect: from networks of workstations to supercomputer performance
S-Connect is a new high speed, scalable interconnect system that has been developed to support networks of workstations to efficiently share computing resources. It uses off-the-shelf CMOS technology to directly drive fiber-optic systems at speeds ...
Destage algorithms for disk arrays with non-volatile caches
In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms ...
Evaluating multi-port frame buffer designs for a mesh-connected multicomputer
Multicomputers can be effectively used for interactive graphics rendering only if there are mechanisms available to rapidly composite and transfer images to an external display device. One method for achieving the necessary bandwidth for this operation ...
Are crossbars really dead?: the case for optical multiprocessor interconnect systems
Crossbar switches are rarely considered for large, scalable multiprocessor interconnect systems because they require O(n²) switching elements, are difficult to control efficiently and are hard to implement once their size becomes too large to fit on one ...
Exploring configurations of functional units in an out-of-order superscalar processor
This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-...
Unconstrained speculative execution with predicated state buffering
Speculative execution is execution of instructions before it is known whether these instructions should be executed. Compiler-based speculative execution has the potential to achieve both a high instruction per cycle rate and high clock rate. Pure ...
A comparison of full and partial predicated execution support for ILP processors
One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support ...
Implementation trade-offs in using a restricted data flow architecture in a high performance RISC microprocessor
- M. Simone,
- A. Essen,
- A. Ike,
- A. Krishnamoorthy,
- T. Maruyama,
- N. Patkar,
- M. Ramaswami,
- M. Shebanow,
- V. Thirumalaiswamy,
- D. Tovey
The implementation of a superscalar, speculative execution SPARC-V9 microprocessor incorporating Restricted Data Flow principles required many design trade-offs. Consideration was given to both performance and cost. Performance is largely a function of ...
Performance evaluation of the PowerPC 620 microarchitecture
The PowerPC 620™ microprocessor is the most recent and performance leading member of the PowerPC™ family. The 64-bit PowerPC 620 microprocessor employs a two-phase branch prediction scheme, dynamic renaming for all the register files, ...
Reducing TLB and memory overhead using online superpage promotion
Modern microprocessors contain small TLBs that maintain a cache of recently used translations. A TLB's coverage is the sum of the number of bytes mapped by each entry. Applications with working sets larger than the TLB coverage will perform poorly due ...
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching
While many parallel applications exhibit good spatial locality, other important codes in areas like graph problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently, while the former ...
An efficient, fully adaptive deadlock recovery scheme: DISHA
This paper presents a simple, efficient and cost effective routing strategy that considers deadlock recovery as opposed to prevention. Performance is optimized in the absence of deadlocks by allowing maximum flexibility in routing. Disha supports true ...
Analysis and implementation of hybrid switching
The switching scheme of a point-to-point network determines how packets flow through each node, and is a primary element in determining the network's performance. In this paper, we present and evaluate a new switching scheme called hybrid switching. ...
Configurable flow control mechanisms for fault-tolerant routing
Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in ...
NIFDY: a low overhead, high throughput network interface
In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are received by a processor in the order in which they were sent, even if the underlying network delivers the packets out of ...
Vector multiprocessors with arbitrated memory access
The high latency of memory accesses is one of the factors that most contribute to reduce the performance of current vector supercomputers. The conflicts that can occur in the memory modules plus the collisions in the interconnection network in the case ...
Design of cache memories for multi-threaded dataflow architecture
Cache memories have proven their effectiveness in the von Neumann architecture when localities of reference govern the execution loci of programs. A pure dataflow program, in contrast, contains no locality of reference since the execution sequence is ...
Skewed associativity enhances performance predictability
Performance tuning becomes harder as computer technology advances. One of the factors is the increasing complexity of memory hierarchies. Most modern machines now use at least one level of cache memory. To reduce execution stalls, cache misses must be ...
A comparative analysis of schemes for correlated branch prediction
Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition ...
Next cache line and set prediction
Accurate instruction fetch and branch prediction is increasingly important on today's wide-issue architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of ...
A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D
Programming models based on messaging continue to be an important programming model for parallel machines. Messaging costs are strongly influenced by a machine's network interface architecture. We examine the impact of architectural support for ...
Optimizing memory system performance for communication in parallel computers
Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of ...
Empirical evaluation of the CRAY-T3D: a compiler perspective
Most recent MPP systems employ a fast microprocessor surrounded by a shell of communication and synchronization logic. The CRAY-T3D provides an elaborate shell to support global-memory access, prefetch, atomic operations, barriers, and block transfers. ...
Optimization of instruction fetch mechanisms for high issue rates
Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the ...
Index Terms
- Proceedings of the 22nd annual international symposium on Computer architecture