DOI: 10.1145/223982
ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
ACM 1995 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
ISCA95: International Conference on Computer Architecture, S. Margherita Ligure, Italy, June 22 - 24, 1995
ISBN:
978-0-89791-698-1
Published:
01 July 1995
Sponsors:
SIGARCH, IEEE-CS\TCCA

Article
Free
The MIT Alewife machine: architecture and performance

Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, ...

Article
Free
The EM-X parallel computer: architecture and basic performance

Latency tolerance is essential in achieving high performance on parallel computers for remote function calls and fine-grained remote memory accesses. EM-X supports interprocessor communication on an execution pipeline with small and simple packets. It ...

Article
Free
The SPLASH-2 programs: characterization and methodological considerations

The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the ...

Article
Free
Efficient strategies for software-only protocols in shared-memory multiprocessors

The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important ...

Article
Free
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache ...

Article
Free
Boosting the performance of hybrid snooping cache protocols

Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. In this paper, we analyze ...

Article
Free
S-connect: from networks of workstations to supercomputer performance

S-Connect is a new high speed, scalable interconnect system that has been developed to support networks of workstations to efficiently share computing resources. It uses off-the-shelf CMOS technology to directly drive fiber-optic systems at speeds ...

Article
Free
Destage algorithms for disk arrays with non-volatile caches

In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms ...

Article
Free
Evaluating multi-port frame buffer designs for a mesh-connected multicomputer

Multicomputers can be effectively used for interactive graphics rendering only if there are mechanisms available to rapidly composite and transfer images to an external display device. One method for achieving the necessary bandwidth for this operation ...

Article
Free
Are crossbars really dead?: the case for optical multiprocessor interconnect systems

Crossbar switches are rarely considered for large, scalable multiprocessor interconnect systems because they require O(n²) switching elements, are difficult to control efficiently and are hard to implement once their size becomes too large to fit on one ...

Article
Free
Exploring configurations of functional units in an out-of-order superscalar processor

This study has been carried out in order to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. The trace-driven simulations were performed on the six integer and the fourteen floating-...

Article
Free
Unconstrained speculative execution with predicated state buffering

Speculative execution is execution of instructions before it is known whether these instructions should be executed. Compiler-based speculative execution has the potential to achieve both a high instruction per cycle rate and high clock rate. Pure ...

Article
Free
A comparison of full and partial predicated execution support for ILP processors

One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support ...

Article
Free
Implementation trade-offs in using a restricted data flow architecture in a high performance RISC microprocessor

The implementation of a superscalar, speculative execution SPARC-V9 microprocessor incorporating Restricted Data Flow principles required many design trade-offs. Consideration was given to both performance and cost. Performance is largely a function of ...

Article
Free
Performance evaluation of the PowerPC 620 microarchitecture

The PowerPC 620™ microprocessor is the most recent and performance leading member of the PowerPC™ family. The 64-bit PowerPC 620 microprocessor employs a two-phase branch prediction scheme, dynamic renaming for all the register files, ...

Article
Free
Reducing TLB and memory overhead using online superpage promotion

Modern microprocessors contain small TLBs that maintain a cache of recently used translations. A TLB's coverage is the sum of the number of bytes mapped by each entry. Applications with working sets larger than the TLB coverage will perform poorly due ...

Article
Free
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

While many parallel applications exhibit good spatial locality, other important codes in areas like graph problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently, while the former ...

Article
Free
An efficient, fully adaptive deadlock recovery scheme: DISHA

This paper presents a simple, efficient and cost effective routing strategy that considers deadlock recovery as opposed to prevention. Performance is optimized in the absence of deadlocks by allowing maximum flexibility in routing. Disha supports true ...

Article
Free
Analysis and implementation of hybrid switching

The switching scheme of a point-to-point network determines how packets flow through each node, and is a primary element in determining the network's performance. In this paper, we present and evaluate a new switching scheme called hybrid switching. ...

Article
Free
Configurable flow control mechanisms for fault-tolerant routing

Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in ...

Article
Free
NIFDY: a low overhead, high throughput network interface

In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are received by a processor in the order in which they were sent, even if the underlying network delivers the packets out of ...

Article
Free
Vector multiprocessors with arbitrated memory access

The high latency of memory accesses is one of the factors that contributes most to reducing the performance of current vector supercomputers. The conflicts that can occur in the memory modules plus the collisions in the interconnection network in the case ...

Article
Free
Design of cache memories for multi-threaded dataflow architecture

Cache memories have proven their effectiveness in the von Neumann architecture when localities of reference govern the execution loci of programs. A pure dataflow program, in contrast, contains no locality of reference since the execution sequence is ...

Article
Free
Skewed associativity enhances performance predictability

Performance tuning becomes harder as computer technology advances. One of the factors is the increasing complexity of memory hierarchies. Most modern machines now use at least one level of cache memory. To reduce execution stalls, cache misses must be ...

Article
Free
A comparative analysis of schemes for correlated branch prediction

Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition ...

Article
Free
Next cache line and set prediction

Accurate instruction fetch and branch prediction is increasingly important on today's wide-issue architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of ...

Article
Free
A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D

Programming models based on messaging continue to be important for parallel machines. Messaging costs are strongly influenced by a machine's network interface architecture. We examine the impact of architectural support for ...

Article
Free
Optimizing memory system performance for communication in parallel computers

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of ...

Article
Free
Empirical evaluation of the CRAY-T3D: a compiler perspective

Most recent MPP systems employ a fast microprocessor surrounded by a shell of communication and synchronization logic. The CRAY-T3D provides an elaborate shell to support global-memory access, prefetch, atomic operations, barriers, and block transfers. ...

Article
Free
Optimization of instruction fetch mechanisms for high issue rates

Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the ...

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
Year        Submitted  Accepted  Rate
ISCA '22    400        67        17%
ISCA '19    365        62        17%
ISCA '17    322        54        17%
ISCA '13    288        56        19%
ISCA '12    262        47        18%
ISCA '08    259        37        14%
ISCA '06    234        31        13%
ISCA '05    194        45        23%
ISCA '04    217        31        14%
ISCA '03    184        36        20%
ISCA '02    180        27        15%
ISCA '01    163        24        15%
ISCA '99    135        26        19%
Overall     3,203      543       17%