Volume 31, Issue 9, Sept. 1996
The case for a single-chip multiprocessor

Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows ...

An evaluation of memory consistency models for shared-memory systems with ILP processors

Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP)...

Synchronization and communication in the T3E multiprocessor

This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale ...

Evaluation of architectural support for global address-based communication in large-scale parallel machines

Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our ...

Whole-program optimization for time and space efficient threads

Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication ...

Thread scheduling for cache locality

This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache ...
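The hint-driven ordering described above can be sketched as follows. This is a minimal illustration of the general idea (grouping runnable threads by a creation-time hint so threads touching the same data run back-to-back); the hint scheme and data shapes are assumptions, not the paper's algorithm.

```python
# Sketch: schedule fine-grained threads so that threads sharing a hint
# (e.g. the data block they will touch) execute consecutively.
from collections import defaultdict

def schedule_by_hint(threads):
    """threads: list of (thread_id, hint) pairs.
    Returns an execution order grouping threads with equal hints."""
    groups = defaultdict(list)
    for tid, hint in threads:
        groups[hint].append(tid)
    order = []
    for hint in groups:  # dicts preserve first-seen hint order
        order.extend(groups[hint])
    return order

# Threads 1 and 3 both touch block 'A', so they run back-to-back:
order = schedule_by_hint([(1, 'A'), (2, 'B'), (3, 'A'), (4, 'B')])
assert order == [1, 3, 2, 4]
```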

The Rio file cache: surviving operating system crashes

One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, ...

Petal: distributed virtual disks

The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system ...

A quantitative analysis of loop nest locality

This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast ...

The intrinsic bandwidth requirements of ordinary programs

While there has been an abundance of recent papers on hardware and software approaches to improving the performance of memory accesses, few papers have addressed the problem from the program's point of view. There is a general notion that certain ...

Multiple-block ahead branch predictors

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the ...

Analysis of branch prediction via data compression

Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction ...

Value locality and load value prediction

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, ...
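Value locality can be exploited by a simple last-value predictor: a table, indexed by load PC, that predicts a load will return the same value it returned last time. The sketch below illustrates that general idea only; the table size, indexing, and update policy are assumptions for illustration, not this paper's design.

```python
# Sketch of a last-value load value predictor (illustrative only).
class LastValuePredictor:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # table index -> last value seen for that load

    def _index(self, pc):
        # Direct-mapped indexing by low bits of the load's PC.
        return pc % self.entries

    def predict(self, pc):
        # Predict the value this load produced last time, if any.
        return self.table.get(self._index(pc))

    def update(self, pc, value):
        # Record the value the load actually returned.
        self.table[self._index(pc)] = value

# A load at PC 0x40 that repeatedly returns 7 becomes predictable:
p = LastValuePredictor()
p.update(0x40, 7)
assert p.predict(0x40) == 7
```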

The structure and performance of interpreters

Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are "slow," however, little has been documented about the performance of ...

Adapting to network and client variability via on-demand dynamic distillation

The explosive growth of the Internet and the proliferation of smart cellular phones and handheld wireless devices are widening an already large gap between Internet clients. Clients vary in their hardware resources, software sophistication, and quality ...

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that ...

An integrated compile-time/run-time software distributed shared memory system

On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into ...

Hiding communication latency and coherence overhead in software DSMs

In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic ...

SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment ...

Compiler-based prefetching for recursive data structures

Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, ...

Exploiting dual data-memory banks in digital signal processors

Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs ...

Compiler-directed page coloring for multiprocessors

This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This ...
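Page coloring in general rests on a simple computation: in a physically indexed cache, a page's "color" is the set of cache-index bits contributed by its page-frame number, and two pages of the same color compete for the same cache sets. The sketch below shows that computation; the cache geometry values are assumptions for illustration, not parameters from this paper.

```python
# Sketch of the page-color computation underlying page coloring.
PAGE_SIZE = 4096          # bytes per page (assumed)
CACHE_SIZE = 1 << 20      # 1 MiB physically indexed cache (assumed)
ASSOC = 1                 # direct-mapped for simplicity

# Number of distinct page colors in this cache geometry.
NUM_COLORS = CACHE_SIZE // (ASSOC * PAGE_SIZE)

def page_color(phys_addr):
    """Color = the page-frame bits that select the cache set."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

# Two pages with the same color map to the same cache sets and conflict;
# a compiler-directed allocator would avoid giving them to hot data.
assert page_color(0) == page_color(NUM_COLORS * PAGE_SIZE)
```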

Reducing network latency using subpages in a global memory environment

New high-speed networks greatly encourage the use of network memory as a cache for virtual memory and file pages, thereby reducing the need for disk access. Because pages are the fundamental transfer and access units in remote memory systems, page size ...

Improving cache performance with balanced tag and data paths

There are two concurrent paths in a typical cache access --- one through the data array and the other through the tag array. The path through the data array drives the selected set out of the array. The path through the tag array determines cache hit/...

Operating system support for improving data locality on CC-NUMA compute servers

The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote ...
