SIGPLAN: Vol 31, No 9

Volume 31, Issue 9, Sept. 1996

Publisher: Association for Computing Machinery, New York, NY, United States

ISSN: 0362-1340

EISSN: 1558-1160

The case for a single-chip multiprocessor

Pages 2–11, https://doi.org/10.1145/248209.237140

Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows ...

An evaluation of memory consistency models for shared-memory systems with ILP processors

Pages 12–23, https://doi.org/10.1145/248209.237142

Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP)...

Synchronization and communication in the T3E multiprocessor

Steven L. Scott

Pages 26–36, https://doi.org/10.1145/248209.237144

This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale ...

Evaluation of architectural support for global address-based communication in large-scale parallel machines

Pages 37–48, https://doi.org/10.1145/248209.237147

Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our ...

Whole-program optimization for time and space efficient threads

Pages 50–59, https://doi.org/10.1145/248209.237149

Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication ...

Thread scheduling for cache locality

Pages 60–71, https://doi.org/10.1145/248209.237151

This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache ...

The Rio file cache: surviving operating system crashes

Pages 74–83, https://doi.org/10.1145/248209.237154

One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, ...

Petal: distributed virtual disks

Pages 84–92, https://doi.org/10.1145/248209.237157

The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system ...

A quantitative analysis of loop nest locality

Pages 94–104, https://doi.org/10.1145/248209.237161

This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast ...

The intrinsic bandwidth requirements of ordinary programs

Pages 105–114, https://doi.org/10.1145/248209.237163

While there has been an abundance of recent papers on hardware and software approaches to improving the performance of memory accesses, few papers have addressed the problem from the program's point of view. There is a general notion that certain ...

Multiple-block ahead branch predictors

Pages 116–127, https://doi.org/10.1145/248209.237169

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the ...

Analysis of branch prediction via data compression

Pages 128–137, https://doi.org/10.1145/248209.237171

Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction ...

Value locality and load value prediction

Pages 138–147, https://doi.org/10.1145/248209.237173

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, ...

The structure and performance of interpreters

Pages 150–159, https://doi.org/10.1145/248209.237175

Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are "slow," however, little has been documented about the performance of ...

Adapting to network and client variability via on-demand dynamic distillation

Pages 160–170, https://doi.org/10.1145/248209.237177

The explosive growth of the Internet and the proliferation of smart cellular phones and handheld wireless devices is widening an already large gap between Internet clients. Clients vary in their hardware resources, software sophistication, and quality ...

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory

Pages 174–185, https://doi.org/10.1145/248209.237179

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that ...

An integrated compile-time/run-time software distributed shared memory system

Pages 186–197, https://doi.org/10.1145/248209.237181

On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into ...

Hiding communication latency and coherence overhead in software DSMs

Pages 198–209, https://doi.org/10.1145/248209.237185

In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic ...

SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Pages 210–220, https://doi.org/10.1145/248209.237187

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment ...

Compiler-based prefetching for recursive data structures

Pages 222–233, https://doi.org/10.1145/248209.237190

Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, ...

Exploiting dual data-memory banks in digital signal processors

Pages 234–243, https://doi.org/10.1145/248209.237193

Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs ...

Compiler-directed page coloring for multiprocessors

Pages 244–255, https://doi.org/10.1145/248209.237195

This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This ...

Reducing network latency using subpages in a global memory environment

Pages 258–267, https://doi.org/10.1145/248209.237198

New high-speed networks greatly encourage the use of network memory as a cache for virtual memory and file pages, thereby reducing the need for disk access. Because pages are the fundamental transfer and access units in remote memory systems, page size ...

Improving cache performance with balanced tag and data paths

Pages 268–278, https://doi.org/10.1145/248209.237202

There are two concurrent paths in a typical cache access: one through the data array and the other through the tag array. The path through the data array drives the selected set out of the array. The path through the tag array determines cache hit/...

Operating system support for improving data locality on CC-NUMA compute servers

Pages 279–289, https://doi.org/10.1145/248209.237205

The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote ...
