The case for a single-chip multiprocessor
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows ...
An evaluation of memory consistency models for shared-memory systems with ILP processors
Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP)...
Synchronization and communication in the T3E multiprocessor
This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale ...
Evaluation of architectural support for global address-based communication in large-scale parallel machines
- Arvind Krishnamurthy,
- Klaus E. Schauser,
- Chris J. Scheiman,
- Randolph Y. Wang,
- David E. Culler,
- Katherine Yelick
Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our ...
Whole-program optimization for time and space efficient threads
Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication ...
Thread scheduling for cache locality
This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache ...
The Rio file cache: surviving operating system crashes
One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, ...
Petal: distributed virtual disks
The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system ...
A quantitative analysis of loop nest locality
This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast ...
The intrinsic bandwidth requirements of ordinary programs
While there has been an abundance of recent papers on hardware and software approaches to improving the performance of memory accesses, few papers have addressed the problem from the program's point of view. There is a general notion that certain ...
Multiple-block ahead branch predictors
A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the ...
Analysis of branch prediction via data compression
Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction ...
Value locality and load value prediction
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, ...
The structure and performance of interpreters
- Theodore H. Romer,
- Dennis Lee,
- Geoffrey M. Voelker,
- Alec Wolman,
- Wayne A. Wong,
- Jean-Loup Baer,
- Brian N. Bershad,
- Henry M. Levy
Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are "slow," however, little has been documented about the performance of ...
Adapting to network and client variability via on-demand dynamic distillation
The explosive growth of the Internet and the proliferation of smart cellular phones and handheld wireless devices is widening an already large gap between Internet clients. Clients vary in their hardware resources, software sophistication, and quality ...
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that ...
An integrated compile-time/run-time software distributed shared memory system
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into ...
Hiding communication latency and coherence overhead in software DSMs
In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic ...
SoftFLASH: analyzing the performance of clustered distributed virtual shared memory
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment ...
Compiler-based prefetching for recursive data structures
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, ...
Exploiting dual data-memory banks in digital signal processors
Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs ...
Compiler-directed page coloring for multiprocessors
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This ...
Reducing network latency using subpages in a global memory environment
- Hervé A. Jamrozik,
- Michael J. Feeley,
- Geoffrey M. Voelker,
- James Evans,
- Anna R. Karlin,
- Henry M. Levy,
- Mary K. Vernon
New high-speed networks greatly encourage the use of network memory as a cache for virtual memory and file pages, thereby reducing the need for disk access. Because pages are the fundamental transfer and access units in remote memory systems, page size ...
Improving cache performance with balanced tag and data paths
There are two concurrent paths in a typical cache access --- one through the data array and the other through the tag array. The path through the data array drives the selected set out of the array. The path through the tag array determines cache hit/...
Operating system support for improving data locality on CC-NUMA compute servers
The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote ...