A lightweight virtual machine monitor for Blue Gene/P
In this paper, we present a lightweight, micro-kernel-based virtual machine monitor (VMM) for the Blue Gene/P supercomputer. Our VMM comprises a small µ-kernel with virtualization capabilities and, atop, a user-level VMM component that manages virtual ...
OpenMP task scheduling strategies for multicore NUMA systems
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling ...
Virtual-machine-based emulation of future generation high-performance computing systems
- Patrick G Bridges
- Dorian Arnold
- Kevin T Pedretti
- Madhav Suresh
- Feng Lu
- Peter Dinda
- Russ Joseph
- Jack Lange
This paper describes the design of a system to enable research, development, and testing of new software stacks and hardware features for future high-end computing systems. Motivating uses include both small-scale research and development on simulated ...
Linux kernel co-scheduling and bulk synchronous parallelism
This paper describes a kernel scheduling algorithm that is based on co-scheduling principles and that is intended for parallel applications running on 1000 cores or more. Experimental results for a Linux implementation on a Cray XT5 machine are ...
Large-scale fast Fourier transform on a heterogeneous multi-core system
As interest in hybrid computing systems increases, people are eager to find new ways to exploit the unique and efficient computational power of heterogeneous multi-core systems. Although there has been much interest in implementing high-performance ...
Network-theoretic classification of parallel computation patterns
Parallel computation in a high-performance computing environment can be characterized by the distributed memory access patterns of the underlying algorithm. During execution, networks of compute nodes exchange messages that indirectly exhibit these ...
Characterization and transformation of unstructured control flow in bulk synchronous GPU applications
In this paper we identify important classes of program control flows in applications targeted to commercially available graphics processing units (GPUs) and characterize their presence in real workloads such as those that occur in CUDA and OpenCL. ...