Astrophysical particle simulations with large custom GPU clusters on three continents
- R. Spurzem,
- P. Berczik,
- I. Berentzen,
- K. Nitadori,
- T. Hamada,
- G. Marcus,
- A. Kugel,
- R. Männer,
- J. Fiestas,
- R. Banerjee,
- R. Klessen
We present direct astrophysical N-body simulations with up to six million bodies using our parallel MPI-CUDA code on large GPU clusters in Beijing, Berkeley, and Heidelberg, with different kinds of GPU hardware. The clusters are linked in the ...
Optimized HPL for AMD GPU and multi-core CPU usage
The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/__ __51 ) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for ...
Simulation of bevel gear cutting with GPGPUs--performance and productivity
- Sandra Wienke,
- Dmytro Plotnikov,
- Dieter Mey,
- Christian Bischof,
- Ario Hardjosuwito,
- Christof Gorgels,
- Christian Brecher
The desire for general purpose computation on graphics processing units caused the advance of new programming paradigms, e.g. OpenCL C/C++, CUDA C or the PGI Accelerator Model. In this paper, we apply these programming approaches to the software ...
Predictive analysis of a hydrodynamics application on large-scale CMP clusters
We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the ...
Shared-memory, distributed-memory, and mixed-mode parallelisation of a CFD simulation code
This paper presents some different approaches to the parallelisation of a harmonic balance Navier-Stokes solver for unsteady aerodynamics. Such simulation codes can require very large amounts of computational resource for realistic simulations, and ...
Wavelet-based adaptive multi-resolution solver on heterogeneous parallel architecture for computational fluid dynamics
For the efficient simulation of fluid flows governed by a wide range of scales a wavelet-based adaptive multi-resolution solver on heterogeneous parallel architectures is proposed for computational fluid dynamics. Both data- and task-based parallelisms ...
Automatic code generation and tuning for stencil kernels on modern shared memory architectures
In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for "Parallel Autotuned S ...
Designing and dynamically load balancing hybrid LU for multi/many-core
Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly-tuned BLAS subroutines, hiding communication latency and balancing the load across devices of variable processing capabilities. In this paper we show ...
Scalable parallel AMG on ccNUMA machines with OpenMP
In many numerical simulation codes the backbone of the application covers the solution of linear systems of equations. Often, being created via a discretization of differential equations, the corresponding matrices are very sparse. One popular way to ...
Unbalanced tree search on a manycore system using the GPI programming model
Recent developments in computer architecture are moving towards systems with large core counts (manycore), which expose more parallelism to applications. Some applications, termed irregular and unbalanced, demand a dynamic and asynchronous ...
High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT
Three-dimensional FFT is an important component of many scientific computing applications, ranging from fluid dynamics to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package. It uses the Message Passing Interface (...
Mapping communication layouts to network hardware characteristics on massive-scale blue gene systems
For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time without any information about the application characteristics. ...
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and ...
The development of Mellanox/NVIDIA GPUDirect over InfiniBand--a new model for GPU to GPU communications
- Gilad Shainer,
- Ali Ayoub,
- Pak Lui,
- Tong Liu,
- Michael Kagan,
- Christian R. Trott,
- Greg Scantlen,
- Paul S. Crozier
The usage and adoption of General Purpose GPUs (GPGPU) in HPC systems is increasing due to the unparalleled performance advantage of the GPUs and the ability to fulfill the ever-increasing demands for floating-point operations. While the GPU can ...
A system level view of Petascale I/O on IBM Blue Gene/P
Petascale supercomputers rely on highly efficient Petascale I/O subsystems. This work describes the tuning and scaling behavior of the GPFS parallel file system on JUGENE, the largest IBM Blue Gene/P installation worldwide and the first PetaFlop/s HPC ...
Baler: deterministic, lossless log message clustering tool
The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, ...
Fault oblivious high performance computing with dynamic task replication and substitution
Traditional parallel programming techniques will suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. To address this challenge, we ...
Ultra low latency market data feed on IBM PowerEN™
Financial Market IT solutions increasingly depend on ultra low latency message processing and target microsecond latencies in order to provide traders with a competitive advantage over their peers. Some solutions are available on the market, ranging ...
A system architecture supporting high-performance and cloud computing in an academic consortium environment
The University of Colorado (CU) and the National Center for Atmospheric Research (NCAR) have been deploying complementary and federated resources supporting computational science in the Western United States since 2004. This activity has expanded to ...
Experiments with the Fresh Breeze tree-based memory model
The Fresh Breeze memory model and system architecture is proposed as an approach to achieving significant improvements in massively parallel computation by supporting fine-grain management of memory and processing resources and utilizing a global shared ...