article
Astrophysical particle simulations with large custom GPU clusters on three continents

We present direct astrophysical N-body simulations with up to six million bodies using our parallel MPI-CUDA code on large GPU clusters in Beijing, Berkeley, and Heidelberg, with different kinds of GPU hardware. The clusters are linked in the ...

article
Optimized HPL for AMD GPU and multi-core CPU usage

The installation of the LOEWE-CSC ( http://csc.uni-frankfurt.de/csc/__ __51 ) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for ...

article
Simulation of bevel gear cutting with GPGPUs--performance and productivity

The desire for general purpose computation on graphics processing units caused the advance of new programming paradigms, e.g. OpenCL C/C++, CUDA C or the PGI Accelerator Model. In this paper, we apply these programming approaches to the software ...

article
Predictive analysis of a hydrodynamics application on large-scale CMP clusters

We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the ...

article
Shared-memory, distributed-memory, and mixed-mode parallelisation of a CFD simulation code

This paper presents some different approaches to the parallelisation of a harmonic balance Navier-Stokes solver for unsteady aerodynamics. Such simulation codes can require very large amounts of computational resource for realistic simulations, and ...

article
Wavelet-based adaptive multi-resolution solver on heterogeneous parallel architecture for computational fluid dynamics

For the efficient simulation of fluid flows governed by a wide range of scales a wavelet-based adaptive multi-resolution solver on heterogeneous parallel architectures is proposed for computational fluid dynamics. Both data- and task-based parallelisms ...

article
Automatic code generation and tuning for stencil kernels on modern shared memory architectures

In this paper, we present Patus, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. Patus, which stands for "Parallel Autotuned S ...

article
Designing and dynamically load balancing hybrid LU for multi/many-core

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly-tuned BLAS subroutines, hiding communication latency and balancing the load across devices of variable processing capabilities. In this paper we show ...

article
Scalable parallel AMG on ccNUMA machines with OpenMP

In many numerical simulation codes the backbone of the application covers the solution of linear systems of equations. Often, being created via a discretization of differential equations, the corresponding matrices are very sparse. One popular way to ...

article
Unbalanced tree search on a manycore system using the GPI programming model

Recent developments in computer architecture are moving towards systems with large core counts (manycore), which expose more parallelism to applications. Some applications, termed irregular and unbalanced, demand a dynamic and asynchronous ...

article
High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

Three-dimensional FFT is an important component of many scientific computing applications ranging from fluid dynamics, to astrophysics and molecular dynamics. P3DFFT is a widely used three-dimensional FFT package. It uses the Message Passing Interface (...

article
Mapping communication layouts to network hardware characteristics on massive-scale blue gene systems

For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time without any information about the application characteristics. ...

article
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters

Data parallel architectures, such as General Purpose Graphics Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and ...

article
The development of Mellanox/NVIDIA GPUDirect over InfiniBand--a new model for GPU to GPU communications

The usage and adoption of General Purpose GPUs (GPGPU) in HPC systems are increasing due to the unparalleled performance advantage of the GPUs and the ability to fulfill the ever-increasing demands for floating-point operations. While the GPU can ...

article
A system level view of Petascale I/O on IBM Blue Gene/P

Petascale supercomputers rely on highly efficient Petascale I/O subsystems. This work describes the tuning and scaling behavior of the GPFS parallel file system on JUGENE, the largest IBM Blue Gene/P installation worldwide and the first PetaFlop/s HPC ...

article
Baler: deterministic, lossless log message clustering tool

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, ...

article
Fault oblivious high performance computing with dynamic task replication and substitution

Traditional parallel programming techniques will suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. To address this challenge, we ...

article
Ultra low latency market data feed on IBM PowerEN™

Financial Market IT solutions increasingly depend on ultra low latency message processing and target microsecond latencies in order to provide traders with a competitive advantage over their peers. Some solutions are available on the market, ranging ...

article
A system architecture supporting high-performance and cloud computing in an academic consortium environment

The University of Colorado (CU) and the National Center for Atmospheric Research (NCAR) have been deploying complementary and federated resources supporting computational science in the Western United States since 2004. This activity has expanded to ...

article
Experiments with the Fresh Breeze tree-based memory model

The Fresh Breeze memory model and system architecture is proposed as an approach to achieving significant improvements in massively parallel computation by supporting fine-grain management of memory and processing resources and utilizing a global shared ...
