In this paper we present an application for a parallel multigrid solver in 3D to solve the Coulomb problem for the charge self-interaction in a quantum-chemical program used to perform ab initio molecular dynamics. Techniques such as Mehrstellen discretization and τ-extrapolation are used to reduce the discretization error. The results show that the expected convergence rates and parallel performance of the multigrid solver are achieved. Within the applied Carr–Parrinello Molecular Dynamics scheme, the quality of the solution also determines the accuracy of energy conservation. All forms of discretization employed lead to energy-conserving dynamics. To test the applicability of our code to larger systems in a massively parallel environment, we investigated a 256-atom periodic supercell of bulk gallium nitride.
Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios multi-GPUs make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented using clusters equipped with varying node configurations.
Patch-based approaches in imaging require heavy computations on many small sub-blocks of images but are easily parallelizable, since different sub-blocks can usually be treated independently. To make these approaches useful in practical applications, efficient algorithms and implementations are required. Newer architectures like the Cell Broadband Engine Architecture (CBEA) even make it possible to come close to real-time performance for moderate image sizes. In this article we present performance results for image denoising on the CBEA. The image denoising is done by finding sparse representations of signals from a given overcomplete dictionary and assuming that noise cannot be represented sparsely. We compare our results with a standard multicore implementation and show the gain of the CBEA.
International Journal of Parallel, Emergent and Distributed Systems, 2013
In this paper, we describe an interactive real-time simulation of granular, spherical particles that can run on a single workstation. The simulation is based on a discrete element method approach and fully implemented using Open Computing Language, enabling execution on CPUs and GPUs alike. The simulation results are visualised using DirectX 10 and instancing. Furthermore, we enable the user to control the visualisation and the simulation intuitively by supporting user tracking and speech recognition, both using the Microsoft Kinect sensor. We also compare the performance of different implementation strategies on both CPUs and GPUs, and, as a sample application, we simulate the Brazil nut effect.
Three-dimensional (3-D) reconstruction of histological slice sequences offers great benefits in the investigation of different morphologies. It offers very high resolution, which is still unmatched by in vivo 3-D imaging modalities, and tissue staining further enhances visibility and contrast. One important step during reconstruction is the reversal of slice deformations introduced during histological slice preparation, a process also called image unwarping. Most methods use an external reference, or rely on conservative stopping criteria during the unwarping optimization to prevent straightening of naturally curved morphology. Our approach shows that the problem of unwarping is based on the superposition of low-frequency anatomy and high-frequency errors. We present an iterative scheme that transfers the ideas of the Gauss-Seidel method to image stacks to separate the anatomy from the deformation. In particular, the scheme is universally applicable without restriction to a specific unwarping method, and uses no external reference. The deformation artifacts are effectively reduced in the resulting histology volumes, while the natural curvature of the anatomy is preserved. The validity of our method is shown on synthetic data, simulated histology data using a CT data set, and real histology data. In the case of the simulated histology, where the ground truth was known, the mean Target Registration Error (TRE) between the unwarped and original volume could be reduced to less than 1 pixel after six iterations of our proposed method.
Papers by Harald Köstler