Search Results (250)

Search Parameters:
Keywords = CUDA

21 pages, 4958 KiB  
Article
An Efficient GPU-Accelerated Algorithm for Solving Dynamic Response of Fluid-Saturated Porous Media
by Wancang Lin, Qinglong Zhou, Xinyi Chen, Wenhao Shi and Jie Ai
Mathematics 2025, 13(2), 181; https://doi.org/10.3390/math13020181 - 7 Jan 2025
Viewed by 709
Abstract
The traditional finite element program is executed on the CPU; however, it is challenging for the CPU to compute the ultra-large scale finite element model. In this paper, we present a set of efficient algorithms based on GPU acceleration technology for the dynamic response of fluid-saturated porous media, named PNAM, encompassing the assembly of the global matrix and the iterative solution of equations. In the assembly part, the CSR storage format of the global matrix is directly obtained from the element matrix. For data with two million degrees of freedom, it merely takes approximately 1 s to generate all the data of global matrices, which is significantly superior to the CPU version. Regarding the iterative solution of equations, a novel algorithm based on the CUDA kernel function is proposed. For a data set with two million degrees of freedom, it takes only about 0.05 s to compute an iterative step and transfer the data to the CPU. The program is designed to calculate either in single or double precision. The change in precision has little impact on the assembly of the global matrix, but the calculation time of double precision is generally 1.5 to 2 times that of single precision in the iterative solution part for a model with 2 million degrees of freedom. PNAM has high computational efficiency and great compatibility, which can be used to solve not only saturated fluid problems but also a variety of other problems.
Figures
Figure 1. The flow chart of the system program.
Figure 2. (left) The corresponding relationship between the planar four-node finite element model and its data file; (right) Element 2 and element 1 have two shared nodes whose element matrices are assembled into the global matrix with some shared parts.
Figure 3. The flow chart of the preprocessed data generation through four steps.
Figure 4. (left) Description of Hillis/Steele scan algorithm for a thread block; (right) Divides node_count into several thread blocks, and each thread block uses Hillis/Steele scan to obtain acc_ele_count0. Each number in every block adds the previous last number of each block and then obtains the acc_ele_count.
Figure 5. The relationship between four preprocessed data.
Figure 6. Generation of CSR_offset data for global stiffness matrix.
Figure 7. Generation of CSR_offset data for global stiffness matrix (node 1).
Figure 8. (top) Firstly, eleNloc was used to obtain the element and location of the node; the element stiffness matrix was then calculated from the FEM data files and compared with node_list; (bottom) The location is used to ascertain the data in the element stiffness matrix to be written into the global stiffness matrix.
Figure 9. Two-dimensional saturated soil model and load.
Figure 10. The displacement and pore pressure time history curves of the three observation points. (above) Results of GPU calculation; (below) Results of CPU calculation.
Figure 11. Influence of precision on algorithm (KF means kernel function).
Figure 12. The effect of model size on algorithm (KF means kernel function).
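The block-wise prefix sum described in the Figure 4 caption of this entry can be illustrated with a short sequential sketch. It is not the paper's CUDA kernel: the function names and the `block_size` parameter are hypothetical, and each list comprehension only mimics one parallel pass of a thread block. The names `node_count`, `acc_ele_count0` and `acc_ele_count` are taken from the caption.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum of one block, Hillis/Steele style."""
    current = list(values)
    stride = 1
    while stride < len(current):
        # Double buffering: every "thread" i reads values from the previous pass.
        current = [
            current[i] + current[i - stride] if i >= stride else current[i]
            for i in range(len(current))
        ]
        stride *= 2
    return current


def blockwise_scan(node_count, block_size=4):
    """Scan each block independently, then add the totals of earlier blocks."""
    blocks = [
        node_count[start:start + block_size]
        for start in range(0, len(node_count), block_size)
    ]
    # Per-block scans (acc_ele_count0 in the caption's naming).
    partial = [hillis_steele_scan(block) for block in blocks]
    acc_ele_count = []
    carry = 0  # sum of all elements in the blocks already processed
    for block in partial:
        acc_ele_count.extend(value + carry for value in block)
        carry += block[-1]
    return acc_ele_count
```

In a CSR layout, a running total of this kind is what turns per-row entry counts into row offsets, which is presumably why the scan appears in the preprocessing stage.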
22 pages, 620 KiB  
Article
Ego-Motion Estimation for Autonomous Vehicles Based on Genetic Algorithms and CUDA Parallel Processing
by Abiel Aguilar-González and Alejandro Medina Santiago
Algorithms 2025, 18(1), 19; https://doi.org/10.3390/a18010019 - 3 Jan 2025
Viewed by 501
Abstract
Estimating ego-motion in autonomous vehicles is critical for tasks such as localization, navigation, obstacle avoidance, and so on. While traditional methods often rely on direct pose estimation or AI-based approaches, these can be computationally intensive, especially for small, incremental movements typically observed between consecutive frames. In this work, we propose a brute-force-based ego-motion estimation algorithm that takes advantage of the constraints of autonomous vehicles, which are assumed to have only three degrees of freedom (x, y, and yaw). Our approach is based on a genetic algorithm to efficiently explore potential vehicle movements. By generating an initial seed of random motion candidates and iteratively mutating and selecting the best-performing individuals, we minimize the cost function that measures image similarity between frames. Furthermore, we implement the algorithm using CUDA to exploit parallel processing, significantly improving computational speed. Experimental results demonstrate that our approach achieves accurate ego-motion estimation with high efficiency, making it suitable for real-time autonomous vehicle applications.
(This article belongs to the Section Parallel and Distributed Algorithms)
Figures
Figure 1. Block diagram of the proposed algorithm.
Figure 2. Flow chart of the proposed algorithm.
Figure 3. The algorithm’s performance was evaluated on the KITTI dataset’s training sequences, with results demonstrating consistently high accuracy across multiple test cases, closely matching the ground truth. These outcomes confirm that the model effectively captures vehicle motion and spatial consistency within diverse urban and suburban environments featured in KITTI. The blue line represents the ground truth, while the green line shows the estimated ego-motion using the proposed algorithm.
Figure 4. The performance of the algorithm, sequences 11 to 14 from the KITTI dataset, without ground truth data. The results were obtained from the KITTI evaluation platform, where our algorithm was submitted for evaluation. These evaluations demonstrated that the proposed algorithm maintains a high level of precision across all tested scenarios.
Figure 5. Performance for the pose estimation step under the proposed dataset. Sequence 48, which consists of an x, y, and yaw camera movement, is used to validate the performance under loop trajectories. By using the proposed algorithm, an accuracy of around 97.4% can be reached.
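The seed, mutate and select loop described in this entry's abstract can be sketched in a few lines. This is a CPU-only illustration under stated assumptions, not the authors' CUDA implementation: `toy_cost` is a hypothetical stand-in for the image-similarity cost (which the listing does not specify), and the population size, mutation width `sigma` and generation count are arbitrary.

```python
import random


def toy_cost(candidate, target=(0.3, -0.2, 0.05)):
    """Hypothetical cost: squared distance to an assumed true (x, y, yaw)."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))


def evolve(cost, generations=50, population_size=30, keep=10, sigma=0.1, seed=0):
    rng = random.Random(seed)
    # Initial seed of random motion candidates (x, y, yaw).
    population = [
        tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        survivors = population[:keep]  # selection: keep the best individuals
        children = [
            # Mutation: perturb a randomly chosen survivor with Gaussian noise.
            tuple(gene + rng.gauss(0.0, sigma) for gene in rng.choice(survivors))
            for _ in range(population_size - keep)
        ]
        population = survivors + children
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history
```

Because the survivors are carried over unchanged, the best cost never increases from one generation to the next. Evaluating the cost of each candidate is independent of the others, which is the part that lends itself to parallel execution.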
21 pages, 4675 KiB  
Article
A Parallel Framework for Fast Charge/Discharge Scheduling of Battery Storage Systems in Microgrids
by Wei-Tzer Huang, Wu-Chun Chung, Chao-Chin Wu and Tse-Yun Huang
Energies 2024, 17(24), 6371; https://doi.org/10.3390/en17246371 - 18 Dec 2024
Viewed by 544
Abstract
Fast charge/discharge scheduling of battery storage systems is essential in microgrids to effectively balance variable renewable energy sources, meet fluctuating demand, and maintain grid stability. To achieve this, parallel processing is employed, allowing batteries to respond instantly to dynamic conditions. By managing the complexity, high data volume, and rapid decision-making requirements in real time, parallel processing ensures that the microgrid operates with stability, efficiency, and safety. With the application of deep reinforcement learning (DRL) in scheduling algorithm design, the demand for computational power has further increased significantly. To address this challenge, we propose a Ray-based parallel framework to accelerate the development of fast charge/discharge scheduling for battery storage systems in microgrids. We demonstrate how to implement a real-world scheduling problem in the framework. We focused on minimizing power losses and reducing the ramping rate of net loads by leveraging the Asynchronous Advantage Actor Critic (A3C) algorithms and the features of the Ray cluster for real-time decision making. Multiple instances of OpenDSS were executed concurrently, with each instance simulating a distinct environment and efficiently processing input data. Additionally, Numba CUDA was utilized to facilitate GPU acceleration of shared memory, significantly enhancing the performance of the computationally intensive reward function in A3C. The proposed framework enhanced scheduling performance, enabling efficient energy management in complex, dynamic microgrid environments.
(This article belongs to the Section A1: Smart Grids and Microgrids)
Figures
Figure 1. The one-node simulation of power system with a single instance of OpenDSS.
Figure 2. The A3C learning block.
Figure 3. Proposed multi-node power system simulation with multiple parallel OpenDSS.
Figure 4. The system architecture of the Ray framework.
Figure 5. A schematic of the improved internal structure.
Figure 6. The Numba CUDA kernel function of the A3C reward function.
Figure 7. The integrated computing structure.
Figure 8. A single-line diagram of the NCUE microgrid.
Figure 9. Before and after the BESS schedule results in a peak load day: (a) daily net load and (b) daily charging and discharging schedule.
Figure 10. Before and after the BESS schedule results in an off-peak load day: (a) daily net load and (b) daily charging and discharging schedule.
Figure 11. Performance improvements of A3C using Numba and GPU shared memory.
Figure 12. Moving average of rewards for the 2-node Ray cluster.
Figure 13. Moving average of rewards for the 3-node Ray cluster.
40 pages, 1079 KiB  
Article
Context-Adaptable Deployment of FastSLAM 2.0 on Graphic Processing Unit with Unknown Data Association
by Jessica Giovagnola, Manuel Pegalajar Cuéllar and Diego Pedro Morales Santos
Appl. Sci. 2024, 14(23), 11466; https://doi.org/10.3390/app142311466 - 9 Dec 2024
Viewed by 965
Abstract
Simultaneous Localization and Mapping (SLAM) algorithms are crucial for enabling agents to estimate their position in unknown environments. In autonomous navigation systems, these algorithms need to operate in real-time on devices with limited resources, emphasizing the importance of reducing complexity and ensuring efficient performance. While SLAM solutions aim at ensuring accurate and timely localization and mapping, one of their main limitations is their computational complexity. In this scenario, particle filter-based approaches such as FastSLAM 2.0 can significantly benefit from parallel programming due to their modular construction. The parallelization process involves identifying the parameters affecting the computational complexity in order to distribute the computation among single multiprocessors as efficiently as possible. However, the computational complexity of methodologies such as FastSLAM 2.0 can depend on multiple parameters whose values may, in turn, depend on each specific use case scenario (i.e., the context), leading to multiple possible parallelization designs. Furthermore, the features of the hardware architecture in use can significantly influence the performance in terms of latency. Therefore, the selection of the optimal parallelization modality still needs to be empirically determined. This may involve redesigning the parallel algorithm depending on the context and the hardware architecture. In this paper, we propose a CUDA-based adaptable design for FastSLAM 2.0 on GPU, in combination with an evaluation methodology that enables the assessment of the optimal parallelization modality based on the context and the hardware architecture without the need for the creation of separate designs. The proposed implementation includes the parallelization of all the functional blocks of the FastSLAM 2.0 pipeline. Additionally, we contribute a parallelized design of the data association step through the Joint Compatibility Branch and Bound (JCBB) method. Multiple resampling algorithms are also included to accommodate the needs of a wide variety of navigation scenarios.
Figures
Figure 1. FastSLAM 2.0 pipeline.
Figure 2. Observation model: graphical representation.
Figure 3. Hardware–software architecture schema.
Figure 4. Simulation environment schema.
Figure 5. Functional blocks partitioning schema.
Figure 6. Detailed heterogeneous architecture pipeline.
Figure 7. Data association pipeline.
Figure 8. Particle Initialization: elapsed time.
Figure 9. Particle Prediction: elapsed time.
Figure 10. Mahalanobis Distance: elapsed time.
Figure 11. Problem Preparation: elapsed time.
Figure 12. Branch and Bound: elapsed time.
Figure 13. Proposal Adjustment: elapsed time.
Figure 14. Landmark Estimation: elapsed time.
Figure 15. Resampling traditional methods: elapsed time.
Figure 16. Resampling alternative methods: elapsed time.
10 pages, 620 KiB  
Article
Serum Tau Species in Progressive Supranuclear Palsy: A Pilot Study
by Costanza Maria Cristiani, Luana Scaramuzzino, Elvira Immacolata Parrotta, Giovanni Cuda, Aldo Quattrone and Andrea Quattrone
Diagnostics 2024, 14(23), 2746; https://doi.org/10.3390/diagnostics14232746 - 5 Dec 2024
Viewed by 722
Abstract
Background/Objectives: Progressive Supranuclear Palsy (PSP) is a tauopathy showing a marked symptoms overlap with Parkinson’s Disease (PD). PSP pathology suggests that tau protein might represent a valuable biomarker to distinguish between the two diseases. Here, we investigated the presence and diagnostic value of six different tau species (total tau, 4R-tau isoform, tau aggregates, p-tau202, p-tau231 and p-tau396) in serum from 13 PSP and 13 PD patients and 12 healthy controls (HCs). Methods: ELISA commercial kits were employed to assess all the tau species except for t-tau, which was assessed by a single molecule array (SIMOA)-based commercial kit. Possible correlations between tau species and biological and clinical features of our cohorts were also evaluated. Results: Among the six tau species tested, only p-tau396 was detectable in serum. Concentration of p-tau396 was significantly higher in both PSP and PD groups compared to HC, but PSP and PD patients showed largely overlapping values. Moreover, serum concentration of p-tau396 strongly correlated with disease severity in PSP and not in PD. Conclusions: Overall, we identified serum p-tau396 as the most expressed phosphorylated tau species in serum and as a potential tool for assessing PSP clinical staging. Moreover, we demonstrated that other p-tau species may be present at too low concentrations in serum to be detected by ELISA, suggesting that future work should focus on other biological matrices.
Figures
Figure 1. Serum concentration of p-tau396 in PSP (n = 13), PD (n = 13) and HC (n = 12). Data are summarized as box plots. Ranges are depicted as vertical lines while median, 25th percentile and 75th percentile are depicted as middle, lower and upper lines, respectively. Data were analyzed by ANOVA followed by Tukey’s LSD post hoc test. PSP = progressive supranuclear palsy; PD = Parkinson’s disease; HC = healthy control.
Figure 2. Correlation between serum p-tau396 levels and PSP Rating Scale in PSP patients. The analysis was performed by Spearman’s correlation test, and the obtained rho coefficient and p-value are reported in the plot.
20 pages, 14037 KiB  
Article
Algorithmic Efficiency in Convex Hull Computation: Insights from 2D and 3D Implementations
by Hyun Kwon, Sehong Oh and Jang-Woon Baek
Symmetry 2024, 16(12), 1590; https://doi.org/10.3390/sym16121590 - 28 Nov 2024
Cited by 2 | Viewed by 1449
Abstract
This study examines various algorithms for computing the convex hull of a set of n points in a d-dimensional space. Convex hulls are fundamental in computational geometry and are applied in computer graphics, pattern recognition, and computational biology. Such convex hulls can also be useful in symmetry problems. For instance, when points are arranged symmetrically, the convex hull is also likely to be symmetrically shaped, which can be useful for object recognition in computer vision or pattern recognition. The focus is primarily on two-dimensional algorithms, including well-known methods like Gift Wrapping, Graham Scan, Divide and Conquer, QuickHull, TORCH, Kirkpatrick–Seidel, and Chan’s algorithms. These algorithms vary in terms of time complexity and scalability to higher dimensions. This study is extended to three-dimensional convex hull algorithms, such as NAW, randomized insertion, and parallelized versions, such as CudaHull and CudaChain. This study aimed to elucidate the operational principles, step-by-step procedures, and comparative time complexities of each algorithm. The implementation in Python facilitates a detailed comparison of the algorithmic performance through stepwise analysis and graphical outputs. The ultimate goal is to provide insights into the strengths and weaknesses of each algorithm under various scenarios, thereby offering a comprehensive guide for practical implementation.
(This article belongs to the Section Computer)
Figures
Figure 1. Implementation of the Jarvis’s March algorithm.
Figure 2. Implementation of the Graham Scan algorithm.
Figure 3. Implementation of the Divide and Conquer algorithm.
Figure 4. Implementation of Chan’s algorithm.
Figure 5. Implementation of the QuickHull algorithm.
Figure 6. Implementation of the Kirkpatrick–Seidel algorithm.
Figure 7. Implementation of the TORCH algorithm.
Figure 8. Overview of the general case.
Figure 9. Average time for several 2D algorithms for each point (point shapes: Disc, Square, and Though).
Figure 10. Average time for several 2D algorithms for each point (point shapes: Circle and Star).
Figure 11. Implementation of the Jarvis’s March (Gift Wrapping) algorithm.
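Of the two-dimensional methods listed in this entry's abstract, Gift Wrapping (Jarvis's March) is the simplest to show in full. The following is a minimal sketch written for this listing, not the paper's Python implementation; the function names are hypothetical, and collinear boundary points are dropped by always preferring the farthest candidate.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); negative when b lies to the right of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def gift_wrapping(points):
    """Return the convex hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    start = pts[0]  # leftmost point (lowest y on ties) is always on the hull
    hull = []
    current = start
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:
            if p == current:
                continue
            turn = cross(current, candidate, p)
            # Wrap to the most clockwise point; on ties keep the farthest one.
            if turn < 0 or (turn == 0 and dist2(current, p) > dist2(current, candidate)):
                candidate = p
        current = candidate
        if current == start:
            break
    return hull
```

Each hull vertex costs one pass over all points, so the running time is O(nh) for n points and h hull vertices, which is why the method is attractive only when the hull is small.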
18 pages, 1757 KiB  
Article
End-to-End Deployment of Winograd-Based DNNs on Edge GPU
by Pierpaolo Mori, Mohammad Shanur Rahman, Lukas Frickenstein, Shambhavi Balamuthu Sampath, Moritz Thoma, Nael Fasfous, Manoj Rohit Vemparala, Alexander Frickenstein, Walter Stechele and Claudio Passerone
Electronics 2024, 13(22), 4538; https://doi.org/10.3390/electronics13224538 - 19 Nov 2024
Viewed by 940
Abstract
The Winograd algorithm reduces the computational complexity of convolutional neural networks (CNNs) by minimizing the number of multiplications required for convolutions, making it particularly suitable for resource-constrained edge devices. Concurrently, most edge hardware accelerators utilize 8-bit integer arithmetic to enhance energy efficiency and reduce inference latency, requiring the quantization of CNNs before deployment. Combining Winograd-based convolution with quantization offers the potential for both performance acceleration and reduced energy consumption. However, prior research has identified significant challenges in this combination, particularly due to numerical instability and substantial accuracy degradation caused by the transformations required in the Winograd domain, making the two techniques incompatible on edge hardware. In this work, we describe our latest training scheme, which addresses these challenges, enabling the successful integration of Winograd-accelerated convolution with low-precision quantization while maintaining high task-related accuracy. Our approach mitigates the numerical instability typically introduced during the transformation, ensuring compatibility between the two techniques. Additionally, we extend our work by presenting a custom-optimized CUDA implementation of quantized Winograd convolution for NVIDIA edge GPUs. This implementation takes full advantage of the proposed training scheme, achieving both high computational efficiency and accuracy, making it a compelling solution for edge-based AI applications. Our training approach enables significant MAC reduction with minimal impact on prediction quality. Furthermore, our hardware results demonstrate up to a 3.4× latency reduction for specific layers, and a 1.44× overall reduction in latency for the entire DeepLabV3 model, compared to the standard implementation.
(This article belongs to the Section Artificial Intelligence)
Figures
Figure 1. The three steps of the F(4, 3) Winograd algorithm: (1) input and weight transformation, (2) element-wise matrix multiplication (EWMM) of the transformed matrices, and (3) inverse transformation to produce the spatial output feature maps. The numerical instability due to quantization is highlighted.
Figure 2. Comparison of the (a) standard Winograd quantized transformation against (b) the Winograd quantized transformation that leverages trainable clipping factors to better exploit the quantized range.
Figure 3. Overview of the proposed Winograd aware quantized training. Straight-through estimator (STE) is used to approximate the gradient of the quantization function. Trainable clipping factors c, α_ta, and α_tw are highlighted in red.
Figure 4. Input transformation kernel overview. The input volume is divided in sub-volumes and each thread block is responsible for the transformation of a sub-volume.
Figure 5. Element-wise matrix multiplication kernel overview. The computation is organized in 6 × 6 GEMMs. Each one is responsible for the computation of N_tiles × C_o output pixels in the Winograd domain.
Figure 6. Inverse transformation kernel overview. The Winograd tiles produced by the EWMM kernel are transformed back to the spatial domain. Each thread block is responsible for the computation of a 4 × 4 × P_oc output pixel.
Figure 7. Numerical distributions of example layers for transformed weights and activations of ResNet-20 on CIFAR-10. The values in the clipped range (green) sufficiently contain the information needed to maintain high-accuracy full 8-bit Winograd.
Figure 8. Latency speedup brought by the custom Winograd F(4, 3) kernels compared to cuDNN convolution on Tensor Cores (int8x32).
Figure 9. The latency contribution of each of the three steps in the Winograd F(4, 3) algorithm. In each sub-figure, the spatial dimensions are fixed, while the channel dimensions are varied.
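The multiplication saving that motivates Winograd convolution is easiest to see in the smallest one-dimensional case, F(2, 3), rather than the two-dimensional F(4, 3) variant used in this entry. The sketch below is an illustration of that smaller case only, in floating point and without quantization; the function names are hypothetical. It produces two outputs of a 3-tap filter with four multiplications where the direct form needs six.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs: six multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Same two outputs using four multiplications (Winograd F(2, 3))."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Input transform followed by the element-wise products.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Inverse transform back to the spatial domain.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The divisions by 2 in the filter transform, and the larger constants that appear in F(4, 3), widen the value range of the transformed data. That is the kind of effect the abstract refers to when it describes numerical instability under low-precision quantization.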
22 pages, 689 KiB  
Article
GPU Accelerating Algorithms for Three-Layered Heat Conduction Simulations
by Nicolás Murúa, Aníbal Coronel, Alex Tello, Stefan Berres and Fernando Huancas
Mathematics 2024, 12(22), 3503; https://doi.org/10.3390/math12223503 - 9 Nov 2024
Viewed by 726
Abstract
In this paper, we consider the finite difference approximation for a one-dimensional mathematical model of heat conduction in a three-layered solid with interfacial conditions for temperature and heat flux between the layers. The finite difference scheme is unconditionally stable, convergent, and equivalent to the solution of two linear algebraic systems. We evaluate various methods for solving the involved linear systems by analyzing direct and iterative solvers, including GPU-accelerated approaches using CuPy and PyCUDA. We evaluate performance and scalability and contribute to advancing computational techniques for modeling complex physical processes accurately and efficiently.
(This article belongs to the Special Issue Advances in High-Performance Computing, Optimization and Simulation)
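Among the iterative solvers this entry evaluates, the figure captions name the Jacobi method. The sketch below shows only the mechanics of a Jacobi sweep on a small system. The listing does not give the paper's matrices, so the diagonally dominant tridiagonal matrix used in the usage note is an assumption chosen so that the iteration converges; the caption of Figure 4 reports that Jacobi did not solve the paper's own system correctly.

```python
def jacobi(a, b, iterations=200):
    """Approximate the solution of a x = b with a fixed number of Jacobi sweeps."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every component is updated from the previous iterate only, so all
        # n updates are independent and could run in parallel.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x
```

For example, with the hypothetical matrix `[[4, -1, 0], [-1, 4, -1], [0, -1, 4]]` and right-hand side `[2, 4, 10]`, the iterates approach `[1, 2, 3]`. Convergence is not guaranteed for general matrices, which is one reason direct solvers such as LU remain the reference in comparisons of this kind.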
Show Figures

Figure 1

Figure 1
<p>Visual representation of the three-layered solid.</p>
Full article ">Figure 2
<p>General flowchart for the numerical computation of the solution of (<a href="#FD9-mathematics-12-03503" class="html-disp-formula">9</a>)–(<a href="#FD12-mathematics-12-03503" class="html-disp-formula">12</a>) by applying the finite difference scheme (<a href="#FD13-mathematics-12-03503" class="html-disp-formula">13</a>) and (14).</p>
Full article ">Figure 3
<p>Specific flowchart for the numerical computation of the solution of (<a href="#FD9-mathematics-12-03503" class="html-disp-formula">9</a>)–(<a href="#FD12-mathematics-12-03503" class="html-disp-formula">12</a>) by applying the finite difference scheme (<a href="#FD13-mathematics-12-03503" class="html-disp-formula">13</a>) and (14).</p>
Full article ">Figure 4
<p>Comparison between analytical solution and numerical solution using the Jacobi method: (<b>a</b>) Analytical solution, (<b>b</b>) Jacobi method with <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>64</mn> </mrow> </semantics></math> and <math display="inline"><semantics> <mrow> <mi>N</mi> <mo>=</mo> <mn>500</mn> </mrow> </semantics></math>. The numerical temperature profile is clearly different from the analytic temperature profile. The inconsistency originated in the incorrect solution of the linear system by the selected linear solver.</p>
Full article ">Figure 5
<p>Numerical temperature profiles obtained with the conjugate gradient method, with (<b>a</b>) <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>64</mn> </mrow> </semantics></math> and <math display="inline"><semantics> <mrow> <mi>N</mi> <mo>=</mo> <mn>1000</mn> </mrow> </semantics></math>, and (<b>b</b>) <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math> and <math display="inline"><semantics> <mrow> <mi>N</mi> <mo>=</mo> <mn>5000</mn> </mrow> </semantics></math>. The numerical temperature profiles are clearly different from the analytic temperature profiles. The figures show the inconsistency of the linear solver to approximate the linear system of the difference scheme.</p>
Figure 6
<p>Temperature profiles for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math> and <span class="html-italic">N</span> = 10,000 obtained with (<b>a</b>) LU method in the case, (<b>b</b>) QR method. The figure shows that the numerical temperature profiles converge to the analytical temperature profile when the LU linear solver is used to solve the linear system of the difference scheme.</p>
Figure 7
<p>Comparison of computational times as <span class="html-italic">N</span> increases with a fixed value of <math display="inline"><semantics> <msub> <mi>m</mi> <mi>i</mi> </msub> </semantics></math> for steps 3 and 4 on GPU and CPU in logarithmic scale: (<b>a</b>) Time in seconds for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>128</mn> </mrow> </semantics></math>, (<b>b</b>) Speed-up obtained with GPU for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>128</mn> </mrow> </semantics></math>, (<b>c</b>) Time in seconds for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math>, (<b>d</b>) Speed-up obtained with GPU for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math>.</p>
Figure 8
<p>Comparison of computational times as <span class="html-italic">N</span> increases with a fixed value of <math display="inline"><semantics> <msub> <mi>m</mi> <mi>i</mi> </msub> </semantics></math> for LU and QR solvers on GPU and CPU in logarithmic scale: (<b>a</b>) Time in seconds for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>128</mn> </mrow> </semantics></math>, (<b>b</b>) Speed-up obtained with GPU for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>128</mn> </mrow> </semantics></math>, (<b>c</b>) Time in seconds for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math>, (<b>d</b>) Speed-up obtained with GPU for <math display="inline"><semantics> <mrow> <msub> <mi>m</mi> <mi>i</mi> </msub> <mo>=</mo> <mn>2048</mn> </mrow> </semantics></math>.</p>
17 pages, 3237 KiB  
Article
ssc-cdi: A Memory-Efficient, Multi-GPU Package for Ptychography with Extreme Data
by Yuri Rossi Tonin, Alan Zanoni Peixinho, Mauro Luiz Brandao-Junior, Paola Ferraz and Eduardo Xavier Miqueles
J. Imaging 2024, 10(11), 286; https://doi.org/10.3390/jimaging10110286 - 7 Nov 2024
Viewed by 1356
Abstract
We introduce <tt>ssc-cdi</tt>, an open-source software package from the Sirius Scientific Computing family, designed for memory-efficient, single-node multi-GPU ptychography reconstruction. <tt>ssc-cdi</tt> offers a range of reconstruction engines in Python version 3.9.2 and C++/CUDA. It aims at developing local expertise and customized solutions to meet the specific needs of the beamlines and user community of the Brazilian Synchrotron Light Laboratory (LNLS). We demonstrate ptychographic reconstruction of beamline data and present benchmarks for the package. Results show that <tt>ssc-cdi</tt> effectively handles extreme datasets typical of modern X-ray facilities without significantly compromising performance, offering a complementary approach to well-established packages of the community and serving as a robust tool for high-resolution imaging applications. Full article
(This article belongs to the Special Issue Recent Advances in X-ray Imaging)
Figure 1
<p>Diagram illustrating the batch distributions of measurements for three GPUs. Each colored block represents data inside the respective GPU. A batch of <math display="inline"><semantics> <mrow> <mi>B</mi> <mo>≤</mo> <mi>N</mi> </mrow> </semantics></math> measurements is distributed to the GPU memory, so that the wavefronts are updated in parallel. Once a GPU finishes processing and is made available, a remaining batch of unprocessed data is loaded from RAM. After all batches have been loaded and all the wavefronts updated, the new object and probe matrices are calculated by GPU<sub>0</sub> and then broadcast to the other GPUs, so that each of them has faster access to <span class="html-italic">O</span> and <span class="html-italic">P</span> in the subsequent iteration.</p>
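The batch distribution described in this caption can be sketched in plain Python. This is a minimal simulation under stated assumptions, not code from <tt>ssc-cdi</tt>: the names `split_batches` and `assign_batches` are hypothetical, no GPU is touched, and the processing time of a batch is assumed to be proportional to its length. The final update of O and P on GPU 0 and the broadcast to the other GPUs are not modeled.

```python
import heapq


def split_batches(n_measurements, batch_size):
    """Split measurement indices 0..N-1 into consecutive batches of at most B items."""
    return [range(start, min(start + batch_size, n_measurements))
            for start in range(0, n_measurements, batch_size)]


def assign_batches(batches, n_gpus):
    """Hand each batch to whichever simulated GPU becomes free first.

    Assumption for illustration only: processing time equals the batch length.
    """
    free_at = [(0.0, gpu) for gpu in range(n_gpus)]  # (time when free, GPU id)
    heapq.heapify(free_at)
    assignment = {gpu: [] for gpu in range(n_gpus)}
    for batch in batches:
        time_free, gpu = heapq.heappop(free_at)      # earliest available GPU
        assignment[gpu].append(batch)
        heapq.heappush(free_at, (time_free + len(batch), gpu))
    return assignment


batches = split_batches(n_measurements=10, batch_size=3)
assignment = assign_batches(batches, n_gpus=3)
for gpu, work in assignment.items():
    print(gpu, [list(b) for b in work])
```

With 10 measurements, a batch size of 3, and three simulated GPUs, the first three batches go to one GPU each, and the remaining short batch goes to the GPU that frees up first.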
Figure 2
<p>Ptychography reconstruction of a Siemens Star measured at CARNAÚBA beamline. The finest features of the innermost circles are spaced <math display="inline"><semantics> <mrow> <mn>15</mn> <mspace width="0.166667em"/> <mi>nm</mi> </mrow> </semantics></math> from each other. The complex probe is shown in an HSV colormap, with saturation encoding magnitude and hue encoding phase.</p>
Figure 3
<p>Comparison of the simulated sample against the reconstruction using the DM algorithm from different packages. The insets show a zoomed region from the red square in the object and phase reconstructions. The reconstruction for <tt>ssc-cdi</tt> used the RAAR algorithm with parameter <math display="inline"><semantics> <mrow> <mi>β</mi> <mo>=</mo> <mn>1</mn> </mrow> </semantics></math>, such that the update function equals that of DM. For <tt>PyNX</tt> and <tt>PtyPy</tt>, we used the DM engine directly. In all cases, the same initial guesses were used: random magnitude and constant phase for the object array, and an inverse Fourier transform of the averaged measurements for the probe.</p>
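The statement that RAAR with β = 1 has the same update function as DM can be checked numerically. The sketch below assumes the standard textbook forms of both updates, with R = 2P − I denoting a reflector: RAAR as β/2·(R_S R_M + I)x + (1 − β)·P_M x, and the difference map with its own parameter fixed to 1 as x + P_S(2P_M x − x) − P_M x. The support and Fourier-modulus projections are simple stand-ins chosen for a single diffraction pattern; they are not the constraint operators used by <tt>ssc-cdi</tt>, and all function names are hypothetical.

```python
import numpy as np


def project_support(x, mask):
    """Support constraint: zero everything outside the mask."""
    return x * mask


def project_modulus(x, measured):
    """Fourier-modulus constraint: keep the phase, impose the measured modulus."""
    spectrum = np.fft.fft2(x)
    return np.fft.ifft2(measured * np.exp(1j * np.angle(spectrum)))


def raar_step(x, p_s, p_m, beta):
    """RAAR update: beta/2 * (R_S R_M + I) x + (1 - beta) * P_M x, with R = 2P - I."""
    pm = p_m(x)
    r_m = 2.0 * pm - x                 # reflection through the modulus constraint
    r_s = 2.0 * p_s(r_m) - r_m         # reflection through the support constraint
    return 0.5 * beta * (r_s + x) + (1.0 - beta) * pm


def dm_step(x, p_s, p_m):
    """Difference-map update with its parameter set to 1."""
    pm = p_m(x)
    return x + p_s(2.0 * pm - x) - pm


rng = np.random.default_rng(0)
shape = (16, 16)
mask = np.zeros(shape)
mask[4:12, 4:12] = 1.0
measured = np.abs(np.fft.fft2(rng.standard_normal(shape) * mask))
x0 = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

p_s = lambda x: project_support(x, mask)
p_m = lambda x: project_modulus(x, measured)

same_at_beta_1 = np.allclose(raar_step(x0, p_s, p_m, beta=1.0), dm_step(x0, p_s, p_m))
same_at_beta_half = np.allclose(raar_step(x0, p_s, p_m, beta=0.5), dm_step(x0, p_s, p_m))
print(same_at_beta_1, same_at_beta_half)
```

Expanding the reflectors gives RAAR(β) = β·DM(x) + (1 − β)·P_M x, so the two updates coincide only at β = 1.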
Figure 4
<p>Single GPU performance of DM and PIE algorithms across different packages. The inset shows the same data without log scale on the vertical axis. Missing points on some curves indicate dimensions that were not supported by a specific engine. Note that <tt>PyNX</tt> does not provide an engine for an algorithm of the PIE family for comparison.</p>
Figure 5
<p>Multi-GPU performance of DM algorithm for <tt>ssc-cdi</tt> and <tt>PtyPy</tt> using batch sizes of (<b>a</b>) 128 and (<b>b</b>) 16. Missing points on some curves indicate dimensions that were not supported by an engine. The inset plots the same data without log scale on the vertical axis.</p>
Figure 6
<p>Single GPU performance of <tt>ssc-cdi</tt> for DM and PIE engines on a conventional machine. RAAR was run with batch size <math display="inline"><semantics> <mrow> <mi>B</mi> <mo>=</mo> <mn>1</mn> </mrow> </semantics></math> and managed to run up to a data size of <math display="inline"><semantics> <msup> <mn>2048</mn> <mn>2</mn> </msup> </semantics></math>.</p>
17 pages, 1369 KiB  
Article
Enabling Parallel Performance and Portability of Solid Mechanics Simulations Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Russell Marki, Robert Robey and Marko Knezevic
Information 2024, 15(11), 716; https://doi.org/10.3390/info15110716 - 7 Nov 2024
Viewed by 949
Abstract
Efficiently simulating solid mechanics is vital across various engineering applications. As constitutive models grow more complex and simulations scale up in size, harnessing the capabilities of modern computer architectures has become essential for achieving timely results. This paper presents advancements in running parallel simulations of solid mechanics on multi-core CPUs and GPUs using a single-code implementation. This portability is made possible by the C++ matrix and array (MATAR) library, which interfaces with the C++ Kokkos library, enabling the selection of fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. MATAR simplifies the transition from Fortran to C++ and Kokkos, making it easier to modernize legacy solid mechanics codes. We applied this approach to modernize a suite of constitutive models and to demonstrate substantial performance improvements across different computer architectures. This paper includes comparative performance studies using multi-core CPUs along with AMD and NVIDIA GPUs. Results are presented using a hypoelastic–plastic model, a crystal plasticity model, and the viscoplastic self-consistent generalized material model (VPSC-GMM). The results underscore the potential of using the MATAR library and modern computer architectures to accelerate solid mechanics simulations. Full article
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
Figure 1
<p>In this work, the MATAR library is used to modernize multiple Fortran material model implementations that are then coupled to the C++ Fierro mechanics code, which is also based on the MATAR library.</p>
Figure 2
<p>The runtime scaling results are presented for the 2D axisymmetric, metal rod impact test conducted on both multi-core Haswell CPUs and GPU architectures. The data are displayed as wall clock time in seconds against increasing mesh resolution. Even on 2D meshes, significant accelerations of the runtime, relative to the serial run, are possible on GPUs for larger mesh sizes.</p>
Figure 3
<p>The runtime scaling results are presented for the 3D metal rod impact test conducted on both multi-core Haswell CPUs and GPU architectures. The data are displayed as the wall clock time in seconds against increasing mesh resolution in 3D. The mesh resolution is the number of elements in the cross section of the rod by the number of elements in the vertical direction. Significant accelerations of the runtime, relative to the serial run, are possible on GPUs for larger mesh sizes.</p>
Figure 4
<p>Speedup comparisons for the 2D axisymmetric, metal rod impact test on a 40 × 416 2D cylindrical coordinate mesh using an equation of state with an isotropic hypoelastic–plastic model. Plot (<b>a</b>) presents the speedup compared to a serial run, and Plot (<b>b</b>) presents the speedup compared to a parallel 20-core run on the Haswell CPU. On a 2D mesh, GPUs give a significant boost to runtime performance over both a serial run and a multi-core CPU run.</p>
Figure 5
<p>Speedup comparisons for the 3D metal rod impact test on a mesh with 40 elements in the cross-section by 416 elements in the vertical direction using an equation of state with an isotropic hypoelastic–plastic model. Plot (<b>a</b>) presents the speedup compared to a serial run, and Plot (<b>b</b>) presents the speedup compared to a parallel 20-core run on the Haswell CPU. GPUs give a significant boost to runtime performance over both a serial run and a multi-core CPU run.</p>
Figure 6
<p>Von Mises-equivalent stress results in each element of the mesh for the 3D metal rod impact test using an elasto-viscoplastic single-crystal plasticity model.</p>
Figure 7
<p>Speedup comparisons to a serial run for the 3D metal rod impact test using an elasto-viscoplastic single-crystal plasticity model. The Power9 CPU was used for the serial, 8-core, and 16-core calculations.</p>
Figure 8
<p>The runtimes on a V100 GPU are shorter than 20 cores on a Power9 CPU only when there are many instances of the VPSC model.</p>
Figure 9
<p>The scale-bridging VPSC-GMM model was made performant and portable across CPU and GPU architectures using the MATAR library. The speedup results are for 30, 100, 200, and 500 grains on five different computer architectures.</p>
Figure 10
<p>A Taylor anvil impact test with a polycrystalline tantalum was simulated using the Fierro mechanics code with the VPSC-GMM. The rod deformation is a function of the texture of the material. The rod is colored by the von Mises stress [MPa].</p>
Figure 11
<p>Speedup comparisons are shown for the VPSC-GMM coupled to the Fierro mechanics code. (<b>a</b>) Using the linear extrapolation scheme in combination with the VPSC model generates thread divergence, which can greatly hinder fine-grained parallelism. (<b>b</b>) Not using the linear extrapolation scheme in VPSC-GMM yields a favorable speedup on GPUs because it eliminates thread divergence.</p>
10 pages, 860 KiB  
Article
Erythrocytic α-Synuclein in Parkinson’s Disease and Progressive Supranuclear Palsy—A Pilot Study
by Costanza Maria Cristiani, Luana Scaramuzzino, Elvira Immacolata Parrotta, Giovanni Cuda, Aldo Quattrone and Andrea Quattrone
Biomedicines 2024, 12(11), 2510; https://doi.org/10.3390/biomedicines12112510 - 2 Nov 2024
Viewed by 884
Abstract
Background/Objectives: The current research examines the accuracy of α-synuclein in RBCs as a diagnostic biomarker for PD and PSP, despite their distinct molecular etiologies. Methods: We used ELISA to measure total, oligomeric, and p129-α-synuclein levels in erythrocytes from 8 PSP patients, 19 PD patients, and 18 healthy controls (HCs). The classification performances of RBC α-synuclein levels were investigated by receiver operating characteristic (ROC) curve. We also evaluated a possible correlation between RBC α-synuclein level and the biological and clinical features of our cohorts. Results: RBC total α-synuclein was higher in PSP patients compared to both PD patients and HCs, achieving good classification performance (AUC: 0.853) in distinguishing PSP patients from PD patients, with a sensitivity of 100% and a specificity of 70.6%; moreover, the levels of this biomarker positively correlated with disease severity in the PSP group. Regarding oligomeric α-synuclein and p129-α-synuclein, the latter was slightly increased in RBCs from PSP patients compared to HCs, but no correlations were detected. Conclusions: Although these findings need to be confirmed in larger studies, our pilot work suggests that RBC total α-synuclein may represent a potential molecular biomarker for the differential diagnosis and clinical staging of PSP. Full article
Figure 1
<p>Erythrocytic concentration, corrected for Hb content, of total (<b>A</b>), oligomeric (<b>B</b>), and p129-α-synuclein (<b>C</b>) in PSP patients (<span class="html-italic">n</span> = 8), PD patients (<span class="html-italic">n</span> = 19), and HC (<span class="html-italic">n</span> = 18). Data are summarized as box plots, in which the lower, upper, and middle lines of boxes represent the 25th percentile, 75th percentile, and median, respectively, while limits of vertical lines indicate ranges. Shown <span class="html-italic">p</span>-values were obtained by ANCOVA with age and sex as covariates followed by Tukey’s LSD post hoc test. o-α-synuclein = oligomeric α-synuclein; PSP = progressive supranuclear palsy; PD = Parkinson’s disease; HC = healthy control.</p>
Figure 2
<p>Performance of erythrocytic total α-synuclein/Hb in differentiating PSP from PD patients.</p>
65 pages, 2635 KiB  
Tutorial
Understanding the Flows of Signals and Gradients: A Tutorial on Algorithms Needed to Implement a Deep Neural Network from Scratch
by Przemysław Klęsk
Appl. Sci. 2024, 14(21), 9972; https://doi.org/10.3390/app14219972 - 31 Oct 2024
Viewed by 812
Abstract
Theano, TensorFlow, Keras, Torch, PyTorch, and other software frameworks have remarkably stimulated the popularity of deep learning (DL). Apart from all the good they achieve, the danger of such frameworks is that they unintentionally spur a black-box attitude. Some practitioners play around with building blocks offered by frameworks and rely on them, having a superficial understanding of the internal mechanics. This paper constitutes a concise tutorial that elucidates the flows of signals and gradients in deep neural networks, enabling readers to successfully implement a deep network from scratch. By “from scratch”, we mean with access to a programming language and numerical libraries but without any components that hide DL computations underneath. To achieve this goal, the following five topics need to be well understood: (1) automatic differentiation, (2) the initialization of weights, (3) learning algorithms, (4) regularization, and (5) the organization of computations. We cover all of these topics in the paper. From a tutorial perspective, the key contributions include the following: (a) proposition of R and S operators for tensors—reshape and stack, respectively—that facilitate algebraic notation of computations involved in convolutional, pooling, and flattening layers; (b) a Python project named hmdl (“home-made deep learning”); and (c) consistent notation across all mathematical contexts involved. The hmdl project serves as a practical example of implementation and a reference. It was built using NumPy and Numba modules with JIT and CUDA amenities applied. In the experimental section, we compare hmdl implementation to Keras (backed with TensorFlow). Finally, we point out the consistency of the two in terms of convergence and accuracy, and we observe the superiority of the latter in terms of efficiency. Full article
(This article belongs to the Special Issue Advanced Digital Signal Processing and Its Applications)
Figure 1
<p>Milestones in the history of neural networks and deep learning [<a href="#B3-applsci-14-09972" class="html-bibr">3</a>,<a href="#B4-applsci-14-09972" class="html-bibr">4</a>,<a href="#B5-applsci-14-09972" class="html-bibr">5</a>,<a href="#B6-applsci-14-09972" class="html-bibr">6</a>,<a href="#B7-applsci-14-09972" class="html-bibr">7</a>,<a href="#B8-applsci-14-09972" class="html-bibr">8</a>,<a href="#B9-applsci-14-09972" class="html-bibr">9</a>,<a href="#B10-applsci-14-09972" class="html-bibr">10</a>,<a href="#B11-applsci-14-09972" class="html-bibr">11</a>,<a href="#B12-applsci-14-09972" class="html-bibr">12</a>,<a href="#B13-applsci-14-09972" class="html-bibr">13</a>,<a href="#B14-applsci-14-09972" class="html-bibr">14</a>,<a href="#B15-applsci-14-09972" class="html-bibr">15</a>,<a href="#B16-applsci-14-09972" class="html-bibr">16</a>,<a href="#B17-applsci-14-09972" class="html-bibr">17</a>,<a href="#B18-applsci-14-09972" class="html-bibr">18</a>,<a href="#B19-applsci-14-09972" class="html-bibr">19</a>,<a href="#B20-applsci-14-09972" class="html-bibr">20</a>,<a href="#B21-applsci-14-09972" class="html-bibr">21</a>,<a href="#B22-applsci-14-09972" class="html-bibr">22</a>,<a href="#B23-applsci-14-09972" class="html-bibr">23</a>,<a href="#B24-applsci-14-09972" class="html-bibr">24</a>,<a href="#B25-applsci-14-09972" class="html-bibr">25</a>,<a href="#B26-applsci-14-09972" class="html-bibr">26</a>,<a href="#B27-applsci-14-09972" class="html-bibr">27</a>,<a href="#B28-applsci-14-09972" class="html-bibr">28</a>,<a href="#B29-applsci-14-09972" class="html-bibr">29</a>,<a href="#B30-applsci-14-09972" class="html-bibr">30</a>,<a href="#B31-applsci-14-09972" class="html-bibr">31</a>,<a href="#B32-applsci-14-09972" class="html-bibr">32</a>,<a href="#B33-applsci-14-09972" class="html-bibr">33</a>,<a href="#B34-applsci-14-09972" class="html-bibr">34</a>,<a href="#B35-applsci-14-09972" class="html-bibr">35</a>].</p>
Figure 2
<p>Example medium-sized deep network built with popular layer types: convolutional, max pooling, dropout, flattening, and dense (prepared for the CIFAR-10 data set; see experiment with ID 3392187021 in <a href="#sec8-applsci-14-09972" class="html-sec">Section 8</a>).</p>
Figure 3
<p>Example of a simple neural network for 10-class classification.</p>
Figure 4
<p>Illustration of forward and backward computations at the junction of two convolutional layers <math display="inline"><semantics> <mi>α</mi> </semantics></math> and <math display="inline"><semantics> <mi>β</mi> </semantics></math> for the network structure from <a href="#applsci-14-09972-f003" class="html-fig">Figure 3</a>.</p>
Figure 5
<p>Forward computations of max and average pooling for <math display="inline"><semantics> <mrow> <mi>S</mi> <mo>=</mo> <mn>4</mn> </mrow> </semantics></math>.</p>
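The forward pooling computation named in this caption can be illustrated with a few lines of NumPy. This is a minimal sketch, not the hmdl implementation: it assumes that S denotes the side length of non-overlapping square pooling windows, it handles a single 2D channel only, and the function name `pool2d` is hypothetical.

```python
import numpy as np


def pool2d(x, size, mode="max"):
    """Non-overlapping pooling of a 2D array with square windows of side `size`."""
    height, width = x.shape
    assert height % size == 0 and width % size == 0, "shape must be divisible by size"
    # View the image as a grid of (size x size) tiles, then reduce each tile.
    tiles = x.reshape(height // size, size, width // size, size)
    if mode == "max":
        return tiles.max(axis=(1, 3))
    return tiles.mean(axis=(1, 3))


x = np.arange(64, dtype=float).reshape(8, 8)   # x[i, j] = 8 * i + j
pooled_max = pool2d(x, size=4, mode="max")
pooled_avg = pool2d(x, size=4, mode="avg")
print(pooled_max)
print(pooled_avg)
```

For the 8 × 8 ramp input, each output entry summarizes one 4 × 4 tile, so the result is a 2 × 2 array in both modes.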
Figure 6
<p>Illustration of backward computations for flattening and dropout layers. High dropout rate <math display="inline"><semantics> <mrow> <msup> <mi>r</mi> <mo>*</mo> </msup> <mo>=</mo> <mn>0.875</mn> </mrow> </semantics></math> chosen for readability of surviving connections.</p>
Figure 7
<p>Illustration of two stages of backward computations for a dense layer using a softmax activation function.</p>
Figure 8
<p>Forward pass leading to exploding signals: a fake network consisting of 100 dense layers with 512 neurons each (no activations) and initial weights drawn from a standard normal distribution.</p>
Figure 9
<p>Fake forward pass leading to vanishing signals due to initial weights drawn from a normal distribution with standard deviation scaled by <math display="inline"><semantics> <msup> <mn>10</mn> <mrow> <mo>−</mo> <mn>2</mn> </mrow> </msup> </semantics></math>, regardless of <span class="html-italic">N</span>.</p>
Figure 10
<p>Fake forward pass leading to a stable numerical behavior due to initial weights drawn from a properly scaled normal distribution with standard deviation <math display="inline"><semantics> <mrow> <mn>1</mn> <mo>/</mo> <msqrt> <mi>N</mi> </msqrt> </mrow> </semantics></math>.</p>
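The three behaviors described in the captions for Figures 8 to 10 (exploding, vanishing, and stable signals) can be reproduced with a short NumPy experiment. This is a sketch under the setup stated in those captions (100 dense layers, 512 neurons each, no activations); the function name `forward_std`, the random seed, and the use of the signal's standard deviation as the tracked quantity are assumptions made for illustration.

```python
import numpy as np


def forward_std(weight_std, n_layers=100, n=512, seed=0):
    """Push a random signal through linear layers and record its std after each one."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    stds = []
    for _ in range(n_layers):
        w = rng.standard_normal((n, n)) * weight_std   # fresh weights for every layer
        x = w @ x                                       # dense layer without activation
        stds.append(float(x.std()))
    return stds


n = 512
exploding = forward_std(weight_std=1.0, n=n)             # standard normal weights
vanishing = forward_std(weight_std=1e-2, n=n)            # fixed scale, ignores N
stable = forward_std(weight_std=1.0 / np.sqrt(n), n=n)   # std = 1 / sqrt(N)

print(f"std after last layer: {exploding[-1]:.3e}, {vanishing[-1]:.3e}, {stable[-1]:.3e}")
```

Each layer multiplies the signal's standard deviation by roughly sqrt(N) times the weight standard deviation, so only the 1/sqrt(N) scaling keeps that factor close to 1 across all 100 layers.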
Figure 11
<p>Training costs (losses) of Adam and other SGD algorithms on MNIST (<b>a</b>) and CIFAR-10 (<b>b</b>) data sets. Source: (Kingma and Ba, 2014) [<a href="#B25-applsci-14-09972" class="html-bibr">25</a>].</p>
Figure 12
<p>UML class diagram of <span class="html-italic">hmdl</span> project (<a href="https://github.com/pklesk/hmdl/blob/main/uml/classes.pdf" target="_blank">https://github.com/pklesk/hmdl/blob/main/uml/classes.pdf</a>) (accessed on 17 October 2024).</p>
Figure 13
<p>Functions executing forward and backward computations from the abstract class <tt>Layer</tt>.</p>
Figure 14
<p>The core of the <tt>fit</tt> function of <tt>SequentialClassifier</tt>. The main training loop (over epochs and batches) executes forward and backward computations through layers for each batch.</p>
Figure 15
<p>Forward and backward passes for <tt>SequentialClassifier</tt>.</p>
Figure 16
<p>Automatic differentiation for class <tt>Flatten</tt>.</p>
Figure 17
<p>Automatic differentiation for class <tt>Dropout</tt>.</p>
Figure 18
<p>Automatic differentiation for class <tt>Dense</tt>.</p>
Figure 19
<p>Forward computations for class <tt>MaxPool2D</tt>: the simplest <tt>numpy</tt>-based variant.</p>
Figure 20
<p>Backward computations for class <tt>MaxPool2D</tt>: the simplest <tt>numpy</tt>-based variant.</p>
Figure 21
<p>Backward computations of gradient for class <tt>Conv2D</tt>: the simplest <tt>numpy</tt>-based variant (GEMM).</p>
Figure 22
<p>Backward computations of error propagation for class <tt>Conv2D</tt>: the simplest <tt>numpy</tt>-based variant (GEMM).</p>
Figure 23
<p>Execution times of convolutional layers (64 input and 64 output channels; batch size: 32) for different implementation variants, filter sizes, and image sizes. For details of the hardware and software environment see page 42.</p>
Figure 24
<p>Functions for Glorot initialization of weights in the hmdl project.</p>
Figure 25
<p>Functions for He initialization of weights in the hmdl project.</p>
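The two initialization schemes named in the captions for Figures 24 and 25 can be sketched as follows. This is not the hmdl code: the function names are hypothetical, and the choice of the uniform variant for Glorot and the normal variant for He is an assumption, since the captions do not say which variants the project implements. The formulas themselves are the commonly cited ones: a uniform limit of sqrt(6 / (fan_in + fan_out)) for Glorot and a standard deviation of sqrt(2 / fan_in) for He.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std


rng = np.random.default_rng(0)
w_glorot = glorot_uniform(512, 256, rng)
w_he = he_normal(512, 256, rng)
print(w_glorot.shape, float(np.abs(w_glorot).max()), float(w_he.std()))
```

Both schemes scale the weights by the layer width so that signal magnitudes stay roughly constant from layer to layer; He initialization uses the larger factor of 2 to compensate for ReLU zeroing about half of the activations.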
Figure 26
<p>Implementation of Adam (coupled with regularization) in the hmdl project.</p>
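For orientation, a single Adam update can be written in a few lines of NumPy. This is a generic sketch following the update rule of Kingma and Ba, not the hmdl implementation: the function name `adam_step` is hypothetical, and coupling regularization by adding an L2 term `l2 * w` to the gradient is an assumption about what "coupled with regularization" means.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.0):
    """One Adam update at step t (counted from 1); returns new weights and moments."""
    g = grad + l2 * w                        # assumed coupling: L2 penalty gradient
    m = beta1 * m + (1.0 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment (uncentered variance) estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias corrections for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.array([1.0, -2.0])
grad = np.array([0.5, -0.5])
w1, m1, v1 = adam_step(w, grad, m=np.zeros(2), v=np.zeros(2), t=1)
print(w1)
```

On the very first step, the bias-corrected ratio reduces to grad / |grad|, so each weight moves by almost exactly the learning rate in the direction opposite to its gradient.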
Figure 27
<p>Sample images from data sets applied in experiments.</p>
Figure 28
<p>Main settings for an experiment in script <tt>experimenter.py</tt>.</p>
Figure 29
<p>Description of a network structure in the hmdl project, declared in <tt>experimenter.py</tt>.</p>
Figure 30
<p>Choice of data set, randomization seed, and other settings in the script <tt>experimenter.py</tt>.</p>
Figure A1
<p>Forward computations for class <tt>MaxPool2D</tt>: variant based on <tt>numba</tt>’s just-in-time compilation.</p>
Figure A2
<p>Backward computations for class <tt>MaxPool2D</tt>: variant based on <tt>numba</tt>’s just-in-time compilation.</p>
Figure A3
<p>Forward computations for class <tt>MaxPool2D</tt>: variant implemented for GPU computations using <tt>numba.cuda</tt> module.</p>
Figure A4
<p>CUDA kernel function <tt>do_forward_numba_cuda_direct_job</tt> for forward max pooling computations invoked via function <tt>do_forward_numba_cuda_direct</tt>.</p>
Figure A5
<p>Backward computations for class <tt>MaxPool2D</tt>: variant implemented for GPU computations using <tt>numba.cuda</tt> module.</p>
Figure A6
<p>CUDA kernel function <tt>do_backward_numba_cuda_direct_job</tt> for backward max pooling computations invoked via function <tt>do_backward_numba_cuda_direct</tt>.</p>
Figure A7
<p>Backward computations of gradient for class <tt>Conv2D</tt>: variant based on <tt>numba</tt>’s just-in-time compilation.</p>
Figure A8
<p>Backward computations of gradient for class <tt>Conv2D</tt>: variant based directly on definition, using GPU computations and <tt>numba.cuda</tt>.</p>
Figure A9
<p>CUDA kernel function <tt>do_backward_numba_cuda_direct_job</tt> for backward convolutional computations invoked via function <tt>do_backward_numba_cuda_direct</tt>.</p>
Figure A10
<p>Backward computations of gradient for class <tt>Conv2D</tt>: variant based on tiles, using GPU and <tt>numba.cuda</tt>.</p>
Figure A11
<p>CUDA kernel function <tt>do_backward_numba_cuda_tiles_job</tt> for backward convolutional computations invoked via function <tt>do_backward_numba_cuda_tiles</tt>.</p>
13 pages, 735 KiB  
Article
Proximity Elongation Assay and ELISA for the Identification of Serum Diagnostic Biomarkers in Parkinson’s Disease and Progressive Supranuclear Palsy
by Costanza Maria Cristiani, Camilla Calomino, Luana Scaramuzzino, Maria Stella Murfuni, Elvira Immacolata Parrotta, Maria Giovanna Bianco, Giovanni Cuda, Aldo Quattrone and Andrea Quattrone
Int. J. Mol. Sci. 2024, 25(21), 11663; https://doi.org/10.3390/ijms252111663 - 30 Oct 2024
Viewed by 1021
Abstract
Clinical differentiation of progressive supranuclear palsy (PSP) from Parkinson’s disease (PD) is challenging due to overlapping phenotypes and late onset of PSP-specific symptoms, highlighting the need for easily assessable biomarkers. We used proximity elongation assay (PEA) to analyze 460 proteins in serum samples from 46 PD, 30 PSP patients, and 24 healthy controls. ANCOVA was used to identify the most promising proteins and machine learning (ML) XGBoost and random forest algorithms to assess their classification performance. Promising proteins were also quantified by ELISA. Moreover, correlations between serum biomarkers and biological and clinical features were investigated. We identified five proteins (TFF3, CPB1, OPG, CNTN1, TIMP4) showing different levels between PSP and PD, which achieved good performance (AUC: 0.892) when combined by ML. On the other hand, when the three most significant biomarkers (TFF3, CPB1 and OPG) were analyzed by ELISA, there was no difference between groups. Serum levels of TFF3 positively correlated with age in all subject groups, while for OPG and CPB1 such a correlation occurred in PSP patients only. Moreover, CPB1 positively correlated with disease severity in PD, while no correlations were observed in the PSP group. Overall, we identified CPB1 correlating with PD severity, which may support clinical staging of PD. In addition, our results showing discrepancy between PEA and ELISA technology suggest that caution should be used when translating proteomic findings into clinical practice. Full article
Figure 1
<p>In Panel (<b>A</b>), proteins on the right of the <span class="html-italic">x</span>-axis have a higher concentration in PSP while proteins on the left have a higher concentration in PD. In Panel (<b>B</b>), proteins on the right of the <span class="html-italic">x</span>-axis have a higher concentration in PD while proteins on the left have a higher concentration in HC. In Panel (<b>C</b>), proteins on the right of the <span class="html-italic">x</span>-axis have a higher concentration in HC while proteins on the left have a higher concentration in PSP. Bonferroni’s correction method was employed to define the threshold for <span class="html-italic">p</span>-value significance. PD = Parkinson’s disease; PSP = progressive supranuclear palsy; HC = healthy control.</p>
Figure 2
<p>Serum concentration of TFF3 (<b>A</b>), CPB1 (<b>B</b>), and OPG (<b>C</b>) in PD (n = 46), PSP (n = 30) and HC (n = 24), as measured by ELISA. In each box plot, the 25th percentile, 75th percentile, and median of data are depicted as lower, upper, and middle lines, respectively, while bars on vertical lines indicate ranges. The Kruskal–Wallis test was used to calculate shown p-values. TFF3 = trefoil factor 3; CPB1 = carboxypeptidase B1; OPG = osteoprotegerin; PD = Parkinson’s disease; PSP = progressive supranuclear palsy; HC = healthy controls.</p>
24 pages, 830 KiB  
Article
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Calvin Roth, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Marko Knezevic, Gavin Whetstone, Zachary Baker and Robert Robey
Information 2024, 15(11), 673; https://doi.org/10.3390/info15110673 - 28 Oct 2024
Cited by 1 | Viewed by 2353
Abstract
This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for programming productivity, (2) usable data structures for data-oriented programming (DOP) for performance, and (3) a simple interface to the open-source C++ Kokkos library for portability and memory management across CPUs and GPUs. The portability across architectures with a single code implementation is achieved by automatically switching between diverse fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads) at compile time. The MATAR library solves many longstanding challenges associated with easily writing software that can run in parallel on any computer architecture. This work benefits projects seeking to write new C++ codes while also addressing the challenges of quickly making existing Fortran codes performant and portable over modern computer architectures with minimal syntactical changes from Fortran to C++. We demonstrate the feasibility of readily writing new C++ codes and modernizing existing codes with MATAR to be performant, parallel, and portable across diverse computer architectures.
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
Figure 1
<p>The C++ MATAR library builds on the Kokkos library to enable diverse software to run on multi-core CPUs and GPUs with a single implementation and enables a common approach to modernize existing Fortran software and to write new C++ software.</p>
Figure 2
<p>A range of dense (<b>left</b> chart) and sparse (<b>right</b> chart) data types in MATAR, designed for performance portability across CPUs and GPUs, are shown above. MATAR provides more data types than are shown here; for instance, there are data types that run solely on CPUs, as well as dynamically resizable 1D array and 1D matrix types.</p>
Figure 3
<p>On the left, a ring lattice with seven nodes is shown, where each node is connected to its single nearest neighbor on either side. On the right, three edges are replaced with random edges: (1,2) with (1,5), (2,3) with (2,6), and (7,1) with (7,5). This is a simple demonstration of the Watts–Strogatz random graph process.</p>
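The rewiring described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MATAR implementation: the function names `ring_lattice`, `replace_edge`, and `rewire` are hypothetical, and the random pass is a simplified Watts–Strogatz step that keeps one endpoint of each edge fixed.

```python
import random


def ring_lattice(n, k=1):
    """Edges of a ring of n nodes (labeled 1..n), each joined to its
    k nearest neighbors on either side. Edges are unordered pairs."""
    edges = set()
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            edges.add(frozenset((i + 1, j + 1)))
    return edges


def replace_edge(edges, old, new):
    """Return a copy of the edge set with one edge swapped for another."""
    updated = set(edges)
    updated.remove(frozenset(old))
    updated.add(frozenset(new))
    return updated


def rewire(edges, n, p, rng):
    """Simplified Watts-Strogatz pass (an assumption, not the paper's code):
    with probability p, move the far end of each original edge to a random
    node, avoiding self-loops and duplicate edges."""
    result = set(edges)
    for u, v in sorted(tuple(sorted(e)) for e in edges):
        if rng.random() < p:
            candidates = [w for w in range(1, n + 1)
                          if w != u and frozenset((u, w)) not in result]
            if candidates:
                result.remove(frozenset((u, v)))
                result.add(frozenset((u, rng.choice(candidates))))
    return result


# Reproduce the three replacements listed in the caption.
lattice = ring_lattice(7)
figure_graph = lattice
for old, new in [((1, 2), (1, 5)), ((2, 3), (2, 6)), ((7, 1), (7, 5))]:
    figure_graph = replace_edge(figure_graph, old, new)

# A random pass with an arbitrary seed and rewire probability.
random_graph = rewire(lattice, 7, 0.5, random.Random(0))
```

Rewiring moves edges rather than adding them, so the edge count stays at seven in both `figure_graph` and `random_graph`.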
Figure 4
<p>Comparison of runtimes, as a function of the nodes in the WS graph, on diverse GPUs and on an AMD EPYC 7502 multi-core CPU. A serial calculation (blue) is presented for comparison purposes. All GPUs deliver favorable acceleration of this test case over using 16 cores on the CPU.</p>
Figure 5
<p>The strong scaling on the WS graph with 4000 nodes is shown. This scaling test was run with the number of cores set to powers of two, 2<span class="html-italic"><sup>n</sup></span>, from 1 to 32. Displayed is the log–log plot of number of cores versus runtime. Near perfect scaling is observed.</p>
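To make "near perfect scaling" concrete: in a strong-scaling test the problem size is fixed, so ideal runtime halves each time the core count doubles, and the log–log plot of runtime against cores is a straight line of slope −1. The sketch below computes speed-up, parallel efficiency, and that slope. The helper names are hypothetical and the runtimes are synthetic ideal values chosen for illustration; they are not measurements from the paper.

```python
import math


def speedup(t_serial, t_parallel):
    """Ratio of serial runtime to parallel runtime."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, cores):
    """Speed-up divided by core count; 1.0 means perfect strong scaling."""
    return speedup(t_serial, t_parallel) / cores


def loglog_slope(cores, runtimes):
    """Least-squares slope of log(runtime) against log(cores).
    A slope of -1 corresponds to perfect strong scaling."""
    xs = [math.log(c) for c in cores]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic, perfectly scaling runtimes (illustrative only, not measured data).
cores = [1, 2, 4, 8, 16, 32]
runtimes = [64.0 / c for c in cores]

slope = loglog_slope(cores, runtimes)
efficiency_at_32 = parallel_efficiency(runtimes[0], runtimes[-1], 32)
```

With measured runtimes, a slope somewhat shallower than −1 and an efficiency below 1.0 would indicate overheads such as synchronization or memory bandwidth limits.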
Figure 6
<p>A comparison between the Python networkx package and the C++ implementation with MATAR is shown for the average distance between nodes as a function of the rewire probability. The same trend is observed between the Python and C++ implementations, which is a desired outcome. Both codes yielded the expected, correct behavior for this network test case. An exact match between the two implementations is not expected since the results are probabilistic.</p>
Figure 7
<p>Compared to the serial Python code, the MATAR implementation is serially 60× faster, is about 700× faster using 16 threads on an AMD EPYC 7502 multi-core CPU, is about 2050× faster on a Titan GPU, and is 2460× faster on a V100 GPU. The WS graph speed-up results reported here correspond to a simulation using 4000 nodes.</p>
Figure 8
<p>Speed-up, relative to serial execution, of the forward propagation through an artificial neural network using 8 cores and 16 cores on a Haswell CPU with OpenMP, an Nvidia Tesla V100 GPU, an Nvidia A100 GPU, and an AMD MI50 GPU. The GPU architectures deliver favorable accelerations across widely varying sizes of the vector-array multiplications.</p>
Figure 9
<p>Diagram of the mid-ocean ridge showing oceanic lithosphere increasing in thickness as it increases in distance from the ridge. Gray is the oceanic lithosphere, and red is the asthenosphere (mantle).</p>
Figure 10
<p>Scaling plot for the half-space cooling test problem showing walltime in seconds for serial, 8 cores, and 16 cores on a Haswell CPU; and for Nvidia Tesla V100 GPU, Nvidia A100 GPU, Quadro RTX GPU, and AMD MI50 GPU architectures with increasing problem size (e.g., billions of years). GPU architectures significantly accelerate the calculation compared to the Haswell CPU.</p>
Figure 11
<p>Speed-up with mesh refinement provided by MATAR with CUDA on a V100 GPU over the serial Python code. Accelerations of 253×, 69×, and 1756× are observed.</p>
23 pages, 16714 KiB  
Article
A Geographically Weighted Regression–Compute Unified Device Architecture Approach to Explore the Spatial Agglomeration and Heterogeneity in Arable Land Consumption in Southwest China
by Chang Liu, Tingting Xu, Letao Han, Sapu Du and Aohua Tian
Agriculture 2024, 14(10), 1675; https://doi.org/10.3390/agriculture14101675 - 25 Sep 2024
Viewed by 884
Abstract
Arable land loss has become a critical issue in China because of rapid urbanization, industrial expansion, and unsustainable agricultural practices. While previous studies have explored the factors contributing to this loss, they often fall short in addressing the challenges of spatial heterogeneity and large-scale dataset analysis. This research introduces an innovative approach to geographically weighted regression (GWR) for assessing arable land loss in China, effectively addressing these challenges. Focusing on Chongqing, Guizhou, and Yunnan Provinces over the past two decades, it examines spatial autocorrelation with R² values exceeding 0.6 and residuals. Eight factors, including environmental elements (rain, evaporation, slope, digital elevation model) and human activities (distance to city, distance to roads, population, GDP), were analyzed. By visualizing and analyzing R² spatial patterns, the results reveal a clear spatial agglomeration distribution, primarily in urban areas with industries, highly urbanized cities, and flat terrains near rivers, influenced by GDP, population, rain, and slope. The novelty of this study is that it significantly enhances GWR computational capabilities for handling extensive datasets by utilizing Compute Unified Device Architecture (CUDA) on a high-performance GPU cloud server. Simultaneously, it conducts comprehensive analyses of the GWR model’s local results through visualization and spatial autocorrelation tools, enhancing the interpretability of the GWR model. Through spatial clustering analysis of local results, this study enables targeted exploration of factors influencing arable land changes in various temporal and spatial dimensions while also evaluating the reliability of the model results.
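For readers unfamiliar with GWR, the core idea is that a separate weighted least-squares fit is computed at every location, with nearby observations weighted more heavily than distant ones. The following is a minimal single-predictor sketch under stated assumptions, not the study's implementation: the Gaussian kernel, the fixed bandwidth, the weighted form of the local R², and all function names are illustrative choices, and the data are synthetic.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian distance-decay kernel (one common choice; an assumption here)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_fit(target, points, xs, ys, bandwidth):
    """Weighted least squares y ~ b0 + b1 * x centered on one location.
    Returns (b0, b1, local_r2), where local_r2 is a weighted R-squared."""
    w = [gaussian_weight(math.dist(target, p), bandwidth) for p in points]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum(wi * (y - (b0 + b1 * x)) ** 2 for wi, x, y in zip(w, xs, ys))
    ss_tot = sum(wi * (y - mean_y) ** 2 for wi, y in zip(w, ys))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return b0, b1, r2


def gwr(points, xs, ys, bandwidth):
    """One independent local fit per location."""
    return [local_fit(p, points, xs, ys, bandwidth) for p in points]


# Synthetic data: ten sites on a west-to-east line whose true slope
# grows with the easting, so the relationship is spatially heterogeneous.
sites = [(float(i), 0.0) for i in range(10)]
predictor = [float(i % 2) for i in range(10)]
response = [(1.0 + i) * predictor[i] for i in range(10)]

fits = gwr(sites, predictor, response, bandwidth=2.0)
west_slope, east_slope = fits[0][1], fits[-1][1]
```

Because each call to `local_fit` depends only on its own target location, the fits are independent of one another. That independence is what makes the method a natural candidate for GPU parallelization, for example one thread per location, although how the study maps the work onto CUDA is not stated in the abstract.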
(This article belongs to the Section Agricultural Economics, Policies and Rural Management)
Figure 1
<p>Location of the study area—Chongqing, Guizhou, and Yunnan in China—and its elevation range.</p>
Figure 2
<p>The steps to implement GWR analysis with the grid method: (<b>a</b>) create the fishnet, and (<b>b</b>) use the fishnet to show the arable land loss rate. The color shade represents the loss rate of arable land: the darker the color, the higher the rate of arable land loss.</p>
Figure 3
<p>Steps in the analysis of arable land loss rates using a fishnet.</p>
Figure 4
<p>Arable land loss during 2000–2010 (<b>a</b>) and 2010–2020 (<b>b</b>) in the three provinces considered in this study.</p>
Figure 5
<p>Fishnet grid with arable land loss rates during 2000–2010 and 2010–2020.</p>
Figure 6
<p>Spatial distribution of R<sup>2</sup> during 2000–2010 and 2010–2020 in the three provinces considered in this study. The color gradient represents the distribution of adjusted R<sup>2</sup> values across intervals. Deeper colors signify higher R<sup>2</sup> values, indicating stronger explanatory power of the variables for farmland degradation, while lighter colors indicate lower R<sup>2</sup> values, reflecting weaker explanatory power. The subgraph shows the cluster area with high adjusted R<sup>2</sup> values.</p>
Figure 7
<p>Spatial distribution of residuals during 2000–2010 and 2010–2020 in the three provinces considered in this study. The red boxes mark areas with high residuals across the three regions over the different time periods.</p>