Search Results (594)

Search Parameters:
Keywords = graphical processing units

32 pages, 2960 KiB  
Article
Comparing Application-Level Hardening Techniques for Neural Networks on GPUs
by Giuseppe Esposito, Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia and Matteo Sonza Reorda
Electronics 2025, 14(5), 1042; https://doi.org/10.3390/electronics14051042 - 6 Mar 2025
Viewed by 79
Abstract
Neural networks (NNs) are essential in advancing modern safety-critical systems. Lightweight NN architectures are deployed on resource-constrained devices using hardware accelerators such as Graphics Processing Units (GPUs) for fast responses. However, the latest semiconductor technologies may be affected by physical faults that can jeopardize NN computations, making fault mitigation crucial for safety-critical domains. Recent studies propose software-based Hardening Techniques (HTs) to address these faults. However, the proposed fault countermeasures are evaluated with different hardware-agnostic error models and different test benches, neglecting the effort required for their implementation. Comparing application-level HTs across studies is therefore challenging, leaving it unclear (i) how effective they are against hardware-aware error models on arbitrary NNs and (ii) which HTs provide the best trade-off between reliability enhancement and implementation cost. In this study, application-level HTs are evaluated homogeneously and independently through a feasibility-of-implementation study and a reliability assessment under two hardware-aware error models: (i) weight single bit-flips and (ii) neuron bit error rate. Our results indicate that not all HTs suit every NN architecture, and their effectiveness varies depending on the evaluated error model. Techniques based on range restriction of the activation function consistently outperform the others, achieving up to 58.23% greater mitigation effectiveness while keeping the inference-time overhead low and requiring contained implementation effort. Full article
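To make the range-restriction idea concrete, here is a minimal PyTorch sketch: faults such as bit-flips tend to produce abnormally large activations, so clamping each activation to a profiled bound masks most of the corruption. The class name, the fixed bound of 6.0, and the ReLU-swapping helper are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RangeRestrictedReLU(nn.Module):
    """Clip activations to a profiled per-layer bound so that fault-induced
    outliers cannot propagate (illustrative sketch only)."""
    def __init__(self, upper_bound: float = 6.0):
        super().__init__()
        self.upper_bound = upper_bound

    def forward(self, x):
        # A bit-flip in a weight or feature map typically produces huge values;
        # clamping to the expected range suppresses the error.
        return torch.clamp(x, min=0.0, max=self.upper_bound)

def harden_model(model: nn.Module, bound: float = 6.0) -> nn.Module:
    """Swap every ReLU in a model for the range-restricted variant."""
    for name, child in list(model.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(model, name, RangeRestrictedReLU(bound))
        else:
            harden_model(child, bound)
    return model
```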
Show Figures

Figure 1. Basic convolutional block.
Figure 2. Hierarchical GPU organization [46].
Figure 3. Effect of the fault propagation up to the application level.
Figure 4. Hardening technique evaluation procedure.
Figure 5. Fault class distribution. In this figure, the HT names are abbreviated as follows: Baseline (BL), Adaptive Clipper (AC), Swap ReLU6 (SR) and median filter (MF).
Figure 6. Accuracy degradation for HTs in front of errors injected in weights per bit location.
Figure 7. Accuracy degradation for HTs when injecting errors in multiple FM bit locations.
30 pages, 1684 KiB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Viewed by 206
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation. Full article
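The abstract mentions ordering similar ERIs so that quartets of the same shell class can be dispatched together on the GPU (the paper uses a 64-bit sort key, Figure 11 below). A small Python sketch of that packing idea follows; the four 16-bit fields are an assumed layout for illustration, not the paper's exact key format.

```python
def eri_sort_key(a: int, b: int, c: int, d: int) -> int:
    """Pack the four shell indices of an ERI quartet (ab|cd) into one 64-bit
    integer so that quartets of the same class sort next to each other.
    Field widths (16 bits each) are an illustrative assumption."""
    for value in (a, b, c, d):
        assert 0 <= value < (1 << 16)
    return (a << 48) | (b << 32) | (c << 16) | d

# Sorting by the key groups similar quartets for batched GPU dispatch.
quartets = [(1, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 0, 0)]
print(sorted(quartets, key=lambda q: eri_sort_key(*q)))
```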
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
Show Figures

Figure 1. Example of the configuration of two basis functions χ1 and χ2 (M = 2). (a) The quartet combinations, and (b) the symmetry-based combinations for Basis-ERIs when M = 2. By considering symmetrical relations, it becomes possible to reduce the number of Basis-ERIs, as shown by the upper triangular matrix in (b).
Figure 2. Example of the relation between the Basis-ERIs and the GTO-ERIs. The row (column) directions represent the bra (ket) Basis-ERIs. Each cell in the upper triangular matrix corresponds to a single Basis-ERI.
Figure 3. Basic idea behind the definition of shell-based ERIs. The term (ss| implies that a bra consists of two s-shells, and the term |sp) implies that a ket consists of one s-shell and one p-shell. The integral (ss|sp) consists of three GTO-ERIs: [ss|sp_x], [ss|sp_y], and [ss|sp_z].
Figure 4. Basic idea of the dependencies, denoted by arrows, behind computing the values of the corresponding recurrences R by using batch concepts when K = 4. The values required by the MD method are highlighted in red.
Figure 5. Basic idea of the computation of R values for each batch using triple-buffering of the shared memory.
Figure 6. Comparison of the required size of shared memory to store R values.
Figure 7. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BBM.
Figure 8. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BTM.
Figure 9. Parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in SBM.
Figure 10. Basic idea behind the parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in STM.
Figure 11. Schematic of the 64-bit key for sorting the Basis-ERIs.
22 pages, 24659 KiB  
Article
A Multi-Scale Fusion Deep Learning Approach for Wind Field Retrieval Based on Geostationary Satellite Imagery
by Wei Zhang, Yapeng Wu, Kunkun Fan, Xiaojiang Song, Renbo Pang and Boyu Guoan
Remote Sens. 2025, 17(4), 610; https://doi.org/10.3390/rs17040610 - 11 Feb 2025
Viewed by 414
Abstract
Wind field retrieval, a crucial component of weather forecasting, has been significantly enhanced by recent advances in deep learning. However, existing approaches that are primarily focused on wind speed retrieval are limited by their inability to achieve real-time, full-coverage retrievals at large scales. To address this problem, we propose a novel multi-scale fusion retrieval (MFR) method, leveraging geostationary observation satellites. At the mesoscale, MFR incorporates a cloud-to-wind transformer model, which employs local self-attention mechanisms to extract detailed wind field features. At large scales, MFR incorporates a multi-encoder coordinate U-net model, which incorporates multiple encoders and utilises coordinate information to fuse meso- to large-scale features, enabling accurate and regionally complete wind field retrievals, while reducing the computational resources required. The MFR method was validated using Level 1 data from the Himawari-8 satellite, covering a geographic range of 0–60°N and 100–160°E, at a resolution of 0.25°. Wind field retrieval was accomplished within seconds using a single graphics processing unit. The mean absolute error of wind speed obtained by the MFR was 0.97 m/s, surpassing the accuracy of the CFOSAT and HY-2B Level 2B wind field products. The mean absolute error for wind direction achieved by the MFR was 23.31°, outperforming CFOSAT Level 2B products and aligning closely with HY-2B Level 2B products. The MFR represents a pioneering approach for generating initial fields for large-scale grid forecasting models. Full article
(This article belongs to the Special Issue Image Processing from Aerial and Satellite Imagery)
Show Figures

Figure 1. Vector diagrams for (a) normal conditions, (b) Typhoon Hinnamnor, and (c) Typhoon Nanmadol in the study area. Arrow directions and lengths indicate wind direction and speed, respectively.
Figure 2. Multi-scale fusion retrieval architecture. Res, resolution.
Figure 3. Sliding window sampling method.
Figure 4. Structure of the C2W-Former model.
Figure 5. (a) The Swin Transformer block architecture; (b) W-MSA and SW-MSA, which are multi-head self-attention modules with regular and shifted windowing configurations, respectively. Blue and red boxes represent patches and windows, respectively.
Figure 6. Discontinuous and blurred boundaries in the preliminary UV.
Figure 7. Architecture of the Multi-encoder Coordinate U-net (M-CoordUnet) model. (a) Overall architecture. (b–d) Structural details of the encoder, centre, and decoder blocks, respectively.
Figure 8. Analysis of MFR and OSR in comparison to ERA5 for land and sea regions.
Figure 9. Scatter plots of each model versus ERA5 at 00:00 on 19 April 2021 (Super Typhoon Surigae). Closer proximity to the x = y line indicates better agreement between the two models. Warmer colours indicate higher frequency.
Figure 10. Analysis of ERA5, IFS, and MFR results in comparison to weather station data.
Figure 11. UV MAE statistics of MFR wind fields for land (green), sea (orange), and the total study area (blue) across different months and regions in the test data. Solid line indicates average error; shaded area indicates the 95% confidence interval.
Figure 12. Comparison of wind field characteristics among different models and data products during Super Typhoon Surigae (19 April 2021, 00:00 UTC).
Figure 13. Comparison of wind field characteristics among different models and data products during Super Typhoon Mindulle (28 September 2021, 12:00 UTC).
20 pages, 732 KiB  
Article
VCONV: A Convolutional Neural Network Accelerator for FPGAs
by Srikanth Neelam and A. Amalin Prince
Electronics 2025, 14(4), 657; https://doi.org/10.3390/electronics14040657 - 8 Feb 2025
Viewed by 355
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which undermines their advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP's performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% utilization of the MAC units with no idle time. We also synthesized the multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and a performance of 56 GOPs. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)
Show Figures

Figure 1. Typical architecture of a convolutional neural network layer.
Figure 2. CNN error rates vs. layers.
Figure 3. Comparison of costs and resources between different FPGAs.
Figure 4. VCONV IP: CNN accelerator architecture.
Figure 5. Input image arrangement in LB.
Figure 6. Input to multiple CEs at the same time with a multi-row line buffer.
Figure 7. Inactive counters for monitoring idle time in MACs.
Figure 8. VCONV engine integrated with DMA IP on SoC FPGA.
Figure 9. Simulation capture of the VCONV IP's functionality for AlexNet.
Figure 10. Implementing AlexNet on different FPGAs at 30 fps.
14 pages, 2560 KiB  
Article
Novel GPU-Based Method for the Generalized Maximum Flow Problem
by Delia Elena Spridon, Adrian Marius Deaconu and Javad Tayyebi
Computation 2025, 13(2), 40; https://doi.org/10.3390/computation13020040 - 5 Feb 2025
Viewed by 335
Abstract
This paper investigates the application of a minimum loss path finding algorithm to determine the maximum flow in generalized networks that are characterized by arc losses or gains. In these generalized network flow problems, each arc has not only a defined capacity but also a loss or gain factor, which must be taken into consideration when calculating the maximum achievable flow. This extension of the traditional maximum flow problem requires a more comprehensive approach, where the maximum amount of flow is determined by accounting for additional factors such as costs, varying arc capacities, and the specific loss or gain associated with each arc. This paper extends the classic Ford–Fulkerson algorithm, adapting it to iteratively identify source-to-sink (s − t) residual directed paths with minimum cumulative loss and generalized augmenting paths (GAPs), thus enabling the efficient computation of maximum flow in such complex networks. Moreover, to enhance the computational performance of the proposed algorithm, we conducted extensive studies on parallelization techniques using graphics processing units (GPUs). Significant improvements in the algorithm’s efficiency and scalability were achieved. The results demonstrate the potential of GPU-accelerated computations in handling real-world applications where generalized network flows with arc losses and gains are prevalent, such as in telecommunications, transportation, or logistics networks. Full article
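The path-selection step described above can be illustrated with a short Python sketch: finding the residual s-t path with the smallest cumulative loss is a shortest-path problem if each arc's gain factor g (assumed here to satisfy 0 < g ≤ 1 for lossy arcs) is turned into an additive weight -log(g). The graph encoding and function below are illustrative assumptions, not the paper's algorithm, which additionally handles generalized augmenting cycles.

```python
import heapq
import math

def min_loss_path(graph, source, sink):
    """Dijkstra over -log(gain) weights: returns the s-t path that preserves the
    most flow, or None if the sink is unreachable.
    graph: {u: [(v, residual_capacity, gain), ...]} with 0 < gain <= 1."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, cap, gain in graph.get(u, []):
            if cap <= 0 or gain <= 0:
                continue
            nd = d - math.log(gain)          # smaller total weight = less loss
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if sink not in dist:
        return None
    path, node = [sink], sink
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy network: arc (u, v, capacity, gain)
g = {"s": [("a", 10, 0.9), ("b", 5, 1.0)], "a": [("t", 8, 0.8)], "b": [("t", 5, 0.5)]}
print(min_loss_path(g, "s", "t"))   # ['s', 'a', 't'] — the least lossy route
```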
Show Figures

Figure 1. Example of a network with losses.
Figure 2. Example illustrating the determination of maximum flow in a network with losses. In order to highlight the progress of the algorithm, the s-t paths and cycle found by the algorithm are presented in red at each iteration.
Figure 3. Execution times for Algorithm 1 in dense networks.
Figure 4. Execution times for Algorithm 1 for different network densities.
35 pages, 2222 KiB  
Article
Multithreaded and GPU-Based Implementations of a Modified Particle Swarm Optimization Algorithm with Application to Solving Large-Scale Systems of Nonlinear Equations
by Bruno Silva, Luiz Guerreiro Lopes and Fábio Mendonça
Electronics 2025, 14(3), 584; https://doi.org/10.3390/electronics14030584 - 1 Feb 2025
Viewed by 480
Abstract
This paper presents a novel Graphics Processing Unit (GPU) accelerated implementation of a modified Particle Swarm Optimization (PSO) algorithm specifically designed to solve large-scale Systems of Nonlinear Equations (SNEs). The proposed GPU-based parallel version of the PSO algorithm uses the inherent parallelism of modern hardware architectures. Its performance is compared against both sequential and multithreaded Central Processing Unit (CPU) implementations. The primary objective is to evaluate the efficiency and scalability of PSO across different hardware platforms with a focus on solving large-scale SNEs involving thousands of equations and variables. The GPU-parallelized and multithreaded versions of the algorithm were implemented in the Julia programming language. Performance analyses were conducted on an NVIDIA A100 GPU and an AMD EPYC 7643 CPU. The tests utilized a set of challenging, scalable SNEs with dimensions ranging from 1000 to 5000. Results demonstrate that the GPU accelerated modified PSO substantially outperforms its CPU counterparts, achieving substantial speedups and consistently surpassing the highly optimized multithreaded CPU implementation in terms of computation time and scalability as the problem size increases. Therefore, this work evaluates the trade-offs between different hardware platforms and underscores the potential of GPU-based parallelism for accelerating SNE solvers. Full article
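The core idea of casting a System of Nonlinear Equations as an optimization target for PSO can be sketched in a few lines of NumPy (the paper's implementations are in Julia; the hyperparameters and bounds below are illustrative assumptions, and the modified PSO of the paper differs from this canonical form).

```python
import numpy as np

def pso_solve(residual, dim, n_particles=256, iters=2000, seed=0):
    """Minimal PSO for F(x) = 0, cast as minimizing f(x) = sum(F(x)**2).
    `residual` maps an (n_particles, dim) array to an (n_particles, m) array."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10, 10, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                            # particle velocities
    pbest, pbest_val = x.copy(), np.sum(residual(x) ** 2, axis=1)
    g, g_val = pbest[np.argmin(pbest_val)].copy(), pbest_val.min()
    w, c1, c2 = 0.72, 1.49, 1.49                    # standard PSO constants
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        val = np.sum(residual(x) ** 2, axis=1)
        better = val < pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        if pbest_val.min() < g_val:
            g_val = pbest_val.min()
            g = pbest[np.argmin(pbest_val)].copy()
    return g, g_val

# Example: x0**2 + x1 - 11 = 0, x0 + x1**2 - 7 = 0
F = lambda x: np.stack([x[:, 0]**2 + x[:, 1] - 11, x[:, 0] + x[:, 1]**2 - 7], axis=1)
root, err = pso_solve(F, dim=2)
```

The GPU version described in the abstract parallelizes exactly these per-particle evaluations and updates, which are embarrassingly parallel across the swarm.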
Show Figures

Figure 1. Execution pipeline of the multithreaded CPU implementation of the PPSO algorithm.
Figure 2. Execution pipeline for the GPU-based implementation of the PPSO algorithm.
Figure 3. Multithreaded performance analysis by problem dimension: (a) 1000, (b) 2000, (c) 3000, (d) 4000, (e) 5000. Mean processing time and speedup ratio as functions of the number of threads (1 thread corresponds to sequential execution).
Figure 4. Mean aggregated speedups and Amdahl's law theoretical predictions.
Figure 5. GPU parallelization performance for each test problem: speedup ratios for FP32 and FP64 by problem dimension, relative to sequential and 128-threaded executions. Subfigures (a,c) show FP32 speedup: (a) sequential, (c) 128-threaded; subfigures (b,d) show FP64 speedup: (b) sequential, (d) 128-threaded.
Figure 6. Mean GPU-based parallelization performance relative to sequential (left) and 128-threaded (right) executions: mean speedup ratios for FP32 and FP64 as functions of problem size.
18 pages, 3106 KiB  
Article
An FPGA-Based Hybrid Overlapping Acceleration Architecture for Small-Target Remote Sensing Detection
by Nan Fang, Liyuan Li, Xiaoxuan Zhou, Wencong Zhang and Fansheng Chen
Remote Sens. 2025, 17(3), 494; https://doi.org/10.3390/rs17030494 - 31 Jan 2025
Viewed by 578
Abstract
Small-object detection in satellite remote sensing images plays a pivotal role in the field of remote sensing. Achieving high-performance real-time detection demands not only efficient algorithms but also low-power, high-performance hardware platforms. However, most mainstream target detection methods currently rely on graphics processing units (GPUs) for acceleration, and the high power consumption of GPUs limits their use in resource-constrained platforms such as small satellites. Moreover, small-object detection faces multiple challenges: the targets occupy only a small number of pixels in the image, the background is often complex with significant noise interference, and existing detection models typically exhibit low accuracy when dealing with small targets. In addition, the large number of parameters in these models makes direct deployment on embedded devices difficult. To address these issues, we propose a hybrid overlapping acceleration architecture based on FPGA, along with a lightweight model derived from YOLOv5s that is specifically designed to enhance the detection of small objects in remote sensing images. This model incorporates a lightweight GhostBottleneckV2 module, significantly reducing both model parameters and computational complexity. Experimental results on the TIFAD thermal infrared small-object dataset show that our approach achieves an average precision (mAP) of 67.8% while consuming an average power of only 2.8 W. The robustness of the proposed model is verified by the HRSID dataset. Combining real-time performance with high energy efficiency, this architecture is particularly well suited for on-board remote sensing image processing systems, where reliable and efficient small-object detection is paramount. Full article
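The GhostBottleneckV2 module mentioned above builds on the Ghost convolution idea: generate a small set of intrinsic feature maps with a regular convolution and derive the remaining ("ghost") maps with a cheap depthwise convolution, cutting parameters and FLOPs roughly in half. The PyTorch sketch below illustrates only that Ghost step under assumed layer sizes; the paper's block also adds DFC attention, which is omitted here.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost convolution sketch: intrinsic features + cheap depthwise 'ghosts'.
    Assumes out_ch is even so the two halves match (illustrative only)."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        init_ch = out_ch // ratio              # intrinsic channels
        new_ch = out_ch - init_ch              # ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, new_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),   # depthwise convolution
            nn.BatchNorm2d(new_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostModule(16, 32)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```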
Show Figures

Figure 1. The architecture of YOLOv5s [23].
Figure 2. Bottleneck structure diagram of GhostNetV2 [26]: (a) bottleneck with a step length of 1; (b) bottleneck with a step length of 2; (c) DFC attention. The Ghost module and DFC attention operate as two parallel branches, each extracting information from a different perspective.
Figure 3. GF-YOLO structure.
Figure 4. Data flow diagram.
Figure 5. GF-YOLO detection plot on the HRSID.
Figure 6. GF-YOLO detection plot on the TIFAD.
33 pages, 19016 KiB  
Article
Multitask Learning-Based Pipeline-Parallel Computation Offloading Architecture for Deep Face Analysis
by Faris S. Alghareb and Balqees Talal Hasan
Computers 2025, 14(1), 29; https://doi.org/10.3390/computers14010029 - 20 Jan 2025
Viewed by 1213
Abstract
Deep Neural Networks (DNNs) have been widely adopted in several advanced artificial intelligence applications due to their accuracy being competitive with that of the human brain. Nevertheless, the superior accuracy of a DNN is achieved at the expense of intensive computations and storage complexity, requiring custom expandable hardware, i.e., graphics processing units (GPUs). Interestingly, leveraging the synergy of parallelism and edge computing can significantly improve CPU-based hardware platforms. Therefore, this manuscript explores levels of parallelism along with edge computation offloading to develop an innovative hardware platform that improves the efficacy of deep learning computing architectures. Furthermore, the multitask learning (MTL) approach is employed to construct a parallel multi-task classification network. These tasks include face detection and recognition, age estimation, gender recognition, smile detection, and hair color and style classification. Additionally, both pipeline and parallel processing techniques are utilized to expedite complicated computations, boosting the overall performance of the presented deep face analysis architecture. A computation offloading approach, on the other hand, is leveraged to distribute computation-intensive tasks to the server edge, whereas lightweight computations are offloaded to edge devices, i.e., Raspberry Pi 4. To train the proposed deep face analysis network architecture, two custom datasets (HDDB and FRAED) were created for head detection and face-age recognition. Extensive experimental results demonstrate the efficacy of the proposed pipeline-parallel architecture in terms of execution time. It requires 8.2 s to provide detailed face detection and analysis for an individual and 23.59 s for an inference containing 10 individuals. Moreover, a speedup of 62.48% is achieved compared to the sequential-based edge computing architecture. Meanwhile, a 25.96% speedup is realized when implementing the proposed pipeline-parallel architecture only on the server edge, compared to the server's sequential implementation. Considering classification efficiency, the proposed classification modules achieve an accuracy of 88.55% for hair color and style classification and a remarkable prediction outcome of 100% for face recognition and age estimation. To summarize, the proposed approach can assist in reducing the required execution time and memory capacity by processing all facial tasks simultaneously on a single deep neural network rather than building a CNN model for each task. Therefore, the presented pipeline-parallel architecture can be a cost-effective framework for real-time computer vision applications implemented on resource-limited devices. Full article
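The multitask-learning idea of the abstract — one shared network serving several facial tasks instead of one CNN per task — can be sketched as a shared backbone with lightweight heads. The backbone, head sizes, and task list below are placeholders, not the paper's modified VGG-Face configuration.

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Shared feature extractor feeding several task-specific heads (sketch)."""
    def __init__(self, feat_dim=512, n_ids=100, n_ages=8, n_hair_colors=5):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "identity": nn.Linear(feat_dim, n_ids),
            "age": nn.Linear(feat_dim, n_ages),
            "gender": nn.Linear(feat_dim, 2),
            "smile": nn.Linear(feat_dim, 2),
            "hair_color": nn.Linear(feat_dim, n_hair_colors),
        })

    def forward(self, x):
        feats = self.backbone(x)                  # computed once, shared by all tasks
        return {task: head(feats) for task, head in self.heads.items()}

out = MultiTaskFaceNet()(torch.randn(2, 3, 224, 224))
print({task: logits.shape for task, logits in out.items()})
```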
Show Figures

Figure 1. Head detection dataset versus face detection dataset using nano-based YOLOv8.
Figure 2. Sample images from the hair dataset used to train the hair color-style module.
Figure 3. Selected image samples of the created face recognition and age estimation dataset.
Figure 4. The general framework of the proposed deep face analysis architecture.
Figure 5. Stages of the pipeline-multithreading architecture, showing four images being processed in parallel.
Figure 6. Proposed pipeline-parallel architectures with thread distributions; (a) multithreading three MTL-based classifiers on a single edge device, (b) multithreading three MTL-based classifiers on a cluster containing three edge computing devices.
Figure 7. Modified VGG-Face network to support the multitask classification approach.
Figure 8. Offloading feature maps of detected heads to edge devices using multithreading.
Figure 9. Multithreading of parallel modules on edge server and edge node processors.
Figure 10. The framework of system deployment for the proposed deep face analysis.
Figure 11. Training and validation performance of the YOLOv8 model for head detection. The x-axis represents the number of epochs.
Figure 12. YOLOv8 testing performance; (a) confusion matrix, (b) precision, (c) recall, (d) precision-recall, and (e) F1 score confidence curve.
Figure 13. Head detection result samples of YOLOv8, where a red box denotes a detected head with its corresponding confidence level.
Figure 14. Confusion matrices for classification modules using STL and MTL; (a) hair color STL, (b) hair color MTL, (c) hairstyle STL, (d) hairstyle MTL, (e) gender STL, (f) gender MTL, (g) smile STL, (h) smile MTL, (i) face STL, (j) face MTL, (k) age STL, and (l) age MTL module.
Figure 15. Speed performance evaluation of the proposed pipeline-parallel architecture; (a) execution time for pipeline-parallel configurations versus sequential implementation, (b) speedup comparisons of implemented configurations.
22 pages, 9154 KiB  
Article
Turbulent Flow Through Sluice Gate and Weir Using Smoothed Particle Hydrodynamics: Evaluation of Turbulence Models, Boundary Conditions, and 3D Effects
by Efstathios Chatzoglou and Antonios Liakopoulos
Water 2025, 17(2), 152; https://doi.org/10.3390/w17020152 - 8 Jan 2025
Viewed by 720
Abstract
Understanding flow dynamics around hydraulic structures is essential for optimizing water management systems and predicting flow behavior in real-world applications. In this study, we simulate a 3D flow control system featuring a sluice gate and a weir, commonly used in hydraulic engineering. The focus is on accurately incorporating modified dynamic boundary conditions (mDBCs) and viscosity treatment to improve the simulation of complex, turbulent flows. We assess the performance of the Smoothed Particle Hydrodynamics (SPH) method in handling these challenging conditions, especially since boundary conditions and applicability to industry are two of the SPH method's grand challenges. Simulations were conducted on a Graphics Processing Unit (GPU) using the DualSPHysics code. The results were compared to theoretical predictions and experimental data found in the literature. Key hydraulic characteristics, including 3D flow effects, hydraulic jump formation, and turbulent behavior, are examined. The combination of mDBCs with the Laminar plus sub-particle scale turbulence model achieved the correct simulation results. The findings demonstrate agreement between simulations, theoretical predictions, and experimental results. This work provides a reliable framework for analyzing turbulent flows in hydraulic structures and can be used as reference data or a prototype for larger-scale simulations in both research and engineering design, particularly in contexts requiring robust and precise flow control and/or environmental management. Full article
(This article belongs to the Special Issue Hydrodynamic Science Experiments and Simulations)
Show Figures

Figure 1. mDBC representation. The mirroring of ghost nodes (crosses) and the kernel radius around the ghost nodes for boundary particles in a flat surface and a corner.
Figure 2. Simulation model set up. (a) L1 = 2.02 m, L2 = 4.12, Hweir = the height of the weir depending on the case, YU = upstream initial water depth = 20 cm, and YG = 0.025 m is the gate opening; (b) solid wall particles and the initial arrangement of liquid particles.
Figure 3. AVM. The transient formation of the hydraulic jump. (a) T = 50 s, (b) T = 100 s, (c) T = 150 s, (d) T = 200 s.
Figure 4. The definition sketch for corner vortices upstream of the sluice gate (from [42]).
Figure 5. AVM. The transient formation of the coherent vortices upstream of the sluice gate, as seen on a horizontal plane at Z = 18 cm, (a) T = 100 s, (b) T = 150 s, (c) T = 200 s. AA' sluice gate.
Figure 6. AVM. X-component of velocity. Front view. X = 1.81 m. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 7. AVM. X-component of velocity. Front view. X = 3 m. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 8. AVM. The velocity profiles at the selected longitudinal positions on the vertical mid-plane of the channel, Y = 0.075 m. (a) The location upstream of the sluice gate, X = 1.81 m. (b) The locations downstream of the sluice gate. T = 200 s, flow steady in the mean.
Figure 9. AVM. The side view Y = 0.075 m. Characteristics of the hydraulic jump at T = 200 s.
Figure 10. L-SPS. The transient formation of the hydraulic jump. (a) T = 50 s, (b) T = 100 s, (c) T = 150 s, (d) T = 200 s.
Figure 11. L-SPS. The transient formation of coherent vortices upstream of the sluice gate as seen on a horizontal plane at Z = 0.18 m, (a) T = 100 s, (b) T = 150 s, (c) T = 200 s. AA' the sluice gate.
Figure 12. L-SPS. X-component of velocity. X = 1.81 m. Front view. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 13. L-SPS. X-component of velocity at X = 3 m. Front view. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 14. L-SPS. The velocity profile at selected longitudinal locations on plane Y = 0.075 m (mid vertical plane of the channel). (a) The position upstream of the sluice gate, X = 1.81 m, and (b) the positions downstream of the sluice gate. T = 200 s, steady in the mean flow.
Figure 15. L-SPS. Streamlines at Y = 0.075 m (channel mid vertical plane), T = 200 s.
Figure 16. Comparison of SPH results with experimental data [32] and Swamee [46] empirical relations.
Figure A1. AVM. The time series of the X-component of velocity at X = 1.81 m (upstream of the sluice gate). (a) Point X = 1.81 m, Y = 0.075 m, Z = 0.005 m; (b) Point X = 1.81 m, Y = 0.075 m, Z = 0.14 m; (c) Point X = 1.81 m, Y = 0.075 m, Z = 0.18 m.
Figure A2. AVM. The time series of the X-component of velocity at X = 3 m (downstream of the hydraulic jump). (a) Point X = 3 m, Y = 0.075 m, Z = 0.005 m; (b) Point X = 3 m, Y = 0.075 m, Z = 0.035 m; (c) Point X = 3 m, Y = 0.075 m, Z = 0.05 m.
Figure A3. The visualization of L-SPS at Y = 0.075 m, showing flow characteristics on a clipped plane with projections on the channel's back wall. The locations of measurement points (A1–A3, B1–B3) at X = 1.8 m and X = 3 m are highlighted.
Figure A4. L-SPS. The time series of the X-component of velocity at X = 1.81 m (upstream of the sluice gate). (a) Point A1, X = 1.81 m, Y = 0.075 m, Z = 0.005 m; (b) Point A2, X = 1.81 m, Y = 0.075 m, Z = 0.14 m; (c) Point A3, X = 1.81 m, Y = 0.075 m, Z = 0.18 m; (d) all points.
Figure A5. L-SPS. The time series of the X-component of velocity at X = 3 m (downstream of the hydraulic jump). (a) Point B1, X = 3 m, Y = 0.075 m, Z = 0.005 m; (b) Point B2, X = 3 m, Y = 0.075 m, Z = 0.035 m; (c) Point B3, X = 3 m, Y = 0.075 m, Z = 0.05 m; (d) all points.
34 pages, 9890 KiB  
Article
Synchronized Delay Measurement of Multi-Stream Analysis over Data Concentrator Units
by Anvarjon Yusupov, Sun Park and JongWon Kim
Electronics 2025, 14(1), 81; https://doi.org/10.3390/electronics14010081 - 27 Dec 2024
Viewed by 652
Abstract
Autonomous vehicles (AVs) rely heavily on multi-modal sensors to perceive their surroundings and make real-time decisions. However, the increasing complexity of these sensors, combined with the computational demands of AI models and the challenges of synchronizing data across multiple inputs, presents significant obstacles for AV systems. These challenges of the AV domain often lead to performance latency, resulting in delayed decision-making, causing major traffic accidents. The data concentrator unit (DCU) concept addresses these issues by optimizing data pipelines and implementing intelligent control mechanisms to process sensor data efficiently. Identifying and addressing bottlenecks that contribute to latency can enhance system performance, reducing the need for costly hardware upgrades or advanced AI models. This paper introduces a delay measurement tool for multi-node analysis, enabling synchronized monitoring of data pipelines across connected hardware platforms, such as clock-synchronized DCUs. The proposed tool traces the execution flow of software applications and assesses time delays at various stages of the data pipeline in clock-synchronized hardware. The various stages are represented with intuitive graphical visualization, simplifying the identification of performance bottlenecks. Full article
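The per-stage delay measurement described above reduces to recording timestamps at each pipeline stage and differencing consecutive ones; because the DCUs are PTP clock-synchronized, timestamps from different hosts are directly comparable. The toy tracer below is an illustrative assumption about how such records could be gathered, not the paper's tracing tool, which hooks into the applications' execution flow.

```python
import time
from collections import defaultdict

class StageTracer:
    """Record (stage, timestamp) events per frame and report inter-stage delays.
    time.time() is assumed to follow the PTP-disciplined system clock."""
    def __init__(self):
        self.events = defaultdict(list)   # frame_id -> [(stage, t_seconds), ...]

    def record(self, frame_id, stage):
        self.events[frame_id].append((stage, time.time()))

    def stage_delays(self, frame_id):
        """Delay between consecutive recorded stages, in milliseconds."""
        ev = sorted(self.events[frame_id], key=lambda e: e[1])
        return {f"{a[0]} -> {b[0]}": (b[1] - a[1]) * 1e3 for a, b in zip(ev, ev[1:])}

tracer = StageTracer()
tracer.record("frame-001", "camera_capture")
tracer.record("frame-001", "dcu_ingest")
tracer.record("frame-001", "inference_done")
print(tracer.stage_delays("frame-001"))
```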
(This article belongs to the Special Issue Advancements in Connected and Autonomous Vehicles)
Show Figures

Figure 1. Conceptual diagram of AI-integrated V2X-Car Edge Cloud.
Figure 2. Conceptual diagram of the delay measurement tool.
Figure 3. Design of software tracing module.
Figure 4. Processing design of delay calculation and visualization module.
Figure 5. Design of synchronized cameras and DCU connection diagram for high availability (HA).
Figure 6. PTP master clock initialization on the monitoring system.
Figure 7. PTP slave clock initialization on the target system.
Figure 8. Synchronization between the hardware clock and the system clock on the target system.
Figure 9. Leveraging open-source software for the delay measurement tool.
Figure 10. Prototype of the delay measurement environment with H/W and S/W.
Figure 11. DCU load test by increasing the number of video files.
Figure 12. Comparing the results of delay measurement times.
Figure 13. Task delay visualization GUI.
Figure 14. Frame-processing delay visualization GUI.
40 pages, 1079 KiB  
Article
Context-Adaptable Deployment of FastSLAM 2.0 on Graphic Processing Unit with Unknown Data Association
by Jessica Giovagnola, Manuel Pegalajar Cuéllar and Diego Pedro Morales Santos
Appl. Sci. 2024, 14(23), 11466; https://doi.org/10.3390/app142311466 - 9 Dec 2024
Viewed by 1071
Abstract
Simultaneous Localization and Mapping (SLAM) algorithms are crucial for enabling agents to estimate their position in unknown environments. In autonomous navigation systems, these algorithms need to operate in real-time on devices with limited resources, emphasizing the importance of reducing complexity and ensuring efficient performance. While SLAM solutions aim at ensuring accurate and timely localization and mapping, one of their main limitations is their computational complexity. In this scenario, particle filter-based approaches such as FastSLAM 2.0 can significantly benefit from parallel programming due to their modular construction. The parallelization process involves identifying the parameters affecting the computational complexity in order to distribute the computation among single multiprocessors as efficiently as possible. However, the computational complexity of methodologies such as FastSLAM 2.0 can depend on multiple parameters whose values may, in turn, depend on each specific use case scenario (i.e., the context), leading to multiple possible parallelization designs. Furthermore, the features of the hardware architecture in use can significantly influence the performance in terms of latency. Therefore, the selection of the optimal parallelization modality still needs to be empirically determined. This may involve redesigning the parallel algorithm depending on the context and the hardware architecture. In this paper, we propose a CUDA-based adaptable design for FastSLAM 2.0 on GPU, in combination with an evaluation methodology that enables the assessment of the optimal parallelization modality based on the context and the hardware architecture without the need for the creation of separate designs. The proposed implementation includes the parallelization of all the functional blocks of the FastSLAM 2.0 pipeline. Additionally, we contribute a parallelized design of the data association step through the Joint Compatibility Branch and Bound (JCBB) method. Multiple resampling algorithms are also included to accommodate the needs of a wide variety of navigation scenarios. Full article
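Among the building blocks the abstract mentions, resampling is the easiest to illustrate compactly. Below is a NumPy sketch of systematic resampling, one of the classic schemes a particle filter such as FastSLAM 2.0 may offer; it is shown only as an example of the step, not as the paper's CUDA implementation.

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Return N particle indices drawn with one random offset and evenly
    spaced positions over the cumulative weight distribution.
    `weights` must be normalized to sum to 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                      # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)

weights = np.array([0.1, 0.4, 0.3, 0.2])
print(systematic_resample(weights))           # e.g., [1 1 2 3]
```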
Show Figures

Figure 1. FastSLAM 2.0 pipeline.
Figure 2. Observation model—graphical representation.
Figure 3. Hardware–software architecture schema.
Figure 4. Simulation environment schema.
Figure 5. Functional blocks partitioning schema.
Figure 6. Detailed heterogeneous architecture pipeline.
Figure 7. Data association pipeline.
Figure 8. Particle Initialization—elapsed time.
Figure 9. Particle Prediction—elapsed time.
Figure 10. Mahalanobis Distance—elapsed time.
Figure 11. Problem Preparation—elapsed time.
Figure 12. Branch and Bound—elapsed time.
Figure 13. Proposal Adjustment—elapsed time.
Figure 14. Landmark Estimation—elapsed time.
Figure 15. Resampling traditional methods—elapsed time.
Figure 16. Resampling alternative methods—elapsed time.
17 pages, 3121 KiB  
Article
Real-Time Radar Classification Based on Software-Defined Radio Platforms: Enhancing Processing Speed and Accuracy with Graphics Processing Unit Acceleration
by Seckin Oncu, Mehmet Karakaya, Yaser Dalveren, Ali Kara and Mohammad Derawi
Sensors 2024, 24(23), 7776; https://doi.org/10.3390/s24237776 - 4 Dec 2024
Viewed by 974
Abstract
This paper presents a comprehensive evaluation of real-time radar classification using software-defined radio (SDR) platforms. The transition from analog to digital technologies, facilitated by SDR, has revolutionized radio systems, offering unprecedented flexibility and reconfigurability through software-based operations. This advancement complements the role of radar signal parameters, encapsulated in the pulse description words (PDWs), which play a pivotal role in electronic support measure (ESM) systems, enabling the detection and classification of threat radars. This study proposes an SDR-based radar classification system that achieves real-time operation with enhanced processing speed. Employing the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm as a robust classifier, the system harnesses Graphical Processing Unit (GPU) parallelization for efficient radio frequency (RF) parameter extraction. The experimental results highlight the efficiency of this approach, demonstrating a notable improvement in processing speed while operating at a sampling rate of up to 200 MSps and achieving an accuracy of 89.7% for real-time radar classification. Full article
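The classification step described above, clustering pulse description words such as pulse width (PW) and radio frequency (RF) with DBSCAN, can be sketched with scikit-learn. The toy PDW values and DBSCAN parameters below are invented for illustration; the paper extracts the parameters on the SDR/GPU side and tunes the clustering for its own data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy pulse description words: pulse width (us) and RF carrier (MHz) for two emitters.
pdws = np.array([
    [1.0, 9400.0], [1.1, 9395.0], [0.9, 9405.0],    # emitter A
    [5.0, 2900.0], [5.2, 2905.0], [4.8, 2895.0],    # emitter B
    [9.0, 6000.0],                                  # stray pulse / noise
])

features = StandardScaler().fit_transform(pdws)      # put PW and RF on a common scale
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
print(labels)   # e.g., [0 0 0 1 1 1 -1]; -1 marks noise
```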
(This article belongs to the Section Radar Sensors)
Show Figures

Figure 1. Illustration of a functional diagram of an ESM system.
Figure 2. Functional structure of an SDR receiver.
Figure 3. IQ implementation of DCR.
Figure 4. Illustration of radar pulse and some basic parameters.
Figure 5. Illustration of the experimental setup.
Figure 6. Frequency measurement results of test scenario.
Figure 7. Flowchart of the proposed radar classification algorithm.
Figure 8. Clusters in PW-RF plane.
Figure 9. Clusters in PW-PA domain.
3597 KiB  
Proceeding Paper
A Tool for Improved Monitoring of Acoustic Beacons and Receivers of the KM3NeT Neutrino Telescope
by Letizia Stella Di Mauro, Dídac Diego-Tortosa, Giorgio Riccobene and Salvatore Viola
Eng. Proc. 2024, 82(1), 77; https://doi.org/10.3390/ecsa-11-20490 - 26 Nov 2024
Viewed by 149
Abstract
KM3NeT is an underwater neutrino detector currently under construction. Since the installation of its first detection unit in 2015, it has been continuously collecting data. Due to its complex design comprising a 3D array of sensors, an Acoustic Positioning System (APS) has been developed to monitor the position of each sensor. Given the increasing number of acoustic sensors used for the APS, both receivers and emitters, a solution has been implemented to check their status. In this contribution, a monitoring tool for this instrumentation is presented, capable of evaluating its status at both the data and operational levels. For effective monitoring, it is crucial to associate the signal recorded by a receiver with the corresponding transmitter. The Acoustic Data Filter (ADF) performs a cross-correlation between the signals retained in a buffer and those emitted by each installed emitter. It saves the maximum peak value and its associated time of arrival for each expected signal. However, the growing number of beacons complicates the differentiation of corresponding transmitters due to the huge amount of data recorded by the ADF needing post-processing. To address this challenge, a monitoring tool is developed that analyzes the internal clock of each emitter to distinguish and filter the data collected by the ADF. This tool has proven to be highly effective at verifying the correct operation of all acoustic devices deployed at sea. The acoustic monitoring graphical output produced for each data slot facilitates quick failure detection, enabling a swift response. Last but not least, the tool is modular and scalable, adapting to the addition or removal of sensors from the detector. Full article
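The cross-correlation step the ADF performs, matching a recorded buffer against each emitter's known waveform and keeping the peak and its time of arrival, can be sketched in a few lines of SciPy. The function and its interface are illustrative assumptions, not the KM3NeT code.

```python
import numpy as np
from scipy.signal import correlate

def detect_emitters(recorded, templates, fs):
    """Cross-correlate a recorded buffer with each emitter's reference waveform
    and keep, per emitter, the maximum correlation peak and its time of arrival.
    `templates` is a dict {emitter_id: reference_waveform}; all signals share
    the sampling rate fs (sketch only)."""
    results = {}
    for emitter, ref in templates.items():
        xcorr = correlate(recorded, ref, mode="valid")
        peak = int(np.argmax(np.abs(xcorr)))
        results[emitter] = {"peak": float(np.abs(xcorr[peak])), "toa_s": peak / fs}
    return results
```

Filtering the resulting times of arrival against each emitter's internal clock (the repetition rate used in the figures below) is what separates correct from spurious detections.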
Show Figures

Figure 1. (a) Acoustic beacon MAB100 (aluminium version) produced by MSM, Valencia, Spain; (b) DG0330 hydrophone, produced by Colmar, La Spezia, Italy; (c) Pz27 encapsuled piezoceramic sensor, assembled by GCD-PCB-Design GmbH, Erlangen, Germany.
Figure 2. (a) D0ARCA028 footprint; (b) D1ORCA019 footprint.
Figure 3. Modulus between the selected ToAs and the RR plotted as a function of time. The correct ToAs (highlighted in green) align along a straight line. This example illustrates the APS data from the ARCA hydrophone in DU26 related to WF14.
Figure 4. Modulus between the selected ToAs and the RR plotted as a function of time. This example illustrates the APS data from the ARCA hydrophone in DU1 related to WF33.
Figure 5. Modulus between the selected ToAs and the RR plotted as a function of time. This example illustrates the APS data from the ARCA hydrophone in DU28 related to WF33 using (a) n = 1, where all the data correspond to incorrect ToAs, and (b) n = 5, where there are some correct ToAs.
Figure 6. Modulus of the selected ToAs and the RR plotted as a function of time. (a) Incorrect ToAs. Data from the ARCA hydrophone in DU19, where ToAs deviate from the expected alignment, suggesting a potential issue with this hydrophone. (b) Correct ToAs. Data from the ARCA hydrophone in DU22, where ToAs align correctly along the expected straight line, indicating proper functioning of the AB.
Figure 7. Example of a cumulative acoustic monitoring plot over multiple runs for ORCA.
Figure 8. Example of a cumulative acoustic monitoring plot over multiple runs for ARCA.
23 pages, 8542 KiB  
Article
Graphics Processing Unit-Accelerated Propeller Computational Fluid Dynamics Using AmgX: Performance Analysis Across Mesh Types and Hardware Configurations
by Yue Zhu, Jin Gan, Yongshui Lin and Weiguo Wu
J. Mar. Sci. Eng. 2024, 12(12), 2134; https://doi.org/10.3390/jmse12122134 - 22 Nov 2024
Viewed by 743
Abstract
Computational fluid dynamics (CFD) has become increasingly prevalent in marine and offshore engineering, with enhancing simulation efficiency emerging as a critical challenge. This study systematically evaluates the application of graphics processing unit (GPU) acceleration technology in CFD simulation of propeller open water performance. Numerical simulations of the VP1304 propeller model were performed using OpenFOAM v2312 integrated with the NVIDIA AmgX library. The research compared GPU acceleration performance against conventional CPU methods across various hardware configurations and mesh types (tetrahedral, hexahedral-dominant, and polyhedral). Results demonstrate that GPU acceleration significantly improved computational efficiency, with tetrahedral meshes achieving over 400% speedup in a 4-GPU configuration, while polyhedral meshes reached over 500% speedup with a fixed mesh count. Among the mesh types, hexahedral-dominant meshes performed best in capturing flow field details. The study also found that GPU acceleration does not compromise simulation accuracy, but its effectiveness is closely related to mesh type and hardware configuration. Notably, GPUs demonstrate more significant advantages when handling large-scale problems. These findings have important practical implications for improving propeller design processes and shortening product development cycles. Full article
(This article belongs to the Section Ocean Engineering)
Show Figures

Figure 1. CFD simulation process accelerated through GPU in OpenFOAM.
Figure 2. Geometric model of propeller VP1304. (a) Front view; (b) side view.
Figure 3. Numerical simulation domain for open water performance of VP1304 propeller.
Figure 4. Details of CFD mesh refinement.
Figure 5. Comparative illustration of three computational domain dimensions (small, medium, and large).
Figure 6. Different mesh types for CFD simulations. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 7. Comparison of open water performance of propeller between simulation results on different hardware platforms and experimental data.
Figure 8. Comparison of open water performance of propeller between simulation results using different grid types and experimental data, with a base size of 4.5 mm.
Figure 9. Pressure distribution contour plots for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 10. Vorticity distribution for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 11. Velocity distribution for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 12. Simulation time versus number of CPU cores for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 13. Speedup factor versus number of CPU cores for different element types, corresponding to Figure 12. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 14. Simulation time versus number of GPUs for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 15. Speedup factor versus number of GPUs for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 16. Speedup of different numbers of GPUs compared to 32-core CPU with consistent mesh size.
Figure 17. Speedup of different numbers of GPUs compared to 32-core CPU with consistent mesh number.
20 pages, 5217 KiB  
Article
A Real-Time Signal Measurement System Using FPGA-Based Deep Learning Accelerators and Microwave Photonic
by Longlong Zhang, Tong Zhou, Jie Yang, Yin Li, Zhiwen Zhang, Xiang Hu and Yuanxi Peng
Remote Sens. 2024, 16(23), 4358; https://doi.org/10.3390/rs16234358 - 22 Nov 2024
Viewed by 924
Abstract
Deep learning techniques have been widely investigated as an effective method for signal measurement in recent years. However, most existing deep learning-based methods are still difficult to deploy on embedded platforms and perform poorly in real-time applications. To address this, this paper develops two accelerators, as the core of the signal measurement system, for intelligent signal processing. Firstly, by introducing the idea of an automated framework, we propose a minimal deep neural network (DNN)-based hardware structure, which automatically maps algorithms to hardware modules, supports configurable parameters, and has the advantage of low latency, with an average inference time of only 3.5 μs. Subsequently, another accelerator is designed with an efficient hardware structure for the long short-term memory (LSTM) + DNN model, demonstrating outstanding performance with a classification accuracy of 98.82%, a mean absolute error (MAE) of 0.27°, and a root mean square error (RMSE) of 0.392° after model compression. Moreover, parallel optimization strategies are exploited to further reduce latency and support simultaneous frequency and direction measurement tasks. Finally, we test the accelerators on actually collected signal data using the XCVU13P field programmable gate array (FPGA). The results show that inference time is reduced by 28–31% for the DNN model and by 71–73% for the LSTM + DNN model compared to running on a graphics processing unit (GPU). In addition, the parallel strategies further decrease the delay by 23.9% and 37.5% when processing continuous data. The FPGA-based and deep learning-assisted hardware accelerators significantly improve real-time performance and provide a promising solution for signal measurement. Full article
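The LSTM + DNN model referenced above, an LSTM summarizing a short sequence of digitized envelope samples followed by fully connected layers producing the estimate, can be sketched in PyTorch. The layer sizes, input dimensionality, and single regression output below are placeholders, not the compressed model deployed on the FPGA.

```python
import torch
import torch.nn as nn

class LSTMDNN(nn.Module):
    """LSTM feature extractor + fully connected head for angle estimation (sketch)."""
    def __init__(self, in_features=4, hidden=64, dnn_hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(hidden, dnn_hidden), nn.ReLU(),
            nn.Linear(dnn_hidden, 1))          # predicted direction of arrival (degrees)

    def forward(self, x):                      # x: (batch, seq_len, in_features)
        out, _ = self.lstm(x)
        return self.dnn(out[:, -1, :])         # use the last time step's hidden state

model = LSTMDNN()
dummy = torch.randn(8, 16, 4)                  # 8 sequences, 16 time steps, 4 channels
print(model(dummy).shape)                      # torch.Size([8, 1])
```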
Show Figures

Figure 1. Microwave direction finding system with long-baseline array. DDMZM: dual-drive Mach Zehnder modulator; PD: photodetector; LNA: low noise amplifier; Ei: digitized envelope voltage.
Figure 2. The LSTM cell.
Figure 3. The proposed architecture of the overall system.
Figure 4. The framework from algorithm to hardware implementation based on the DNN model.
Figure 5. The least complex hardware structure based on the DNN model.
Figure 6. The hardware design of the intelligent processing module based on LSTM + DNN.
Figure 7. Parallel strategies within the layers. (a) LSTM layer; (b) FC layer.
Figure 8. Coarse-grained inter-layer parallelism strategy between layers. (a) The original latency; (b) the optimized latency.
Figure 9. The task-level parallel strategy of the intelligent processing module.
Figure 10. The loss and accuracy versus epoch given by the proposed LSTM + DNN model. (a) The loss; (b) the accuracy.
Figure 11. The experimental results of DOA estimation, including actual DOA, estimated DOA, and the corresponding errors. (a) The DNN model; (b) the LSTM + DNN model.
Figure 12. Utilized area of the compressed model for DOA. The orange represents the LSTM layer, while the green represents the other layers.
Figure 13. Comparison of latency for processing multiple input data based on FPGA. (a) DOA task; (b) IFM task.