Search Results (594)

Search Parameters:
Keywords = graphical processing units

32 pages, 2960 KiB  
Article
Comparing Application-Level Hardening Techniques for Neural Networks on GPUs
by Giuseppe Esposito, Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia and Matteo Sonza Reorda
Electronics 2025, 14(5), 1042; https://doi.org/10.3390/electronics14051042 - 6 Mar 2025
Viewed by 79
Abstract
Neural networks (NNs) are essential in advancing modern safety-critical systems. Lightweight NN architectures are deployed on resource-constrained devices using hardware accelerators such as Graphics Processing Units (GPUs) for fast responses. However, the latest semiconductor technologies may be affected by physical faults that can jeopardize NN computations, making fault mitigation crucial for safety-critical domains. Recent studies propose software-based Hardening Techniques (HTs) to address these faults. However, the proposed fault countermeasures are evaluated with different hardware-agnostic error models and different test benches, neglecting the effort required for their implementation. Comparing application-level HTs across studies is therefore challenging, leaving it unclear (i) how effective they are against hardware-aware error models on arbitrary NNs and (ii) which HTs provide the best trade-off between reliability enhancement and implementation cost. In this study, application-level HTs are evaluated homogeneously and independently through a feasibility-of-implementation study and a reliability assessment under two hardware-aware error models: (i) weight single bit-flips and (ii) neuron bit error rate. Our results indicate that not all HTs suit every NN architecture, and their effectiveness varies depending on the evaluated error model. Techniques based on range restriction of the activation function consistently outperform the others, achieving up to 58.23% greater mitigation effectiveness while keeping the inference-time overhead low and requiring contained implementation effort. Full article
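To make the range-restriction idea concrete, here is a minimal PyTorch sketch: faults such as bit-flips tend to produce abnormally large activations, so clamping each activation to a profiled bound masks most of the corruption. The class name, the fixed bound of 6.0, and the ReLU-swapping helper are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RangeRestrictedReLU(nn.Module):
    """Clip activations to a profiled per-layer bound so that fault-induced
    outliers cannot propagate (illustrative sketch only)."""
    def __init__(self, upper_bound: float = 6.0):
        super().__init__()
        self.upper_bound = upper_bound

    def forward(self, x):
        # A bit-flip in a weight or feature map typically produces huge values;
        # clamping to the expected range suppresses the error.
        return torch.clamp(x, min=0.0, max=self.upper_bound)

def harden_model(model: nn.Module, bound: float = 6.0) -> nn.Module:
    """Swap every ReLU in a model for the range-restricted variant."""
    for name, child in list(model.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(model, name, RangeRestrictedReLU(bound))
        else:
            harden_model(child, bound)
    return model
```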
Show Figures

Figure 1. Basic convolutional block.
Figure 2. Hierarchical GPU organization [46].
Figure 3. Effect of the fault propagation up to the application level.
Figure 4. Hardening technique evaluation procedure.
Figure 5. Fault class distribution. In this figure, the HT names are abbreviated as follows: Baseline (BL), Adaptive Clipper (AC), Swap ReLU6 (SR) and median filter (MF).
Figure 6. Accuracy degradation for HTs in front of errors injected in weights per bit location.
Figure 7. Accuracy degradation for HTs when injecting errors in multiple FM bit locations.
30 pages, 1684 KiB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Viewed by 206
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation. Full article
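The abstract mentions ordering similar ERIs so that quartets of the same shell class can be dispatched together on the GPU (the paper uses a 64-bit sort key, Figure 11 below). A small Python sketch of that packing idea follows; the four 16-bit fields are an assumed layout for illustration, not the paper's exact key format.

```python
def eri_sort_key(a: int, b: int, c: int, d: int) -> int:
    """Pack the four shell indices of an ERI quartet (ab|cd) into one 64-bit
    integer so that quartets of the same class sort next to each other.
    Field widths (16 bits each) are an illustrative assumption."""
    for value in (a, b, c, d):
        assert 0 <= value < (1 << 16)
    return (a << 48) | (b << 32) | (c << 16) | d

# Sorting by the key groups similar quartets for batched GPU dispatch.
quartets = [(1, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 0, 0)]
print(sorted(quartets, key=lambda q: eri_sort_key(*q)))
```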
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
Show Figures

Figure 1. Example of the configuration of two basis functions χ1 and χ2 (M = 2). (a) The quartet combinations, and (b) the symmetry-based combinations for Basis-ERIs when M = 2. By considering symmetrical relations, it becomes possible to reduce the number of Basis-ERIs, as shown by the upper triangular matrix in (b).
Figure 2. Example of the relation between the Basis-ERIs and the GTO-ERIs. The row (column) directions represent the bra (ket) Basis-ERIs. Each cell in the upper triangular matrix corresponds to a single Basis-ERI.
Figure 3. Basic idea behind the definition of shell-based ERIs. The term (ss| implies that a bra consists of two s-shells, and the term |sp) implies that a ket consists of one s-shell and one p-shell. The integral (ss|sp) consists of three GTO-ERIs: [ss|sp_x], [ss|sp_y], and [ss|sp_z].
Figure 4. Basic idea of the dependencies, denoted by arrows, behind computing the values of the corresponding recurrences R by using batch concepts when K = 4. The values required by the MD method are highlighted in red.
Figure 5. Basic idea of the computation of R values for each batch using triple-buffering of the shared memory.
Figure 6. Comparison of the required size of shared memory to store R values.
Figure 7. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BBM.
Figure 8. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BTM.
Figure 9. Parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in SBM.
Figure 10. Basic idea behind the parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in STM.
Figure 11. Schematic of the 64-bit key for sorting the Basis-ERIs.
22 pages, 24659 KiB  
Article
A Multi-Scale Fusion Deep Learning Approach for Wind Field Retrieval Based on Geostationary Satellite Imagery
by Wei Zhang, Yapeng Wu, Kunkun Fan, Xiaojiang Song, Renbo Pang and Boyu Guoan
Remote Sens. 2025, 17(4), 610; https://doi.org/10.3390/rs17040610 - 11 Feb 2025
Viewed by 414
Abstract
Wind field retrieval, a crucial component of weather forecasting, has been significantly enhanced by recent advances in deep learning. However, existing approaches that are primarily focused on wind speed retrieval are limited by their inability to achieve real-time, full-coverage retrievals at large scales. To address this problem, we propose a novel multi-scale fusion retrieval (MFR) method, leveraging geostationary observation satellites. At the mesoscale, MFR incorporates a cloud-to-wind transformer model, which employs local self-attention mechanisms to extract detailed wind field features. At large scales, MFR incorporates a multi-encoder coordinate U-net model, which incorporates multiple encoders and utilises coordinate information to fuse meso- to large-scale features, enabling accurate and regionally complete wind field retrievals, while reducing the computational resources required. The MFR method was validated using Level 1 data from the Himawari-8 satellite, covering a geographic range of 0–60°N and 100–160°E, at a resolution of 0.25°. Wind field retrieval was accomplished within seconds using a single graphics processing unit. The mean absolute error of wind speed obtained by the MFR was 0.97 m/s, surpassing the accuracy of the CFOSAT and HY-2B Level 2B wind field products. The mean absolute error for wind direction achieved by the MFR was 23.31°, outperforming CFOSAT Level 2B products and aligning closely with HY-2B Level 2B products. The MFR represents a pioneering approach for generating initial fields for large-scale grid forecasting models. Full article
(This article belongs to the Special Issue Image Processing from Aerial and Satellite Imagery)
Show Figures

Figure 1. Vector diagrams for (a) normal conditions, (b) Typhoon Hinnamnor, and (c) Typhoon Nanmadol in the study area. Arrow directions and lengths indicate wind direction and speed, respectively.
Figure 2. Multi-scale fusion retrieval architecture. Res, resolution.
Figure 3. Sliding window sampling method.
Figure 4. Structure of the C2W-Former model.
Figure 5. (a) The Swin Transformer block architecture; (b) W-MSA and SW-MSA, which are multi-head self-attention modules with regular and shifted windowing configurations, respectively. Blue and red boxes represent patches and windows, respectively.
Figure 6. Discontinuous and blurred boundaries in the preliminary UV.
Figure 7. Architecture of the Multi-encoder Coordinate U-net (M-CoordUnet) model. (a) Overall architecture. (b–d) Structural details of the encoder, centre, and decoder blocks, respectively.
Figure 8. Analysis of MFR and OSR in comparison to ERA5 for land and sea regions.
Figure 9. Scatter plots of each model versus ERA5 at 00:00 on 19 April 2021 (Super Typhoon Surigae). Closer proximity to the x = y line indicates better agreement between the two models. Warmer colours indicate higher frequency.
Figure 10. Analysis of ERA5, IFS, and MFR results in comparison to weather station data.
Figure 11. UV MAE statistics of MFR wind fields for land (green), sea (orange), and the total study area (blue) across different months and regions in the test data. Solid line indicates average error; shaded area indicates the 95% confidence interval.
Figure 12. Comparison of wind field characteristics among different models and data products during Super Typhoon Surigae (19 April 2021, 00:00 UTC).
Figure 13. Comparison of wind field characteristics among different models and data products during Super Typhoon Mindulle (28 September 2021, 12:00 UTC).
20 pages, 732 KiB  
Article
VCONV: A Convolutional Neural Network Accelerator for FPGAs
by Srikanth Neelam and A. Amalin Prince
Electronics 2025, 14(4), 657; https://doi.org/10.3390/electronics14040657 - 8 Feb 2025
Viewed by 355
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which undermines their advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP's performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% utilization of the MAC units with no idle time. We also synthesized the multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and a performance of 56 GOPs. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)
Show Figures

Figure 1. Typical architecture of a convolutional neural network layer.
Figure 2. CNN error rates vs. layers.
Figure 3. Comparison of costs and resources between different FPGAs.
Figure 4. VCONV IP: CNN accelerator architecture.
Figure 5. Input image arrangement in LB.
Figure 6. Input to multiple CEs at the same time with a multi-row line buffer.
Figure 7. Inactive counters for monitoring idle time in MACs.
Figure 8. VCONV engine integrated with DMA IP on SoC FPGA.
Figure 9. Simulation capture of the VCONV IP's functionality for AlexNet.
Figure 10. Implementing AlexNet on different FPGAs at 30 fps.
14 pages, 2560 KiB  
Article
Novel GPU-Based Method for the Generalized Maximum Flow Problem
by Delia Elena Spridon, Adrian Marius Deaconu and Javad Tayyebi
Computation 2025, 13(2), 40; https://doi.org/10.3390/computation13020040 - 5 Feb 2025
Viewed by 335
Abstract
This paper investigates the application of a minimum loss path finding algorithm to determine the maximum flow in generalized networks that are characterized by arc losses or gains. In these generalized network flow problems, each arc has not only a defined capacity but also a loss or gain factor, which must be taken into consideration when calculating the maximum achievable flow. This extension of the traditional maximum flow problem requires a more comprehensive approach, where the maximum amount of flow is determined by accounting for additional factors such as costs, varying arc capacities, and the specific loss or gain associated with each arc. This paper extends the classic Ford–Fulkerson algorithm, adapting it to iteratively identify source-to-sink (s − t) residual directed paths with minimum cumulative loss and generalized augmenting paths (GAPs), thus enabling the efficient computation of maximum flow in such complex networks. Moreover, to enhance the computational performance of the proposed algorithm, we conducted extensive studies on parallelization techniques using graphics processing units (GPUs). Significant improvements in the algorithm’s efficiency and scalability were achieved. The results demonstrate the potential of GPU-accelerated computations in handling real-world applications where generalized network flows with arc losses and gains are prevalent, such as in telecommunications, transportation, or logistics networks. Full article
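The path-selection step described above can be illustrated with a short Python sketch: finding the residual s-t path with the smallest cumulative loss is a shortest-path problem if each arc's gain factor g (assumed here to satisfy 0 < g ≤ 1 for lossy arcs) is turned into an additive weight -log(g). The graph encoding and function below are illustrative assumptions, not the paper's algorithm, which additionally handles generalized augmenting cycles.

```python
import heapq
import math

def min_loss_path(graph, source, sink):
    """Dijkstra over -log(gain) weights: returns the s-t path that preserves the
    most flow, or None if the sink is unreachable.
    graph: {u: [(v, residual_capacity, gain), ...]} with 0 < gain <= 1."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, cap, gain in graph.get(u, []):
            if cap <= 0 or gain <= 0:
                continue
            nd = d - math.log(gain)          # smaller total weight = less loss
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if sink not in dist:
        return None
    path, node = [sink], sink
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy network: arc (u, v, capacity, gain)
g = {"s": [("a", 10, 0.9), ("b", 5, 1.0)], "a": [("t", 8, 0.8)], "b": [("t", 5, 0.5)]}
print(min_loss_path(g, "s", "t"))   # ['s', 'a', 't'] — the least lossy route
```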
Show Figures

Figure 1. Example of a network with losses.
Figure 2. Example illustrating the determination of maximum flow in a network with losses. In order to highlight the progress of the algorithm, the s-t paths and cycle found by the algorithm are presented in red at each iteration.
Figure 3. Execution times for Algorithm 1 in dense networks.
Figure 4. Execution times for Algorithm 1 for different network densities.
35 pages, 2222 KiB  
Article
Multithreaded and GPU-Based Implementations of a Modified Particle Swarm Optimization Algorithm with Application to Solving Large-Scale Systems of Nonlinear Equations
by Bruno Silva, Luiz Guerreiro Lopes and Fábio Mendonça
Electronics 2025, 14(3), 584; https://doi.org/10.3390/electronics14030584 - 1 Feb 2025
Viewed by 480
Abstract
This paper presents a novel Graphics Processing Unit (GPU) accelerated implementation of a modified Particle Swarm Optimization (PSO) algorithm specifically designed to solve large-scale Systems of Nonlinear Equations (SNEs). The proposed GPU-based parallel version of the PSO algorithm uses the inherent parallelism of modern hardware architectures. Its performance is compared against both sequential and multithreaded Central Processing Unit (CPU) implementations. The primary objective is to evaluate the efficiency and scalability of PSO across different hardware platforms with a focus on solving large-scale SNEs involving thousands of equations and variables. The GPU-parallelized and multithreaded versions of the algorithm were implemented in the Julia programming language. Performance analyses were conducted on an NVIDIA A100 GPU and an AMD EPYC 7643 CPU. The tests utilized a set of challenging, scalable SNEs with dimensions ranging from 1000 to 5000. Results demonstrate that the GPU accelerated modified PSO substantially outperforms its CPU counterparts, achieving substantial speedups and consistently surpassing the highly optimized multithreaded CPU implementation in terms of computation time and scalability as the problem size increases. Therefore, this work evaluates the trade-offs between different hardware platforms and underscores the potential of GPU-based parallelism for accelerating SNE solvers. Full article
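The core idea of casting a System of Nonlinear Equations as an optimization target for PSO can be sketched in a few lines of NumPy (the paper's implementations are in Julia; the hyperparameters and bounds below are illustrative assumptions, and the modified PSO of the paper differs from this canonical form).

```python
import numpy as np

def pso_solve(residual, dim, n_particles=256, iters=2000, seed=0):
    """Minimal PSO for F(x) = 0, cast as minimizing f(x) = sum(F(x)**2).
    `residual` maps an (n_particles, dim) array to an (n_particles, m) array."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-10, 10, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                            # particle velocities
    pbest, pbest_val = x.copy(), np.sum(residual(x) ** 2, axis=1)
    g, g_val = pbest[np.argmin(pbest_val)].copy(), pbest_val.min()
    w, c1, c2 = 0.72, 1.49, 1.49                    # standard PSO constants
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        val = np.sum(residual(x) ** 2, axis=1)
        better = val < pbest_val
        pbest[better], pbest_val[better] = x[better], val[better]
        if pbest_val.min() < g_val:
            g_val = pbest_val.min()
            g = pbest[np.argmin(pbest_val)].copy()
    return g, g_val

# Example: x0**2 + x1 - 11 = 0, x0 + x1**2 - 7 = 0
F = lambda x: np.stack([x[:, 0]**2 + x[:, 1] - 11, x[:, 0] + x[:, 1]**2 - 7], axis=1)
root, err = pso_solve(F, dim=2)
```

The GPU version described in the abstract parallelizes exactly these per-particle evaluations and updates, which are embarrassingly parallel across the swarm.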
Show Figures

Figure 1. Execution pipeline of the multithreaded CPU implementation of the PPSO algorithm.
Figure 2. Execution pipeline for the GPU-based implementation of the PPSO algorithm.
Figure 3. Multithreaded performance analysis by problem dimension: (a) 1000, (b) 2000, (c) 3000, (d) 4000, (e) 5000. Mean processing time and speedup ratio as functions of the number of threads (1 thread corresponds to sequential execution).
Figure 4. Mean aggregated speedups and Amdahl's law theoretical predictions.
Figure 5. GPU parallelization performance for each test problem: speedup ratios for FP32 and FP64 by problem dimension, relative to sequential and 128-threaded executions. Subfigures (a,c) show FP32 speedup: (a) sequential, (c) 128-threaded; subfigures (b,d) show FP64 speedup: (b) sequential, (d) 128-threaded.
Figure 6. Mean GPU-based parallelization performance relative to sequential (left) and 128-threaded (right) executions: mean speedup ratios for FP32 and FP64 as functions of problem size.
18 pages, 3106 KiB  
Article
An FPGA-Based Hybrid Overlapping Acceleration Architecture for Small-Target Remote Sensing Detection
by Nan Fang, Liyuan Li, Xiaoxuan Zhou, Wencong Zhang and Fansheng Chen
Remote Sens. 2025, 17(3), 494; https://doi.org/10.3390/rs17030494 - 31 Jan 2025
Viewed by 578
Abstract
Small-object detection in satellite remote sensing images plays a pivotal role in the field of remote sensing. Achieving high-performance real-time detection demands not only efficient algorithms but also low-power, high-performance hardware platforms. However, most mainstream target detection methods currently rely on graphics processing units (GPUs) for acceleration, and the high power consumption of GPUs limits their use in resource-constrained platforms such as small satellites. Moreover, small-object detection faces multiple challenges: the targets occupy only a small number of pixels in the image, the background is often complex with significant noise interference, and existing detection models typically exhibit low accuracy when dealing with small targets. In addition, the large number of parameters in these models makes direct deployment on embedded devices difficult. To address these issues, we propose a hybrid overlapping acceleration architecture based on FPGA, along with a lightweight model derived from YOLOv5s that is specifically designed to enhance the detection of small objects in remote sensing images. This model incorporates a lightweight GhostBottleneckV2 module, significantly reducing both model parameters and computational complexity. Experimental results on the TIFAD thermal infrared small-object dataset show that our approach achieves an average precision (mAP) of 67.8% while consuming an average power of only 2.8 W. The robustness of the proposed model is verified by the HRSID dataset. Combining real-time performance with high energy efficiency, this architecture is particularly well suited for on-board remote sensing image processing systems, where reliable and efficient small-object detection is paramount. Full article
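The GhostBottleneckV2 module mentioned above builds on the Ghost convolution idea: generate a small set of intrinsic feature maps with a regular convolution and derive the remaining ("ghost") maps with a cheap depthwise convolution, cutting parameters and FLOPs roughly in half. The PyTorch sketch below illustrates only that Ghost step under assumed layer sizes; the paper's block also adds DFC attention, which is omitted here.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost convolution sketch: intrinsic features + cheap depthwise 'ghosts'.
    Assumes out_ch is even so the two halves match (illustrative only)."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        init_ch = out_ch // ratio              # intrinsic channels
        new_ch = out_ch - init_ch              # ghost channels
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, new_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),   # depthwise convolution
            nn.BatchNorm2d(new_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostModule(16, 32)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```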
Show Figures

Figure 1. The architecture of YOLOv5s [23].
Figure 2. Bottleneck structure diagram of GhostNetV2 [26]: (a) bottleneck with a step length of 1; (b) bottleneck with a step length of 2; (c) DFC attention. The Ghost module and DFC attention operate as two parallel branches, each extracting information from a different perspective.
Figure 3. GF-YOLO structure.
Figure 4. Data flow diagram.
Figure 5. GF-YOLO detection plot on the HRSID.
Figure 6. GF-YOLO detection plot on the TIFAD.
33 pages, 19016 KiB  
Article
Multitask Learning-Based Pipeline-Parallel Computation Offloading Architecture for Deep Face Analysis
by Faris S. Alghareb and Balqees Talal Hasan
Computers 2025, 14(1), 29; https://doi.org/10.3390/computers14010029 - 20 Jan 2025
Viewed by 1213
Abstract
Deep Neural Networks (DNNs) have been widely adopted in several advanced artificial intelligence applications due to their accuracy being competitive with that of the human brain. Nevertheless, the superior accuracy of a DNN is achieved at the expense of intensive computations and storage complexity, requiring custom expandable hardware, i.e., graphics processing units (GPUs). Interestingly, leveraging the synergy of parallelism and edge computing can significantly improve CPU-based hardware platforms. Therefore, this manuscript explores levels of parallelism along with edge computation offloading to develop an innovative hardware platform that improves the efficacy of deep learning computing architectures. Furthermore, the multitask learning (MTL) approach is employed to construct a parallel multi-task classification network. These tasks include face detection and recognition, age estimation, gender recognition, smile detection, and hair color and style classification. Additionally, both pipeline and parallel processing techniques are utilized to expedite complicated computations, boosting the overall performance of the presented deep face analysis architecture. A computation offloading approach, on the other hand, is leveraged to distribute computation-intensive tasks to the server edge, whereas lightweight computations are offloaded to edge devices, i.e., Raspberry Pi 4. To train the proposed deep face analysis network architecture, two custom datasets (HDDB and FRAED) were created for head detection and face-age recognition. Extensive experimental results demonstrate the efficacy of the proposed pipeline-parallel architecture in terms of execution time. It requires 8.2 s to provide detailed face detection and analysis for an individual and 23.59 s for an inference containing 10 individuals. Moreover, a speedup of 62.48% is achieved compared to the sequential-based edge computing architecture. Meanwhile, a 25.96% speedup is realized when implementing the proposed pipeline-parallel architecture only on the server edge, compared to the server's sequential implementation. Considering classification efficiency, the proposed classification modules achieve an accuracy of 88.55% for hair color and style classification and a remarkable prediction outcome of 100% for face recognition and age estimation. To summarize, the proposed approach can assist in reducing the required execution time and memory capacity by processing all facial tasks simultaneously on a single deep neural network rather than building a CNN model for each task. Therefore, the presented pipeline-parallel architecture can be a cost-effective framework for real-time computer vision applications implemented on resource-limited devices. Full article
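The multitask-learning idea of the abstract — one shared network serving several facial tasks instead of one CNN per task — can be sketched as a shared backbone with lightweight heads. The backbone, head sizes, and task list below are placeholders, not the paper's modified VGG-Face configuration.

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Shared feature extractor feeding several task-specific heads (sketch)."""
    def __init__(self, feat_dim=512, n_ids=100, n_ages=8, n_hair_colors=5):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "identity": nn.Linear(feat_dim, n_ids),
            "age": nn.Linear(feat_dim, n_ages),
            "gender": nn.Linear(feat_dim, 2),
            "smile": nn.Linear(feat_dim, 2),
            "hair_color": nn.Linear(feat_dim, n_hair_colors),
        })

    def forward(self, x):
        feats = self.backbone(x)                  # computed once, shared by all tasks
        return {task: head(feats) for task, head in self.heads.items()}

out = MultiTaskFaceNet()(torch.randn(2, 3, 224, 224))
print({task: logits.shape for task, logits in out.items()})
```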
Show Figures

Figure 1. Head detection dataset versus face detection dataset using nano-based YOLOv8.
Figure 2. Sample images from the hair dataset used to train the hair color-style module.
Figure 3. Selected image samples of the created face recognition and age estimation dataset.
Figure 4. The general framework of the proposed deep face analysis architecture.
Figure 5. Stages of the pipeline-multithreading architecture, showing four images being processed in parallel.
Figure 6. Proposed pipeline-parallel architectures with thread distributions; (a) multithreading three MTL-based classifiers on a single edge device, (b) multithreading three MTL-based classifiers on a cluster containing three edge computing devices.
Figure 7. Modified VGG-Face network to support the multitask classification approach.
Figure 8. Offloading feature maps of detected heads to edge devices using multithreading.
Figure 9. Multithreading of parallel modules on edge server and edge node processors.
Figure 10. The framework of system deployment for the proposed deep face analysis.
Figure 11. Training and validation performance of the YOLOv8 model for head detection. The x-axis represents the number of epochs.
Figure 12. YOLOv8 testing performance; (a) confusion matrix, (b) precision, (c) recall, (d) precision-recall, and (e) F1 score confidence curve.
Figure 13. Head detection result samples of YOLOv8, where a red box denotes a detected head with its corresponding confidence level.
Figure 14. Confusion matrices for classification modules using STL and MTL; (a) hair color STL, (b) hair color MTL, (c) hairstyle STL, (d) hairstyle MTL, (e) gender STL, (f) gender MTL, (g) smile STL, (h) smile MTL, (i) face STL, (j) face MTL, (k) age STL, and (l) age MTL module.
Figure 15. Speed performance evaluation of the proposed pipeline-parallel architecture; (a) execution time for pipeline-parallel configurations versus sequential implementation, (b) speedup comparisons of implemented configurations.
22 pages, 9154 KiB  
Article
Turbulent Flow Through Sluice Gate and Weir Using Smoothed Particle Hydrodynamics: Evaluation of Turbulence Models, Boundary Conditions, and 3D Effects
by Efstathios Chatzoglou and Antonios Liakopoulos
Water 2025, 17(2), 152; https://doi.org/10.3390/w17020152 - 8 Jan 2025
Viewed by 720
Abstract
Understanding flow dynamics around hydraulic structures is essential for optimizing water management systems and predicting flow behavior in real-world applications. In this study, we simulate a 3D flow control system featuring a sluice gate and a weir, commonly used in hydraulic engineering. The focus is on accurately incorporating modified dynamic boundary conditions (mDBCs) and viscosity treatment to improve the simulation of complex, turbulent flows. We assess the performance of the Smoothed Particle Hydrodynamics (SPH) method in handling these challenging conditions, especially since boundary conditions and applicability to industry are two of the SPH method's grand challenges. Simulations were conducted on a Graphics Processing Unit (GPU) using the DualSPHysics code. The results were compared to theoretical predictions and experimental data found in the literature. Key hydraulic characteristics, including 3D flow effects, hydraulic jump formation, and turbulent behavior, are examined. The combination of mDBCs with the Laminar plus sub-particle scale turbulence model achieved the correct simulation results. The findings demonstrate agreement between simulations, theoretical predictions, and experimental results. This work provides a reliable framework for analyzing turbulent flows in hydraulic structures and can be used as reference data or a prototype for larger-scale simulations in both research and engineering design, particularly in contexts requiring robust and precise flow control and/or environmental management. Full article
(This article belongs to the Special Issue Hydrodynamic Science Experiments and Simulations)
Show Figures

Figure 1. mDBC representation. The mirroring of ghost nodes (crosses) and the kernel radius around the ghost nodes for boundary particles in a flat surface and a corner.
Figure 2. Simulation model set up. (a) L1 = 2.02 m, L2 = 4.12, Hweir = the height of the weir depending on the case, YU = upstream initial water depth = 20 cm, and YG = 0.025 m is the gate opening; (b) solid wall particles and the initial arrangement of liquid particles.
Figure 3. AVM. The transient formation of the hydraulic jump. (a) T = 50 s, (b) T = 100 s, (c) T = 150 s, (d) T = 200 s.
Figure 4. The definition sketch for corner vortices upstream of the sluice gate (from [42]).
Figure 5. AVM. The transient formation of the coherent vortices upstream of the sluice gate, as seen on a horizontal plane at Z = 18 cm, (a) T = 100 s, (b) T = 150 s, (c) T = 200 s. AA' sluice gate.
Figure 6. AVM. X-component of velocity. Front view. X = 1.81 m. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 7. AVM. X-component of velocity. Front view. X = 3 m. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 8. AVM. The velocity profiles at the selected longitudinal positions on the vertical mid-plane of the channel, Y = 0.075 m. (a) The location upstream of the sluice gate, X = 1.81 m. (b) The locations downstream of the sluice gate. T = 200 s, flow steady in the mean.
Figure 9. AVM. The side view Y = 0.075 m. Characteristics of the hydraulic jump at T = 200 s.
Figure 10. L-SPS. The transient formation of the hydraulic jump. (a) T = 50 s, (b) T = 100 s, (c) T = 150 s, (d) T = 200 s.
Figure 11. L-SPS. The transient formation of coherent vortices upstream of the sluice gate as seen on a horizontal plane at Z = 0.18 m, (a) T = 100 s, (b) T = 150 s, (c) T = 200 s. AA' the sluice gate.
Figure 12. L-SPS. X-component of velocity. X = 1.81 m. Front view. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 13. L-SPS. X-component of velocity at X = 3 m. Front view. Snapshots at T = 50, T = 100, T = 150, T = 200 s.
Figure 14. L-SPS. The velocity profile at selected longitudinal locations on plane Y = 0.075 m (mid vertical plane of the channel). (a) The position upstream of the sluice gate, X = 1.81 m, and (b) the positions downstream of the sluice gate. T = 200 s, steady in the mean flow.
Figure 15. L-SPS. Streamlines at Y = 0.075 m (channel mid vertical plane), T = 200 s.
Figure 16. Comparison of SPH results with experimental data [32] and Swamee [46] empirical relations.
Figure A1. AVM. The time series of the X-component of velocity at X = 1.81 m (upstream of the sluice gate). (a) Point X = 1.81 m, Y = 0.075 m, Z = 0.005 m; (b) Point X = 1.81 m, Y = 0.075 m, Z = 0.14 m; (c) Point X = 1.81 m, Y = 0.075 m, Z = 0.18 m.
Figure A2. AVM. The time series of the X-component of velocity at X = 3 m (downstream of the hydraulic jump). (a) Point X = 3 m, Y = 0.075 m, Z = 0.005 m; (b) Point X = 3 m, Y = 0.075 m, Z = 0.035 m; (c) Point X = 3 m, Y = 0.075 m, Z = 0.05 m.
Figure A3. The visualization of L-SPS at Y = 0.075 m, showing flow characteristics on a clipped plane with projections on the channel's back wall. The locations of measurement points (A1–A3, B1–B3) at X = 1.8 m and X = 3 m are highlighted.
Figure A4. L-SPS. The time series of the X-component of velocity at X = 1.81 m (upstream of the sluice gate). (a) Point A1, X = 1.81 m, Y = 0.075 m, Z = 0.005 m; (b) Point A2, X = 1.81 m, Y = 0.075 m, Z = 0.14 m; (c) Point A3, X = 1.81 m, Y = 0.075 m, Z = 0.18 m; (d) all points.
Figure A5. L-SPS. The time series of the X-component of velocity at X = 3 m (downstream of the hydraulic jump). (a) Point B1, X = 3 m, Y = 0.075 m, Z = 0.005 m; (b) Point B2, X = 3 m, Y = 0.075 m, Z = 0.035 m; (c) Point B3, X = 3 m, Y = 0.075 m, Z = 0.05 m; (d) all points.
34 pages, 9890 KiB  
Article
Synchronized Delay Measurement of Multi-Stream Analysis over Data Concentrator Units
by Anvarjon Yusupov, Sun Park and JongWon Kim
Electronics 2025, 14(1), 81; https://doi.org/10.3390/electronics14010081 - 27 Dec 2024
Viewed by 652
Abstract
Autonomous vehicles (AVs) rely heavily on multi-modal sensors to perceive their surroundings and make real-time decisions. However, the increasing complexity of these sensors, combined with the computational demands of AI models and the challenges of synchronizing data across multiple inputs, presents significant obstacles for AV systems. These challenges of the AV domain often lead to performance latency, resulting in delayed decision-making, causing major traffic accidents. The data concentrator unit (DCU) concept addresses these issues by optimizing data pipelines and implementing intelligent control mechanisms to process sensor data efficiently. Identifying and addressing bottlenecks that contribute to latency can enhance system performance, reducing the need for costly hardware upgrades or advanced AI models. This paper introduces a delay measurement tool for multi-node analysis, enabling synchronized monitoring of data pipelines across connected hardware platforms, such as clock-synchronized DCUs. The proposed tool traces the execution flow of software applications and assesses time delays at various stages of the data pipeline in clock-synchronized hardware. The various stages are represented with intuitive graphical visualization, simplifying the identification of performance bottlenecks. Full article
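The per-stage delay measurement described above reduces to recording timestamps at each pipeline stage and differencing consecutive ones; because the DCUs are PTP clock-synchronized, timestamps from different hosts are directly comparable. The toy tracer below is an illustrative assumption about how such records could be gathered, not the paper's tracing tool, which hooks into the applications' execution flow.

```python
import time
from collections import defaultdict

class StageTracer:
    """Record (stage, timestamp) events per frame and report inter-stage delays.
    time.time() is assumed to follow the PTP-disciplined system clock."""
    def __init__(self):
        self.events = defaultdict(list)   # frame_id -> [(stage, t_seconds), ...]

    def record(self, frame_id, stage):
        self.events[frame_id].append((stage, time.time()))

    def stage_delays(self, frame_id):
        """Delay between consecutive recorded stages, in milliseconds."""
        ev = sorted(self.events[frame_id], key=lambda e: e[1])
        return {f"{a[0]} -> {b[0]}": (b[1] - a[1]) * 1e3 for a, b in zip(ev, ev[1:])}

tracer = StageTracer()
tracer.record("frame-001", "camera_capture")
tracer.record("frame-001", "dcu_ingest")
tracer.record("frame-001", "inference_done")
print(tracer.stage_delays("frame-001"))
```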
(This article belongs to the Special Issue Advancements in Connected and Autonomous Vehicles)
Show Figures

Figure 1. Conceptual diagram of AI-integrated V2X-Car Edge Cloud.
Figure 2. Conceptual diagram of the delay measurement tool.
Figure 3. Design of software tracing module.
Figure 4. Processing design of delay calculation and visualization module.
Figure 5. Design of synchronized cameras and DCU connection diagram for high availability (HA).
Figure 6. PTP master clock initialization on the monitoring system.
Figure 7. PTP slave clock initialization on the target system.
Figure 8. Synchronization between the hardware clock and the system clock on the target system.
Figure 9. Leveraging open-source software for the delay measurement tool.
Figure 10. Prototype of the delay measurement environment with H/W and S/W.
Figure 11. DCU load test by increasing the number of video files.
Figure 12. Comparing the results of delay measurement times.
Figure 13. Task delay visualization GUI.
Figure 14. Frame-processing delay visualization GUI.
40 pages, 1079 KiB  
Article
Context-Adaptable Deployment of FastSLAM 2.0 on Graphic Processing Unit with Unknown Data Association
by Jessica Giovagnola, Manuel Pegalajar Cuéllar and Diego Pedro Morales Santos
Appl. Sci. 2024, 14(23), 11466; https://doi.org/10.3390/app142311466 - 9 Dec 2024
Viewed by 1071
Abstract
Simultaneous Localization and Mapping (SLAM) algorithms are crucial for enabling agents to estimate their position in unknown environments. In autonomous navigation systems, these algorithms need to operate in real-time on devices with limited resources, emphasizing the importance of reducing complexity and ensuring efficient performance. While SLAM solutions aim at ensuring accurate and timely localization and mapping, one of their main limitations is their computational complexity. In this scenario, particle filter-based approaches such as FastSLAM 2.0 can significantly benefit from parallel programming due to their modular construction. The parallelization process involves identifying the parameters affecting the computational complexity in order to distribute the computation among single multiprocessors as efficiently as possible. However, the computational complexity of methodologies such as FastSLAM 2.0 can depend on multiple parameters whose values may, in turn, depend on each specific use case scenario (i.e., the context), leading to multiple possible parallelization designs. Furthermore, the features of the hardware architecture in use can significantly influence the performance in terms of latency. Therefore, the selection of the optimal parallelization modality still needs to be empirically determined. This may involve redesigning the parallel algorithm depending on the context and the hardware architecture. In this paper, we propose a CUDA-based adaptable design for FastSLAM 2.0 on GPU, in combination with an evaluation methodology that enables the assessment of the optimal parallelization modality based on the context and the hardware architecture without the need for the creation of separate designs. The proposed implementation includes the parallelization of all the functional blocks of the FastSLAM 2.0 pipeline. Additionally, we contribute a parallelized design of the data association step through the Joint Compatibility Branch and Bound (JCBB) method. Multiple resampling algorithms are also included to accommodate the needs of a wide variety of navigation scenarios. Full article
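Among the building blocks the abstract mentions, resampling is the easiest to illustrate compactly. Below is a NumPy sketch of systematic resampling, one of the classic schemes a particle filter such as FastSLAM 2.0 may offer; it is shown only as an example of the step, not as the paper's CUDA implementation.

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Return N particle indices drawn with one random offset and evenly
    spaced positions over the cumulative weight distribution.
    `weights` must be normalized to sum to 1."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                      # guard against floating-point round-off
    return np.searchsorted(cumulative, positions)

weights = np.array([0.1, 0.4, 0.3, 0.2])
print(systematic_resample(weights))           # e.g., [1 1 2 3]
```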
Show Figures

Figure 1. FastSLAM 2.0 pipeline.
Figure 2. Observation model—graphical representation.
Figure 3. Hardware–software architecture schema.
Figure 4. Simulation environment schema.
Figure 5. Functional blocks partitioning schema.
Figure 6. Detailed heterogeneous architecture pipeline.
Figure 7. Data association pipeline.
Figure 8. Particle Initialization—elapsed time.
Figure 9. Particle Prediction—elapsed time.
Figure 10. Mahalanobis Distance—elapsed time.
Figure 11. Problem Preparation—elapsed time.
Figure 12. Branch and Bound—elapsed time.
Figure 13. Proposal Adjustment—elapsed time.
Figure 14. Landmark Estimation—elapsed time.
Figure 15. Resampling traditional methods—elapsed time.
Figure 16. Resampling alternative methods—elapsed time.
17 pages, 3121 KiB  
Article
Real-Time Radar Classification Based on Software-Defined Radio Platforms: Enhancing Processing Speed and Accuracy with Graphics Processing Unit Acceleration
by Seckin Oncu, Mehmet Karakaya, Yaser Dalveren, Ali Kara and Mohammad Derawi
Sensors 2024, 24(23), 7776; https://doi.org/10.3390/s24237776 - 4 Dec 2024
Viewed by 974
Abstract
This paper presents a comprehensive evaluation of real-time radar classification using software-defined radio (SDR) platforms. The transition from analog to digital technologies, facilitated by SDR, has revolutionized radio systems, offering unprecedented flexibility and reconfigurability through software-based operations. This advancement complements the role of radar signal parameters, encapsulated in the pulse description words (PDWs), which play a pivotal role in electronic support measure (ESM) systems, enabling the detection and classification of threat radars. This study proposes an SDR-based radar classification system that achieves real-time operation with enhanced processing speed. Employing the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm as a robust classifier, the system harnesses Graphical Processing Unit (GPU) parallelization for efficient radio frequency (RF) parameter extraction. The experimental results highlight the efficiency of this approach, demonstrating a notable improvement in processing speed while operating at a sampling rate of up to 200 MSps and achieving an accuracy of 89.7% for real-time radar classification. Full article
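The classification step described above, clustering pulse description words such as pulse width (PW) and radio frequency (RF) with DBSCAN, can be sketched with scikit-learn. The toy PDW values and DBSCAN parameters below are invented for illustration; the paper extracts the parameters on the SDR/GPU side and tunes the clustering for its own data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy pulse description words: pulse width (us) and RF carrier (MHz) for two emitters.
pdws = np.array([
    [1.0, 9400.0], [1.1, 9395.0], [0.9, 9405.0],    # emitter A
    [5.0, 2900.0], [5.2, 2905.0], [4.8, 2895.0],    # emitter B
    [9.0, 6000.0],                                  # stray pulse / noise
])

features = StandardScaler().fit_transform(pdws)      # put PW and RF on a common scale
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
print(labels)   # e.g., [0 0 0 1 1 1 -1]; -1 marks noise
```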
(This article belongs to the Section Radar Sensors)
Show Figures

Figure 1. Illustration of a functional diagram of an ESM system.
Figure 2. Functional structure of an SDR receiver.
Figure 3. IQ implementation of DCR.
Figure 4. Illustration of radar pulse and some basic parameters.
Figure 5. Illustration of the experimental setup.
Figure 6. Frequency measurement results of test scenario.
Figure 7. Flowchart of the proposed radar classification algorithm.
Figure 8. Clusters in PW-RF plane.
Figure 9. Clusters in PW-PA domain.
3597 KiB  
Proceeding Paper
A Tool for Improved Monitoring of Acoustic Beacons and Receivers of the KM3NeT Neutrino Telescope
by Letizia Stella Di Mauro, Dídac Diego-Tortosa, Giorgio Riccobene and Salvatore Viola
Eng. Proc. 2024, 82(1), 77; https://doi.org/10.3390/ecsa-11-20490 - 26 Nov 2024
Viewed by 149
Abstract
KM3NeT is an underwater neutrino detector currently under construction. Since the installation of its first detection unit in 2015, it has been continuously collecting data. Due to its complex design comprising a 3D array of sensors, an Acoustic Positioning System (APS) has been developed to monitor the position of each sensor. Given the increasing number of acoustic sensors used for the APS, both receivers and emitters, a solution has been implemented to check their status. In this contribution, a monitoring tool for this instrumentation is presented, capable of evaluating its status at both the data and operational levels. For effective monitoring, it is crucial to associate the signal recorded by a receiver with the corresponding transmitter. The Acoustic Data Filter (ADF) performs a cross-correlation between the signals retained in a buffer and those emitted by each installed emitter. It saves the maximum peak value and its associated time of arrival for each expected signal. However, the growing number of beacons complicates the differentiation of corresponding transmitters due to the huge amount of data recorded by the ADF needing post-processing. To address this challenge, a monitoring tool is developed that analyzes the internal clock of each emitter to distinguish and filter the data collected by the ADF. This tool has proven to be highly effective at verifying the correct operation of all acoustic devices deployed at sea. The acoustic monitoring graphical output produced for each data slot facilitates quick failure detection, enabling a swift response. Last but not least, the tool is modular and scalable, adapting to the addition or removal of sensors from the detector. Full article
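The cross-correlation step the ADF performs, matching a recorded buffer against each emitter's known waveform and keeping the peak and its time of arrival, can be sketched in a few lines of SciPy. The function and its interface are illustrative assumptions, not the KM3NeT code.

```python
import numpy as np
from scipy.signal import correlate

def detect_emitters(recorded, templates, fs):
    """Cross-correlate a recorded buffer with each emitter's reference waveform
    and keep, per emitter, the maximum correlation peak and its time of arrival.
    `templates` is a dict {emitter_id: reference_waveform}; all signals share
    the sampling rate fs (sketch only)."""
    results = {}
    for emitter, ref in templates.items():
        xcorr = correlate(recorded, ref, mode="valid")
        peak = int(np.argmax(np.abs(xcorr)))
        results[emitter] = {"peak": float(np.abs(xcorr[peak])), "toa_s": peak / fs}
    return results
```

Filtering the resulting times of arrival against each emitter's internal clock (the repetition rate used in the figures below) is what separates correct from spurious detections.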
Show Figures

Figure 1. (a) Acoustic beacon MAB100 (aluminium version) produced by MSM, Valencia, Spain; (b) DG0330 hydrophone, produced by Colmar, La Spezia, Italy; (c) Pz27 encapsuled piezoceramic sensor, assembled by GCD-PCB-Design GmbH, Erlangen, Germany.
Figure 2. (a) D0ARCA028 footprint; (b) D1ORCA019 footprint.
Figure 3. Modulus between the selected ToAs and the RR plotted as a function of time. The correct ToAs (highlighted in green) align along a straight line. This example illustrates the APS data from the ARCA hydrophone in DU26 related to WF14.
Figure 4. Modulus between the selected ToAs and the RR plotted as a function of time. This example illustrates the APS data from the ARCA hydrophone in DU1 related to WF33.
Figure 5. Modulus between the selected ToAs and the RR plotted as a function of time. This example illustrates the APS data from the ARCA hydrophone in DU28 related to WF33 using (a) n = 1, where all the data correspond to incorrect ToAs, and (b) n = 5, where there are some correct ToAs.
Figure 6. Modulus of the selected ToAs and the RR plotted as a function of time. (a) Incorrect ToAs. Data from the ARCA hydrophone in DU19, where ToAs deviate from the expected alignment, suggesting a potential issue with this hydrophone. (b) Correct ToAs. Data from the ARCA hydrophone in DU22, where ToAs align correctly along the expected straight line, indicating proper functioning of the AB.
Figure 7. Example of a cumulative acoustic monitoring plot over multiple runs for ORCA.
Figure 8. Example of a cumulative acoustic monitoring plot over multiple runs for ARCA.
23 pages, 8542 KiB  
Article
Graphics Processing Unit-Accelerated Propeller Computational Fluid Dynamics Using AmgX: Performance Analysis Across Mesh Types and Hardware Configurations
by Yue Zhu, Jin Gan, Yongshui Lin and Weiguo Wu
J. Mar. Sci. Eng. 2024, 12(12), 2134; https://doi.org/10.3390/jmse12122134 - 22 Nov 2024
Viewed by 743
Abstract
Computational fluid dynamics (CFD) has become increasingly prevalent in marine and offshore engineering, with enhancing simulation efficiency emerging as a critical challenge. This study systematically evaluates the application of graphics processing unit (GPU) acceleration technology in CFD simulation of propeller open water performance. Numerical simulations of the VP1304 propeller model were performed using OpenFOAM v2312 integrated with the NVIDIA AmgX library. The research compared GPU acceleration performance against conventional CPU methods across various hardware configurations and mesh types (tetrahedral, hexahedral-dominant, and polyhedral). Results demonstrate that GPU acceleration significantly improved computational efficiency, with tetrahedral meshes achieving over 400% speedup in a 4-GPU configuration, while polyhedral meshes reached over 500% speedup with a fixed mesh count. Among the mesh types, hexahedral-dominant meshes performed best in capturing flow field details. The study also found that GPU acceleration does not compromise simulation accuracy, but its effectiveness is closely related to mesh type and hardware configuration. Notably, GPUs demonstrate more significant advantages when handling large-scale problems. These findings have important practical implications for improving propeller design processes and shortening product development cycles. Full article
(This article belongs to the Section Ocean Engineering)
Show Figures

Figure 1. CFD simulation process accelerated through GPU in OpenFOAM.
Figure 2. Geometric model of propeller VP1304. (a) Front view; (b) side view.
Figure 3. Numerical simulation domain for open water performance of VP1304 propeller.
Figure 4. Details of CFD mesh refinement.
Figure 5. Comparative illustration of three computational domain dimensions (small, medium, and large).
Figure 6. Different mesh types for CFD simulations. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 7. Comparison of open water performance of propeller between simulation results on different hardware platforms and experimental data.
Figure 8. Comparison of open water performance of propeller between simulation results using different grid types and experimental data, with a base size of 4.5 mm.
Figure 9. Pressure distribution contour plots for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 10. Vorticity distribution for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 11. Velocity distribution for different mesh types. (a) Tetrahedral mesh; (b) hex-dominant mesh; (c) polyhedral mesh.
Figure 12. Simulation time versus number of CPU cores for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 13. Speedup factor versus number of CPU cores for different element types, corresponding to Figure 12. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 14. Simulation time versus number of GPUs for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 15. Speedup factor versus number of GPUs for different mesh types. (a) Fixed mesh size (4.5 mm). (b) Fixed mesh count (3.3 million elements).
Figure 16. Speedup of different numbers of GPUs compared to 32-core CPU with consistent mesh size.
Figure 17. Speedup of different numbers of GPUs compared to 32-core CPU with consistent mesh number.
20 pages, 5217 KiB  
Article
A Real-Time Signal Measurement System Using FPGA-Based Deep Learning Accelerators and Microwave Photonic
by Longlong Zhang, Tong Zhou, Jie Yang, Yin Li, Zhiwen Zhang, Xiang Hu and Yuanxi Peng
Remote Sens. 2024, 16(23), 4358; https://doi.org/10.3390/rs16234358 - 22 Nov 2024
Viewed by 924
Abstract
Deep learning techniques have been widely investigated as an effective method for signal measurement in recent years. However, most existing deep learning-based methods are still difficult to deploy on embedded platforms and perform poorly in real-time applications. To address this, this paper develops two accelerators, as the core of the signal measurement system, for intelligent signal processing. Firstly, by introducing the idea of an automated framework, we propose a minimal deep neural network (DNN)-based hardware structure, which automatically maps algorithms to hardware modules, supports configurable parameters, and has the advantage of low latency, with an average inference time of only 3.5 μs. Subsequently, another accelerator is designed with an efficient hardware structure for the long short-term memory (LSTM) + DNN model, demonstrating outstanding performance with a classification accuracy of 98.82%, a mean absolute error (MAE) of 0.27°, and a root mean square error (RMSE) of 0.392° after model compression. Moreover, parallel optimization strategies are exploited to further reduce latency and support simultaneous frequency and direction measurement tasks. Finally, we test the accelerators on actually collected signal data using the XCVU13P field programmable gate array (FPGA). The results show that inference time is reduced by 28–31% for the DNN model and by 71–73% for the LSTM + DNN model compared to running on a graphics processing unit (GPU). In addition, the parallel strategies further decrease the delay by 23.9% and 37.5% when processing continuous data. The FPGA-based and deep learning-assisted hardware accelerators significantly improve real-time performance and provide a promising solution for signal measurement. Full article
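The LSTM + DNN model referenced above, an LSTM summarizing a short sequence of digitized envelope samples followed by fully connected layers producing the estimate, can be sketched in PyTorch. The layer sizes, input dimensionality, and single regression output below are placeholders, not the compressed model deployed on the FPGA.

```python
import torch
import torch.nn as nn

class LSTMDNN(nn.Module):
    """LSTM feature extractor + fully connected head for angle estimation (sketch)."""
    def __init__(self, in_features=4, hidden=64, dnn_hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(hidden, dnn_hidden), nn.ReLU(),
            nn.Linear(dnn_hidden, 1))          # predicted direction of arrival (degrees)

    def forward(self, x):                      # x: (batch, seq_len, in_features)
        out, _ = self.lstm(x)
        return self.dnn(out[:, -1, :])         # use the last time step's hidden state

model = LSTMDNN()
dummy = torch.randn(8, 16, 4)                  # 8 sequences, 16 time steps, 4 channels
print(model(dummy).shape)                      # torch.Size([8, 1])
```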
Show Figures

Figure 1. Microwave direction finding system with long-baseline array. DDMZM: dual-drive Mach Zehnder modulator; PD: photodetector; LNA: low noise amplifier; Ei: digitized envelope voltage.
Figure 2. The LSTM cell.
Figure 3. The proposed architecture of the overall system.
Figure 4. The framework from algorithm to hardware implementation based on the DNN model.
Figure 5. The least complex hardware structure based on the DNN model.
Figure 6. The hardware design of the intelligent processing module based on LSTM + DNN.
Figure 7. Parallel strategies within the layers. (a) LSTM layer; (b) FC layer.
Figure 8. Coarse-grained inter-layer parallelism strategy between layers. (a) The original latency; (b) the optimized latency.
Figure 9. The task-level parallel strategy of the intelligent processing module.
Figure 10. The loss and accuracy versus epoch given by the proposed LSTM + DNN model. (a) The loss; (b) the accuracy.
Figure 11. The experimental results of DOA estimation, including actual DOA, estimated DOA, and the corresponding errors. (a) The DNN model; (b) the LSTM + DNN model.
Figure 12. Utilized area of the compressed model for DOA. The orange represents the LSTM layer, while the green represents the other layers.
Figure 13. Comparison of latency for processing multiple input data based on FPGA. (a) DOA task; (b) IFM task.