Search Results (250)

Search Parameters:
Keywords = large receptive field

21 pages, 11232 KiB  
Article
Deep Learning-Based Docking Scheme for Autonomous Underwater Vehicles with an Omnidirectional Rotating Optical Beacon
by Yiyang Li, Kai Sun, Zekai Han and Jichao Lang
Drones 2024, 8(12), 697; https://doi.org/10.3390/drones8120697 - 21 Nov 2024
Viewed by 331
Abstract
Visual recognition and localization of underwater optical beacons are critical for AUV docking, but traditional beacons are limited by fixed directionality and light attenuation in water. To extend the range of optical docking, this study designs a novel omnidirectional rotating optical beacon that provides 360-degree light coverage over 45 m, improving beacon detection probability through synchronized scanning. Addressing the challenges of light centroid detection, we introduce a parallel deep learning detection algorithm based on an improved YOLOv8-pose model. Initially, an underwater optical beacon dataset encompassing various light patterns was constructed. Subsequently, the network was optimized by incorporating a small detection head, implementing dynamic convolution and receptive-field attention convolution for single-stage multi-scale localization. A post-processing method based on keypoint joint IoU matching was proposed to filter redundant detections. The algorithm achieved 93.9% AP at 36.5 FPS, with at least a 5.8% increase in detection accuracy over existing methods. Moreover, a light-source-based measurement method was developed to accurately detect the beacon’s orientation. Experimental results indicate that this scheme can achieve high-precision omnidirectional guidance with azimuth and pose estimation errors of -4.54° and 3.09°, respectively, providing a reliable solution for long-range and large-scale optical docking. Full article
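For readers unfamiliar with keypoint-aware duplicate suppression, the post-processing idea described above (discarding redundant detections by jointly comparing box overlap and keypoint agreement) can be sketched roughly as follows. The joint score blending box IoU with an OKS-like keypoint similarity, the weight kpt_w, and the helper names are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def keypoint_joint_nms(boxes, keypoints, scores, thr=0.5, kpt_w=0.5):
    """Greedy suppression using a joint box-IoU / keypoint-similarity score (illustrative).

    boxes: (N, 4) array, keypoints: (N, K, 2) array, scores: (N,) array.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        survivors = []
        scale = np.sqrt((boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])) + 1e-9
        for j in order[1:]:
            # OKS-like keypoint agreement: mean keypoint distance relative to the box scale
            kpt_sim = np.exp(-np.linalg.norm(keypoints[i] - keypoints[j], axis=-1).mean() / scale)
            joint = (1 - kpt_w) * box_iou(boxes[i], boxes[j]) + kpt_w * kpt_sim
            if joint < thr:          # keep only detections that disagree with the retained one
                survivors.append(j)
        order = np.array(survivors, dtype=int)
    return keep
```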
Figures:
Figure 1. Framework of the underwater omnidirectional rotating optical beacon docking system.
Figure 2. Schematic of the underwater omnidirectional rotating optical beacon docking system.
Figure 3. Structural diagram of the underwater omnidirectional rotating optical beacon.
Figure 4. Underwater light source selection. (a) 10 W, 60°; (b) 30 W, 60°; (c) 30 W, 10°.
Figure 5. Annotation information of the underwater optical beacon dataset. (a) Normalized positions of the bounding boxes; (b) normalized sizes of the bounding boxes. Both panels are histograms with 50 bins per dimension; darker colours indicate more partitions.
Figure 6. Improved network architecture of YOLOv8-pose.
Figure 7. Structure of RFAConv.
Figure 8. Example of redundant bounding boxes.
Figure 9. Detection results of different methods. Each row from top to bottom corresponds to scenario 1, scenario 2, and scenario 3. (a) Ours; (b) YOLOv8n-pose; (c) YOLOv8n with centroid; (d) traditional method; (e) CNN.
Figure 10. Error diagram.
Figure 11. Experimental setup.
Figure 12. Detection results of different methods. (a) Daylight, beacon facing forward; (b) darkness, beacon facing forward; (c) daylight, beacon facing sideways; (d) darkness, beacon facing sideways.
18 pages, 4823 KiB  
Article
ME-FCN: A Multi-Scale Feature-Enhanced Fully Convolutional Network for Building Footprint Extraction
by Hui Sheng, Yaoteng Zhang, Wei Zhang, Shiqing Wei, Mingming Xu and Yasir Muhammad
Remote Sens. 2024, 16(22), 4305; https://doi.org/10.3390/rs16224305 - 19 Nov 2024
Viewed by 478
Abstract
The precise extraction of building footprints using remote sensing technology is increasingly critical for urban planning and development amid growing urbanization. However, considering the complexity of building backgrounds, diverse scales, and varied appearances, accurately and efficiently extracting building footprints from various remote sensing images remains a significant challenge. In this paper, we propose a novel network architecture called ME-FCN, specifically designed to perceive and optimize multi-scale features to effectively address the challenge of extracting building footprints from complex remote sensing images. We introduce a Squeeze-and-Excitation U-Block (SEUB), which cascades multi-scale semantic information exploration in shallow feature maps and incorporates channel attention to optimize features. In the network’s deeper layers, we implement an Adaptive Multi-scale feature Enhancement Block (AMEB), which captures large receptive field information through concatenated atrous convolutions. Additionally, we develop a novel Dual Multi-scale Attention (DMSA) mechanism to further enhance the accuracy of cascaded features. DMSA captures multi-scale semantic features across both channel and spatial dimensions, suppresses redundant information, and realizes multi-scale feature interaction and fusion, thereby improving the overall accuracy and efficiency. Comprehensive experiments on three datasets demonstrate that ME-FCN outperforms mainstream segmentation methods. Full article
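One plausible reading of how AMEB obtains a large receptive field from concatenated atrous convolutions is sketched below in PyTorch; the parallel-branch layout, channel counts, and dilation rates are assumptions for illustration, and the auxiliary branch described in the figure captions is omitted.

```python
import torch
import torch.nn as nn

class AtrousFusion(nn.Module):
    """Dilated 3x3 branches concatenated and fused by a 1x1 convolution (illustrative)."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False) for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        # each branch sees a progressively larger receptive field at the same resolution
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))

# usage: y = AtrousFusion(256)(torch.randn(1, 256, 32, 32))
```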
Figures:
Figure 1. Architecture of the proposed ME-FCN method (blue represents the encoder and orange the decoder).
Figure 2. Overall architecture of the SEUB. SEUB adopts a simple U-shaped structure, with the bottom layer utilizing atrous convolution to achieve a larger receptive field, while the skip connections incorporate squeeze-and-excitation attention to enhance network stability.
Figure 3. Residual structure schematic of SEUB.
Figure 4. Schematic diagram of the Adaptive Multi-scale feature Enhancement Block. The AMEB consists of atrous convolutions with multiple dilation rates and introduces an auxiliary branch, allowing high-level feature maps to autonomously select information utilization and reducing the randomness of feature utilization.
Figure 5. Dual Multi-scale Attention structure. DMSA performs feature alignment at three scales to alleviate semantic differences and achieve feature interaction and fusion.
Figure 6. Predicted results from different algorithms on the WHU aerial dataset (blue indicates false negatives, red false positives, white correctly identified positive samples, and black correctly identified negative samples).
Figure 7. Predicted results from different algorithms on the Massachusetts dataset (blue indicates false negatives, red false positives, white correctly identified positive samples, and black correctly identified negative samples).
Figure 8. Predicted results from different algorithms on the GF-2 building dataset (blue indicates false positives, red false negatives, white correctly identified positive samples, and black correctly identified negative samples).
Figure 9. Ablation experiment results with different module combinations. (a) Image; (b) ground truth; (c) baseline; (d) baseline + SEUB; (e) baseline + SEUB + DMSA; (f) baseline + SEUB + DMSA + AMEB.
21 pages, 7007 KiB  
Article
LEM-Detector: An Efficient Detector for Photovoltaic Panel Defect Detection
by Xinwen Zhou, Xiang Li, Wenfu Huang and Ran Wei
Appl. Sci. 2024, 14(22), 10290; https://doi.org/10.3390/app142210290 - 8 Nov 2024
Viewed by 475
Abstract
Photovoltaic panel defect detection presents significant challenges due to the wide range of defect scales, diverse defect types, and severe background interference, often leading to a high rate of false positives and missed detections. To address these challenges, this paper proposes the LEM-Detector, an efficient end-to-end photovoltaic panel defect detector based on the transformer architecture. To address the low detection accuracy for Crack and Star crack defects and the imbalanced dataset, a novel data augmentation method, the Linear Feature Augmentation (LFA) module, specifically designed for linear features, is introduced. LFA effectively improves model training performance and robustness. Furthermore, the Efficient Feature Enhancement Module (EFEM) is presented to enhance the receptive field, suppress redundant information, and emphasize meaningful features. To handle defects of varying scales, complementary semantic information from different feature layers is leveraged for enhanced feature fusion. A Multi-Scale Multi-Feature Pyramid Network (MMFPN) is employed to selectively aggregate boundary and category information, thereby improving the accuracy of multi-scale target recognition. Experimental results on a large-scale photovoltaic panel dataset demonstrate that the LEM-Detector achieves a detection accuracy of 94.7% for multi-scale defects, outperforming several state-of-the-art methods. This approach effectively addresses the challenges of photovoltaic panel defect detection, paving the way for more reliable and accurate defect identification systems. This research will contribute to the automatic detection of surface defects in industrial production, ultimately enhancing production efficiency. Full article
Figures:
Figure 1. Common defect types in photovoltaic panels.
Figure 2. Overall framework of the LEM-Detector.
Figure 3. Architecture of the LFA.
Figure 4. Effects of data augmentation (A: defect overlap; B: small-scale Star crack; C: small-scale Finger; D: crack crossing the busbar).
Figure 5. Architecture of the EFEM.
Figure 6. Architecture of the CIA.
Figure 7. The proposed LEM-Detector achieves state-of-the-art performance compared with existing prominent object detectors.
Figure 8. P-R curve of the LEM-Detector.
Figure 9. Detection results of the LEM-Detector.
Figure 10. Heatmaps of the feature extraction stage.
Figure 11. Heatmaps of the feature fusion stage.
20 pages, 11655 KiB  
Article
Variational Color Shift and Auto-Encoder Based on Large Separable Kernel Attention for Enhanced Text CAPTCHA Vulnerability Assessment
by Xing Wan, Juliana Johari and Fazlina Ahmat Ruslan
Information 2024, 15(11), 717; https://doi.org/10.3390/info15110717 - 7 Nov 2024
Viewed by 472
Abstract
Text CAPTCHAs are crucial security measures deployed on global websites to deter unauthorized intrusions. The presence of anti-attack features incorporated into text CAPTCHAs limits the effectiveness of evaluating them, despite CAPTCHA recognition being an effective method for assessing their security. This study introduces a novel color augmentation technique called Variational Color Shift (VCS) to boost the recognition accuracy of different networks. VCS generates a color shift of every input image and then resamples the image within that range to generate a new image, thus expanding the number of samples of the original dataset to improve training effectiveness. In contrast to Random Color Shift (RCS), which treats the color offsets as hyperparameters, VCS estimates color shifts by reparametrizing the points sampled from the uniform distribution using predicted offsets according to every image, which makes the color shifts learnable. To better balance the computation and performance, we also propose two variants of VCS: Sim-VCS and Dilated-VCS. In addition, to solve the overfitting problem caused by disturbances in text CAPTCHAs, we propose an Auto-Encoder (AE) based on Large Separable Kernel Attention (AE-LSKA) to replace the convolutional module with large kernels in the text CAPTCHA recognizer. This new module employs an AE to compress the interference while expanding the receptive field using Large Separable Kernel Attention (LSKA), reducing the impact of local interference on the model training and improving the overall perception of characters. The experimental results show that the recognition accuracy of the model after integrating the AE-LSKA module is improved by at least 15 percentage points on both M-CAPTCHA and P-CAPTCHA datasets. In addition, experimental results demonstrate that color augmentation using VCS is more effective in enhancing recognition, which has higher accuracy compared to RCS and PCA Color Shift (PCA-CS). Full article
(This article belongs to the Special Issue Computer Vision for Security Applications)
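The reparameterization idea behind VCS (predicting a per-image colour-shift range and sampling within it so that the shift remains differentiable) can be sketched as below. The tiny range predictor, the additive per-channel shift, and the range cap are illustrative assumptions rather than the authors' exact network.

```python
import torch
import torch.nn as nn

class LearnableColorShift(nn.Module):
    """Per-image colour shift whose range is predicted, kept differentiable by reparameterization."""
    def __init__(self, max_shift=0.2):
        super().__init__()
        self.max_shift = max_shift
        # hypothetical tiny predictor: mean RGB -> per-channel shift range in [0, max_shift]
        self.range_head = nn.Sequential(nn.Linear(3, 16), nn.ReLU(),
                                        nn.Linear(16, 3), nn.Sigmoid())

    def forward(self, img):                       # img: (B, 3, H, W) in [0, 1]
        half_range = self.range_head(img.mean(dim=(2, 3))) * self.max_shift
        u = torch.rand_like(half_range) * 2 - 1   # u ~ U(-1, 1), sampled outside the graph
        shift = u * half_range                    # reparameterization: gradients flow into the range
        return (img + shift[:, :, None, None]).clamp(0, 1)

# usage: augmented = LearnableColorShift()(torch.rand(4, 3, 64, 64))
```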
Figures:
Graphical abstract.
Figure 1. Structure and training process of Deep-CAPTCHA.
Figure 2. Adaptive-CAPTCHA, improved on Deep-CAPTCHA.
Figure 3. Samples of P-CAPTCHA.
Figure 4. Samples of M-CAPTCHA.
Figure 5. Flowchart of VCS.
Figure 6. Difference between VCS and Sim-VCS.
Figure 7. Dilated convolution-based sampling of Dilated-VCS.
Figure 8. AE-LSKAs with different dilated kernels and receptive fields.
Figure 9. AASR comparison between VCS and RCS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 10. AASR comparison between VCS and RCS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 11. AASR comparison between VCS and PCA-CS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 12. AASR comparison between VCS and PCA-CS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 13. Learning process of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 14. Validation accuracy comparison of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 15. Validation accuracy comparison of AE-LSKAs with different kernels integrated into Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 16. Individual character accuracy comparison using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
17 pages, 11245 KiB  
Article
Underwater Object Detection Algorithm Based on an Improved YOLOv8
by Fubin Zhang, Weiye Cao, Jian Gao, Shubing Liu, Chenyang Li, Kun Song and Hongwei Wang
J. Mar. Sci. Eng. 2024, 12(11), 1991; https://doi.org/10.3390/jmse12111991 - 5 Nov 2024
Viewed by 666
Abstract
Due to the complexity and diversity of underwater environments, traditional object detection algorithms face challenges in maintaining robustness and detection accuracy when applied underwater. This paper proposes an underwater object detection algorithm based on an improved YOLOv8 model. First, the introduction of CIB building blocks into the backbone network, along with the optimization of the C2f structure and the incorporation of large-kernel depthwise convolutions, effectively enhances the model’s receptive field. This improvement increases the capability of detecting multi-scale objects in complex underwater environments without adding a computational burden. Next, the incorporation of a Partial Self-Attention (PSA) module at the end of the backbone network enhances model efficiency and optimizes the utilization of computational resources while maintaining high performance. Finally, the integration of the Neck component from the Gold-YOLO model improves the neck structure of the YOLOv8 model, facilitating the fusion and distribution of information across different levels, thereby achieving more efficient information integration and interaction. Experimental results show that YOLOv8-CPG significantly outperforms the traditional YOLOv8 in underwater environments. Precision and Recall show improvements of 2.76% and 2.06%. Additionally, mAP50 and mAP50-95 metrics have increased by 1.05% and 3.55%, respectively. Our approach provides an efficient solution to the difficulties encountered in underwater object detection. Full article
(This article belongs to the Special Issue Intelligent Measurement and Control System of Marine Robots)
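A minimal sketch of how a large-kernel depthwise convolution enlarges the receptive field without a heavy computational burden is given below; the 7x7 kernel and the inverted-bottleneck layout are illustrative assumptions, not the exact CIB/C2f configuration.

```python
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Depthwise large-kernel conv for spatial mixing + 1x1 convs for channel mixing (illustrative)."""
    def __init__(self, channels, k=7):
        super().__init__()
        # k*k*C weights instead of k*k*C*C, so the receptive field grows cheaply
        self.dw = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, 2 * channels, 1)
        self.act = nn.SiLU()
        self.pw2 = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        y = self.bn(self.dw(x))
        return x + self.pw2(self.act(self.pw1(y)))   # residual connection keeps training stable

# usage: out = LargeKernelDWBlock(128)(torch.randn(1, 128, 40, 40))
```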
Figures:
Figure 1. Structure diagram of the CIB. (a) CIB deployment; (b) internal structure of the CIB.
Figure 2. Structure diagram of the CIBC2f.
Figure 3. Structure diagram of the low-order collection-distribution mechanism.
Figure 4. Some examples from CoopKnowledge's A dataset.
Figure 5. Experimental accuracy results of YOLOv8s-CPG. (a) Precision data; (b) detection accuracy data.
Figure 6. Ablation experiment results of YOLOv8s-CIB+PSA+GY (YOLOv8s-CPG).
Figure 7. Comparison of model indicators of the YOLO series.
Figure 8. Comparison of YOLOv8-CPG in a real underwater environment.
Figure 9. Comparison of YOLOv8-CPG on the WildFish dataset.
23 pages, 5508 KiB  
Article
YOLO-DroneMS: Multi-Scale Object Detection Network for Unmanned Aerial Vehicle (UAV) Images
by Xueqiang Zhao and Yangbo Chen
Drones 2024, 8(11), 609; https://doi.org/10.3390/drones8110609 - 24 Oct 2024
Viewed by 1067
Abstract
In recent years, research on Unmanned Aerial Vehicles (UAVs) has developed rapidly. Compared to traditional remote-sensing images, UAV images exhibit complex backgrounds, high resolution, and large differences in object scales. Therefore, UAV object detection is an essential yet challenging task. This paper proposes a multi-scale object detection network, namely YOLO-DroneMS (You Only Look Once for Drone Multi-Scale Object), for UAV images. Targeting the pivotal connection between the backbone and neck, the Large Separable Kernel Attention (LSKA) mechanism is adopted with the Spatial Pyramid Pooling Factor (SPPF), where weighted processing of multi-scale feature maps is performed to focus more on features. And Attentional Scale Sequence Fusion DySample (ASF-DySample) is introduced to perform attention scale sequence fusion and dynamic upsampling to conserve resources. Then, the faster cross-stage partial network bottleneck with two convolutions (named C2f) in the backbone is optimized using the Inverted Residual Mobile Block and Dilated Reparam Block (iRMB-DRB), which balances the advantages of dynamic global modeling and static local information fusion. This optimization effectively increases the model’s receptive field, enhancing its capability for downstream tasks. By replacing the original CIoU with WIoUv3, the model prioritizes anchoring boxes of superior quality, dynamically adjusting weights to enhance detection performance for small objects. Experimental findings on the VisDrone2019 dataset demonstrate that at an Intersection over Union (IoU) of 0.5, YOLO-DroneMS achieves a 3.6% increase in mAP@50 compared to the YOLOv8n model. Moreover, YOLO-DroneMS exhibits improved detection speed, increasing the number of frames per second (FPS) from 78.7 to 83.3. The enhanced model supports diverse target scales and achieves high recognition rates, making it well-suited for drone-based object detection tasks, particularly in scenarios involving multiple object clusters. Full article
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)
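Large Separable Kernel Attention, used here after SPPF, factorizes a large depthwise kernel into 1-D horizontal/vertical and dilated depthwise convolutions followed by a pointwise convolution, and uses the result to re-weight the input. The sketch below follows the general LSKA recipe; the kernel size and dilation values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention: separable depthwise convs build an attention map (illustrative)."""
    def __init__(self, dim, k=23, d=3):
        super().__init__()
        local, span = 2 * d - 1, k // d          # local 1-D kernel and dilated 1-D kernel sizes
        self.dw_h = nn.Conv2d(dim, dim, (1, local), padding=(0, local // 2), groups=dim)
        self.dw_v = nn.Conv2d(dim, dim, (local, 1), padding=(local // 2, 0), groups=dim)
        self.dwd_h = nn.Conv2d(dim, dim, (1, span), padding=(0, (span // 2) * d), dilation=d, groups=dim)
        self.dwd_v = nn.Conv2d(dim, dim, (span, 1), padding=((span // 2) * d, 0), dilation=d, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.dw_v(self.dw_h(x))           # short-range context from small 1-D kernels
        attn = self.dwd_v(self.dwd_h(attn))      # long-range context from dilated 1-D kernels
        return x * self.pw(attn)                 # re-weight the input with the attention map

# usage: y = LSKA(256)(torch.randn(1, 256, 20, 20))
```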
Figures:
Figure 1. YOLOv8 network structure.
Figure 2. Model architecture of YOLO-DroneMS. SPPF-LSKA performs weighted processing on multi-scale feature maps, C2f-iRMB-DRB balances the advantages of dynamic global modeling and local information fusion, and ASF-DySample dynamically upsamples the neck section.
Figure 3. Structure of LSKA and SPPF-LSKA. The LSKA mechanism is applied after SPPF to perform weighted processing on multi-scale feature maps.
Figure 4. Point sampling of DySample. DySample dynamically adjusts sampling points by summing offsets with the original grid positions.
Figure 5. Architecture of ASF-DySample, in which the DynamicScalSeq module merges features from shallow feature maps via the Add operation.
Figure 6. Structure of the iRMB network. iRMB fuses lightweight CNN architectures and attention-based model structures.
Figure 7. Structure of the iRMB-DRB network.
Figure 8. Structure of the C2f-iRMB-DRB submodule. The iRMB-DRB network enhances the model's capability to handle features at various scales.
Figure 9. Total number of object instances in the VisDrone2019 dataset.
Figure 10. Comparison of mAP@50 for different models on the VisDrone2019 dataset.
Figure 11. Comparison of confusion matrices of YOLOv8 and YOLO-DroneMS. The "pedestrian" class shows an increase of 9%, and the "people" and "car" categories improve by 8% and 7%, respectively.
Figure 12. Visualization of the comparison of different models.
Figure 13. Statistical plot of the classes in the RiverInspect-2024 dataset.
Figure 14. Examples of RiverInspect-2024, which is used for a river drone-inspection project.
25 pages, 2849 KiB  
Article
Enhanced Hybrid U-Net Framework for Sophisticated Building Automation Extraction Utilizing Decay Matrix
by Ting Wang, Zhuyi Gong, Anqi Tang, Qian Zhang and Yun Ge
Buildings 2024, 14(11), 3353; https://doi.org/10.3390/buildings14113353 - 23 Oct 2024
Viewed by 582
Abstract
Automatically extracting buildings from remote sensing imagery using deep learning techniques has become essential for various real-world applications. However, mainstream methods often encounter difficulties in accurately extracting and reconstructing fine-grained features due to the heterogeneity and scale variations in building appearances. To address these challenges, we propose LDFormer, an advanced building segmentation model based on linear decay. LDFormer introduces a multi-scale detail fusion bridge (MDFB), which dynamically integrates shallow features to enhance the representation of local details and capture fine-grained local features effectively. To improve global feature extraction, the model incorporates linear decay self-attention (LDSA) and depthwise large separable kernel multi-layer perceptron (DWLSK-MLP) optimizations in the decoder. Specifically, LDSA employs a linear decay matrix within the self-attention mechanism to address long-distance dependency issues, while DWLSK-MLP utilizes step-wise convolutions to achieve a large receptive field. The proposed method has been evaluated on the Massachusetts, Inria, and WHU building datasets, achieving IoU scores of 76.10%, 82.87%, and 91.86%, respectively. LDFormer demonstrates superior performance compared to existing state-of-the-art methods in building segmentation tasks, showcasing its significant potential for building automation extraction. Full article
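The linear-decay self-attention idea (penalizing attention logits in proportion to token distance so that far-away positions contribute less) can be sketched as follows; the 1-D distance and the fixed slope are simplifying assumptions, not the exact decay matrix used in LDFormer.

```python
import torch
import torch.nn.functional as F

def linear_decay_attention(q, k, v, slope=0.05):
    """Self-attention whose logits are penalized linearly with token distance (illustrative)."""
    # q, k, v: (B, N, C) token sequences flattened from the feature map
    B, N, C = q.shape
    logits = q @ k.transpose(-2, -1) / C ** 0.5            # standard scaled dot-product scores
    idx = torch.arange(N, device=q.device)
    decay = slope * (idx[None, :] - idx[:, None]).abs()    # linear penalty grows with distance
    weights = F.softmax(logits - decay, dim=-1)            # distant tokens are down-weighted
    return weights @ v

# usage: out = linear_decay_attention(*(torch.randn(2, 196, 64) for _ in range(3)))
```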
Figures:
Figure 1. Some challenges in remote sensing images. (a) Buildings vary in size, shape, texture, and color. (b) Shadows obscure the buildings in remote sensing images.
Figure 2. An overview of the LDFormer model.
Figure 3. The structure of LDBlock. (a) Block in Swin Transformer. (b) LDBlock in LDFormer.
Figure 4. The structure of the Multi-scale Detail Fusion Bridge (MDFB).
Figure 5. Illustration of the LDSA strategy. Different colors represent different weight values; the closer to the token center, the greater the weight value.
Figure 6. (a) DW-MLP. (b) MS-MLP. (c) Our DWLSK-MLP.
Figure 7. Qualitative comparison on the Massachusetts test set. Red boxes highlight the differences to facilitate model comparison.
Figure 8. Visualization of large-image inference on the Massachusetts dataset.
Figure 9. Qualitative comparison on the WHU (left) and Inria (right) test sets.
Figure 10. Ablation analysis of the impact of the number of model heads and window size on the Inria building dataset.
Figure 11. Model complexity comparison of LDFormer on the Inria dataset.
16 pages, 21131 KiB  
Article
GCS-YOLOv8: A Lightweight Face Extractor to Assist Deepfake Detection
by Ruifang Zhang, Bohan Deng, Xiaohui Cheng and Hong Zhao
Sensors 2024, 24(21), 6781; https://doi.org/10.3390/s24216781 - 22 Oct 2024
Viewed by 522
Abstract
To address the issues of target feature blurring and increased false detections caused by high compression rates in deepfake videos, as well as the high computational resource requirements of existing face extractors, we propose a lightweight face extractor to assist deepfake detection, GCS-YOLOv8. Firstly, we employ the HGStem module for initial downsampling to address the issue of false detections of small non-face objects in deepfake videos, thereby improving detection accuracy. Secondly, we introduce the C2f-GDConv module to mitigate the low-FLOPs pitfall while reducing the model’s parameters, thereby lightening the network. Additionally, we add a new P6 large target detection layer to expand the receptive field and capture multi-scale features, solving the problem of detecting large-scale faces in low-compression deepfake videos. We also design a cross-scale feature fusion module called CCFG (CNN-based Cross-Scale Feature Fusion with GDConv), which integrates features from different scales to enhance the model’s adaptability to scale variations while reducing network parameters, addressing the high computational resource requirements of traditional face extractors. Furthermore, we improve the detection head by utilizing group normalization and shared convolution, simplifying the process of face detection while maintaining detection performance. The training dataset was also refined by removing low-accuracy and low-resolution labels, which reduced the false detection rate. Experimental results demonstrate that, compared to YOLOv8, this face extractor achieves the AP of 0.942, 0.927, and 0.812 on the WiderFace dataset’s Easy, Medium, and Hard subsets, representing improvements of 1.1%, 1.3%, and 3.7% respectively. The model’s parameters and FLOPs are only 1.68 MB and 3.5 G, reflecting reductions of 44.2% and 56.8%, making it more effective and lightweight in extracting faces from deepfake videos. Full article
(This article belongs to the Section Intelligent Sensors)
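The simplified detection head described above, which shares one convolution stack across pyramid levels and uses group normalization, can be sketched roughly as below; the channel width, the number of output maps, and the two-layer depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedGNHead(nn.Module):
    """One conv stack with GroupNorm reused across every pyramid level (illustrative)."""
    def __init__(self, channels=64, num_outputs=5):     # e.g. 4 box offsets + 1 face score
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GroupNorm(16, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.GroupNorm(16, channels), nn.SiLU(),
        )
        self.pred = nn.Conv2d(channels, num_outputs, 1)

    def forward(self, feats):
        # the same weights process every scale, so parameters do not grow with pyramid depth
        return [self.pred(self.shared(f)) for f in feats]

# usage: outs = SharedGNHead()([torch.randn(1, 64, s, s) for s in (80, 40, 20)])
```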
Figures:
Figure 1. Structure of YOLOv8.
Figure 2. Structure of GCS-YOLOv8.
Figure 3. Structure of the HGStem.
Figure 4. Structure of the C2f-GDConv.
Figure 5. Structure of the Detect head and the GSCD.
Figure 6. Comparison of detection effects on the WiderFace test sets.
Figure 7. Comparison of detection effects on Celeb-DF-v2 and FF++.
16 pages, 7311 KiB  
Article
Vehicle Localization Method in Complex SAR Images Based on Feature Reconstruction and Aggregation
by Jinwei Han, Lihong Kang, Jing Tian, Mingyong Jiang and Ningbo Guo
Sensors 2024, 24(20), 6746; https://doi.org/10.3390/s24206746 - 20 Oct 2024
Viewed by 670
Abstract
Due to the small size of vehicle targets, complex background environments, and the discrete scattering characteristics of high-resolution synthetic aperture radar (SAR) images, existing deep learning networks face challenges in extracting high-quality vehicle features from SAR images, which impacts vehicle localization accuracy. To address this issue, this paper proposes a vehicle localization method for SAR images based on feature reconstruction and aggregation with rotating boxes. Specifically, our method first employs a backbone network that integrates the space-channel reconfiguration module (SCRM), which contains spatial and channel attention mechanisms specifically designed for SAR images to extract features. The network then connects a progressive cross-fusion mechanism (PCFM) that effectively combines multi-view features from different feature layers, enhancing the information content of feature maps and improving feature representation quality. Finally, these features containing a large receptive field region and enhanced rich contextual information are input into a rotating box vehicle detection head, which effectively reduces false alarms and missed detections. Experiments on a complex scene SAR image vehicle dataset demonstrate that the proposed method significantly improves vehicle localization accuracy. Our method achieves state-of-the-art performance, which demonstrates the superiority and effectiveness of the proposed method. Full article
(This article belongs to the Special Issue Intelligent SAR Target Detection and Recognition)
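The space-channel attention idea in SCRM (re-weighting SAR features first along channels and then spatially) can be illustrated with a compact CBAM-style sketch; the pooling choices and the 7x7 spatial kernel are assumptions, and the exact SCRM design for SAR imagery is not reproduced here.

```python
import torch
import torch.nn as nn

class SpaceChannelAttention(nn.Module):
    """Channel attention followed by spatial attention (illustrative, CBAM-style)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        w_c = self.channel_mlp(x.mean(dim=(2, 3)))[:, :, None, None]   # channel weights from global pooling
        x = x * w_c
        s = torch.cat([x.mean(dim=1, keepdim=True),                    # spatial map from mean and max
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial(s)

# usage: y = SpaceChannelAttention(256)(torch.randn(1, 256, 64, 64))
```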
Figures:
Figure 1. The pipeline of the proposed network: a space-channel reconstruction module is inserted into the backbone, a new progressive cross-fusion mechanism is designed, and a feature aggregation module is inserted into it.
Figure 2. Space-channel reconstruction module. (a) Overall structure of the space-channel reconstruction module; (b) spatial attention; (c) channel attention.
Figure 3. Overall structure of the progressive cross-fusion mechanism (PCFM). (a) Structure of the PCFM; (b) structure of the FAM.
Figure 4. Rotating box vehicle detection head.
Figure 5. Partial samples of the self-made dataset and Mix MSTAR. (a,b) are from the self-made dataset; (c,d) are from Mix MSTAR.
Figure 6. Comparison of experimental results of different methods on our SAR vehicle dataset. (a) has vegetation interference; (b,c) have strong scattering from buildings; (d) is pure background. Blue boxes represent ground truths, green boxes denote detected vehicles, false alarms are circled in red, and missed vehicles are circled in yellow. The first row shows the ground truth; the second to ninth rows show the detection results of Rotated Faster R-CNN, Gliding Vertex, KLD, GWD, S²A-Net, Oriented RepPoints, KFIoU, and our method.
18 pages, 5855 KiB  
Article
Scalability Analysis of LoRa and Sigfox in Congested Environment and Calculation of Optimum Number of Nodes
by Mandeep Malik, Ashwin Kothari and Rashmi Pandhare
Sensors 2024, 24(20), 6673; https://doi.org/10.3390/s24206673 - 17 Oct 2024
Viewed by 815
Abstract
Low-power wide area network (LPWAN) technologies, as part of the IoT, are gaining considerable attention because they provide affordable communication over large areas. LoRa and Sigfox, as part of LPWAN, have emerged as highly effective and promising non-3GPP unlicensed-band IoT technologies, challenging the supremacy of cellular technologies for machine-to-machine (M2M) use cases. This paper presents the design goals of LoRa and Sigfox and examines their suitability in congested environments. A practical traffic generator for both LoRa and Sigfox is introduced and further used to model the simultaneous operation of 100 to 10,000 such nodes in close vicinity, establishing a detailed understanding of the effects of collisions, re-transmissions, and link behaviour. Previous work in this field has overlooked simultaneous deployment, collision issues, the effects of re-transmission, and the propagation profile when estimating the number of successful receptions. This work uses packet error rate (PER) and delivery ratio, which are the appropriate metrics for quantifying successful transmissions. The obtained results show that a maximum of 100 LoRa and 200 Sigfox nodes can be deployed in a fixed-transmission use case over an area of up to 1 km. As part of the future scope, solutions are suggested to increase the effectiveness of LoRa and Sigfox networks. Full article
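The collision analysis can be reproduced in outline with a pure-ALOHA-style Monte-Carlo sketch: nodes transmit at uniformly random times on one channel, any airtime overlap counts as a collision, and the delivery ratio is the fraction of packets received without a clash. The default airtime (roughly the 36 ms SF7 case mentioned in the figure captions below), the reporting period, and the single-gateway assumption are illustrative, not the paper's exact radio parameters.

```python
import random

def delivery_ratio(num_nodes, airtime_s=0.036, period_s=600.0, sim_time_s=3600.0, seed=0):
    """Monte-Carlo delivery ratio for unslotted random access on a single channel (illustrative)."""
    random.seed(seed)
    # every node sends one packet per period at a uniformly random instant
    starts = sorted(random.uniform(0, sim_time_s)
                    for _ in range(num_nodes)
                    for _ in range(int(sim_time_s / period_s)))
    delivered = 0
    for i, t in enumerate(starts):
        # with equal airtimes, only the neighbours in the sorted start list can overlap this packet
        clash = any(abs(starts[j] - t) < airtime_s for j in (i - 1, i + 1) if 0 <= j < len(starts))
        delivered += not clash
    return delivered / max(len(starts), 1)

for n in (100, 500, 1000):
    print(n, "nodes -> delivery ratio", round(delivery_ratio(n), 3))
```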
Figures:
Figure 1. Deployment architecture of LoRa and Sigfox.
Figure 2. Two LoRa radios mounted on Arduino UNO with monopole antennas and a Dragino single-channel LoRa gateway.
Figure 3. (a) Two LoRa devices based on ESP32; (b) Sigfox device TD1207.
Figure 4. Spectrum analyser output at 868 MHz of the LoRa radios shown in Figure 2.
Figure 5. Test setup of the LoRa devices shown in Figure 2 and Figure 3a,b.
Figure 6. LoRa packet collision simulation with 1000 devices transmitting randomly.
Figure 7. LoRa packet collision simulation with 5000 devices.
Figure 8. Sigfox simulation for 5000 devices with three gateways.
Figure 9. Sigfox simulation for 10,000 devices with three gateways.
Figure 10. Sigfox simulation for 10,000 devices with one gateway, i.e., not utilizing spatial diversity.
Figure 11. LoRa PER and collisions while using SF7, best case (36 ms).
Figure 12. LoRa PER and collisions while using SF12, worst case (682 ms).
17 pages, 3107 KiB  
Article
CL-YOLOv8: Crack Detection Algorithm for Fair-Faced Walls Based on Deep Learning
by Qinjun Li, Guoyu Zhang and Ping Yang
Appl. Sci. 2024, 14(20), 9421; https://doi.org/10.3390/app14209421 - 16 Oct 2024
Viewed by 849
Abstract
Cracks pose a critical challenge in the preservation of historical buildings worldwide, particularly in fair-faced walls, where timely and accurate detection is essential to prevent further degradation. Traditional image processing methods have proven inadequate for effectively detecting building cracks. Despite global advancements in deep learning, crack detection under diverse environmental and lighting conditions remains a significant technical hurdle, as highlighted by recent international studies. To address this challenge, we propose an enhanced crack detection algorithm, CL-YOLOv8 (ConvNeXt V2-LSKA-YOLOv8). By integrating the well-established ConvNeXt V2 model as the backbone network into YOLOv8, the algorithm benefits from advanced feature extraction techniques, leading to a superior detection accuracy. This choice leverages ConvNeXt V2’s recognized strengths, providing a robust foundation for improving the overall model performance. Additionally, by introducing the LSKA (Large Separable Kernel Attention) mechanism into the SPPF structure, the feature receptive field is enlarged and feature correlations are strengthened, further enhancing crack detection accuracy in diverse environments. This study also contributes to the field by significantly expanding the dataset for fair-faced wall crack detection, increasing its size sevenfold through data augmentation and the inclusion of additional data. Our experimental results demonstrate that CL-YOLOv8 outperforms mainstream algorithms such as Faster R-CNN, YOLOv5s, YOLOv7-tiny, SSD, and various YOLOv8n/s/m/l/x models. CL-YOLOv8 achieves an accuracy of 85.3%, a recall rate of 83.2%, and a mean average precision (mAP) of 83.7%. Compared to the YOLOv8n base model, CL-YOLOv8 shows improvements of 0.9%, 2.3%, and 3.9% in accuracy, recall rate, and mAP, respectively. These results underscore the effectiveness and superiority of CL-YOLOv8 in crack detection, positioning it as a valuable tool in the global effort to preserve architectural heritage. Full article
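For context, the SPPF structure that LSKA is attached to applies three successive max-pools to the same feature map and concatenates the results, approximating pooling over 5/9/13 windows at low cost. The sketch below follows the standard YOLOv8 SPPF layout with simplified channel handling; an LSKA module would then re-weight the fused output.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: stacked 5x5 max-pools emulate 5/9/13 pooling windows."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)      # 5x5 window
        p2 = self.pool(p1)     # effectively a 9x9 window
        p3 = self.pool(p2)     # effectively a 13x13 window
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))

# usage: y = SPPF(512, 512)(torch.randn(1, 512, 20, 20))
```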
Figures:
Figure 1. System diagram of YOLOv8n.
Figure 2. System diagram comparing the LKA and LSKA modules.
Figure 3. LSKA module.
Figure 4. Diagram of the LSKA module structure.
Figure 5. Comparison of the ConvNeXt V1 and V2 modules.
Figure 6. Improved system architecture diagram.
Figure 7. Comparison of P-R curves before and after improvement.
Figure 8. Comparison of F1-confidence curves before and after improvement.
Figure 9. Detection effect of different models: YOLOv5s (a), YOLOv7-tiny (b), YOLOv8n (c), and CL-YOLOv8 (d).
36 pages, 17153 KiB  
Article
YOLO-RWY: A Novel Runway Detection Model for Vision-Based Autonomous Landing of Fixed-Wing Unmanned Aerial Vehicles
by Ye Li, Yu Xia, Guangji Zheng, Xiaoyang Guo and Qingfeng Li
Drones 2024, 8(10), 571; https://doi.org/10.3390/drones8100571 - 10 Oct 2024
Viewed by 1202
Abstract
In scenarios where global navigation satellite systems (GNSSs) and radio navigation systems are denied, vision-based autonomous landing (VAL) for fixed-wing unmanned aerial vehicles (UAVs) becomes essential. Accurate and real-time runway detection in VAL is vital for providing precise positional and orientational guidance. However, existing research faces significant challenges, including insufficient accuracy, inadequate real-time performance, poor robustness, and high susceptibility to disturbances. To address these challenges, this paper introduces a novel single-stage, anchor-free, and decoupled vision-based runway detection framework, referred to as YOLO-RWY. First, an enhanced data augmentation (EDA) module is incorporated to perform various augmentations, enriching image diversity, and introducing perturbations that improve generalization and safety. Second, a large separable kernel attention (LSKA) module is integrated into the backbone structure to provide a lightweight attention mechanism with a broad receptive field, enhancing feature representation. Third, the neck structure is reorganized as a bidirectional feature pyramid network (BiFPN) module with skip connections and attention allocation, enabling efficient multi-scale and across-stage feature fusion. Finally, the regression loss and task-aligned learning (TAL) assigner are optimized using efficient intersection over union (EIoU) to improve localization evaluation, resulting in faster and more accurate convergence. Comprehensive experiments demonstrate that YOLO-RWY achieves AP50:95 scores of 0.760, 0.611, and 0.413 on synthetic, real nominal, and real edge test sets of the landing approach runway detection (LARD) dataset, respectively. Deployment experiments on an edge device show that YOLO-RWY achieves an inference speed of 154.4 FPS under FP32 quantization with an image size of 640. The results indicate that the proposed YOLO-RWY model possesses strong generalization and real-time capabilities, enabling accurate runway detection in complex and challenging visual environments, and providing support for the onboard VAL systems of fixed-wing UAVs. Full article
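The efficient IoU (EIoU) criterion used for the regression loss and the TAL assigner augments IoU with penalties on the center distance and on the width and height gaps, each normalized by the smallest enclosing box. A minimal sketch for axis-aligned (x1, y1, x2, y2) boxes:

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU = 1 - IoU + normalized center distance + normalized width and height gaps."""
    # pred, target: (N, 4) boxes as x1, y1, x2, y2
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # smallest box enclosing both prediction and target
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # squared center distance plus width and height differences, each normalized by the enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps) + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps)

# usage: eiou_loss(torch.tensor([[0., 0., 2., 2.]]), torch.tensor([[1., 1., 3., 3.]]))
```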
Figures:
Figure 1. Schematic diagram of the proposed runway detection model and framework.
Figure 2. Schematic diagram of YOLOv8.
Figure 3. Schematic diagram of YOLO-RWY.
Figure 4. Structure of the LSKA module.
Figure 5. Structure of the BiFPN module.
Figure 6. Schematic diagram of EIoU.
Figure 7. Distribution of airport geospatial locations.
Figure 8. Distribution of aircraft positions: (a) vertical profile along the glide slope; (b) horizontal profile along the localizer.
Figure 9. Distribution of bounding box characteristics: (a) areas; (b) aspect ratios. The orange dashed line represents the mean and the green solid line the median.
Figure 10. Distribution of normalized bounding box centers.
Figure 11. Convergence of the proposed YOLO-RWY: (a) training loss; (b) validation loss; (c) validation metrics.
Figure 12. Variation of accuracy with landing distance and time to landing: (a) YOLOv8n; (b) proposed model.
Figure 13. Runway detections under different edge scenarios: (a) rain; (b) snow; (c) fog; (d) backlight; (e) low light.
Figure 14. Runway detections with enlarged views of the predicted bounding boxes.
Figure A1. Enhanced samples under diverse weather conditions: (a) original image; (b) rain; (c) snow; (d) fog; (e) shadow; (f) sun flare.
Figure A2. Enhanced samples using the EDA module: (a) original image; (b-f) augmented images.
20 pages, 6554 KiB  
Article
An Efficient UAV Image Object Detection Algorithm Based on Global Attention and Multi-Scale Feature Fusion
by Rui Qian and Yong Ding
Electronics 2024, 13(20), 3989; https://doi.org/10.3390/electronics13203989 - 10 Oct 2024
Viewed by 1356
Abstract
Object detection technology holds significant promise in unmanned aerial vehicle (UAV) applications. However, traditional methods face challenges in detecting denser, smaller, and more complex targets within UAV aerial images. To address issues such as target occlusion and dense small objects, this paper proposes a multi-scale object detection algorithm based on YOLOv5s. A novel feature extraction module, DCNCSPELAN4, which combines CSPNet and ELAN, is introduced to enhance the receptive field of feature extraction while maintaining network efficiency. Additionally, a lightweight Vision Transformer module, the CloFormer Block, is integrated to provide the network with a global receptive field. Moreover, the algorithm incorporates a three-scale feature fusion (TFE) module and a scale sequence feature fusion (SSFF) module in the neck network to effectively leverage multi-scale spatial information across different feature maps. To address dense small objects, an additional small object detection head was added to the detection layer. The original large object detection head was removed to reduce computational load. The proposed algorithm has been evaluated through ablation experiments and compared with other state-of-the-art methods on the VisDrone2019 and AU-AIR datasets. The results demonstrate that our algorithm outperforms other baseline methods in terms of both accuracy and speed. Compared to the YOLOv5s baseline model, the enhanced algorithm achieves improvements of 12.4% and 8.4% in AP50 and AP metrics, respectively, with only a marginal parameter increase of 0.3 M. These experiments validate the effectiveness of our algorithm for object detection in drone imagery. Full article
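A rough sketch of how a CSP/ELAN hybrid block of this kind is typically assembled: the input is split channel-wise, one part flows through a chain of convolution blocks whose intermediate outputs are all retained, and everything is concatenated and fused. The depth and widths below are illustrative and omit the deformable convolutions that DCNCSPELAN4 additionally uses.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class CSPELAN(nn.Module):
    """Split the channels, run one half through stacked blocks, concatenate every stage (illustrative)."""
    def __init__(self, c_in, c_out, depth=2):
        super().__init__()
        c_mid = c_in // 2                      # assumes an even channel count
        self.stem = conv_bn_act(c_in, c_in, 1)
        self.blocks = nn.ModuleList([conv_bn_act(c_mid, c_mid) for _ in range(depth)])
        self.fuse = conv_bn_act(c_in + depth * c_mid, c_out, 1)

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)    # cross-stage partial split
        outs = [a, b]
        for block in self.blocks:              # ELAN-style: keep every intermediate feature
            outs.append(block(outs[-1]))
        return self.fuse(torch.cat(outs, dim=1))

# usage: y = CSPELAN(128, 128)(torch.randn(1, 128, 40, 40))
```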
Figures:
Figure 1. Sample images taken from UAVs.
Figure 2. Improved YOLOv5 network structure.
Figure 3. Structure of (a) CSPNet, (b) ELAN, and (c) CSPELAN. CSPELAN extends the convolution module in ELAN to arbitrary computable modules, modeled after CSPNet.
Figure 4. Overall structure of DCNCSPELAN4. In the DCN structure, the gray grid simulates the distribution of targets in aerial photography, while the solid and hollow circles represent the receptive fields of DCN and regular convolutions in UAV images, respectively.
Figure 5. Structure of the CloFormer Block, consisting of a global branch and a local branch.
Figure 6. Structure of the three-scale feature fusion operation: (a) SSFF; (b) TFE.
Figure 7. Experimental results for all categories of VisDrone2019-test.
Figure 8. Confusion matrices of (a) the original YOLOv5s model, (b) TPH-YOLOv5, and (c) the improved YOLOv5s model.
Figure 9. Performance of the original YOLOv5, TPH-YOLOv5, and the proposed algorithm on AU-AIR and VisDrone2019-test. (a) Original YOLOv5; (b) TPH-YOLOv5; (c) the proposed algorithm.
Figure 10. Detection effect of the improved YOLOv5s in the urban traffic supervision scenario. (a) Daytime urban traffic; (b) nighttime urban traffic.
Figure 11. Detection effect of the improved YOLOv5s in the urban streets supervision scenario. (a) Daytime urban streets; (b) nighttime urban streets.
21 pages, 10847 KiB  
Article
DLCH-YOLO: An Object Detection Algorithm for Monitoring the Operation Status of Circuit Breakers in Power Scenarios
by Riben Shu, Lihua Chen, Lumei Su, Tianyou Li and Fan Yin
Electronics 2024, 13(19), 3949; https://doi.org/10.3390/electronics13193949 - 7 Oct 2024
Viewed by 757
Abstract
In the scenario of power system monitoring, detecting the operating status of circuit breakers is often inaccurate due to variable object scales and background interference. This paper introduces DLCH-YOLO, an object detection algorithm aimed at identifying the operating status of circuit breakers. Firstly, we propose a novel C2f_DLKA module based on Deformable Large Kernel Attention. This module adapts to objects of varying scales within a large receptive field, thereby more effectively extracting multi-scale features. Secondly, we propose a Semantic Screening Feature Pyramid Network designed to fuse multi-scale features. By filtering low-level semantic information, it effectively suppresses background interference to enhance localization accuracy. Finally, the feature extraction network incorporates Generalized-Sparse Convolution, which combines depth-wise separable convolution and channel mixing operations, reducing computational load. The DLCH-YOLO algorithm achieved a 91.8% mAP on our self-built power equipment dataset, representing a 4.7% improvement over the baseline network Yolov8. With its superior detection accuracy and real-time performance, DLCH-YOLO outperforms mainstream detection algorithms. This algorithm provides an efficient and viable solution for circuit breaker status detection. Full article
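The Generalized-Sparse Convolution described above (a depthwise separable convolution combined with a channel-mixing step) can be sketched roughly as follows, using the common recipe of half dense convolution, half depthwise convolution, and a channel shuffle; the exact variant used in DLCH-YOLO may differ.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Half dense conv + half depthwise conv, then a channel shuffle to mix the two halves (illustrative)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
                                   nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
                                   nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        a = self.dense(x)                      # standard convolution produces half the channels
        b = self.cheap(a)                      # depthwise convolution produces the other half cheaply
        y = torch.cat([a, b], dim=1)
        n, c, h, w = y.shape                   # channel shuffle: interleave the dense and depthwise halves
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

# usage: out = GSConv(64, 128)(torch.randn(1, 64, 80, 80))
```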
Figures:
Figure 1. Architecture of the DLCH-YOLO network.
Figure 2. C2f module in the feature extraction network. (a) Base module; (b) improved module introducing large kernel convolution; (c) combination of the improved DLKABottle module with C2f.
Figure 3. Generalized Sparse Convolution.
Figure 4. Framework of the Semantic Screening Feature Pyramid Network.
Figure 5. Power equipment dataset. Due to confidentiality requirements, the images have been desensitized.
Figure 6. Ablation visualization. All images have been desensitized.
Figure 7. Detection results of the improved method on the power equipment dataset for identifying the working status of circuit breakers. (a-e) Detection results from different monitoring perspectives; (f) a complex background.
20 pages, 3829 KiB  
Article
Beyond Granularity: Enhancing Continuous Sign Language Recognition with Granularity-Aware Feature Fusion and Attention Optimization
by Yao Du, Taiying Peng and Xiaohui Hu
Appl. Sci. 2024, 14(19), 8937; https://doi.org/10.3390/app14198937 - 4 Oct 2024
Viewed by 757
Abstract
The advancement of deep learning techniques has significantly propelled the development of the continuous sign language recognition (cSLR) task. However, the spatial feature extraction of sign language videos in the RGB space tends to focus on the overall image information while neglecting the perception of traits at different granularities, such as eye gaze and lip shape, which are more detailed, or posture and gestures, which are more macroscopic. Exploring the efficient fusion of visual information of different granularities is crucial for accurate sign language recognition. In addition, applying a vanilla Transformer to sequence modeling in cSLR exhibits weak performance because specific video frames could interfere with the attention mechanism. These limitations constrain the capability to understand potential semantic characteristics. We introduce a feature fusion method for integrating visual features of disparate granularities and refine the metric of attention to enhance the Transformer’s comprehension of video content. Specifically, we extract CNN feature maps with varying receptive fields and employ a self-attention mechanism to fuse feature maps of different granularities, thereby obtaining multi-scale spatial features of the sign language framework. As for video modeling, we first analyze why the vanilla Transformer failed in cSLR and observe that the magnitude of the feature vectors of video frames could interfere with the distribution of attention weights. Therefore, we utilize the Euclidean distance among vectors to measure the attention weights instead of scaled-dot to enhance dynamic temporal modeling capabilities. Finally, we integrate the two components to construct the model MSF-ET (Multi-Scaled feature Fusion–Euclidean Transformer) for cSLR and train the model end-to-end. We perform experiments on large-scale cSLR benchmarks—PHOENIX-2014 and Chinese Sign Language (CSL)—to validate the effectiveness. Full article
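The core change to the Transformer, replacing scaled-dot-product scores with negative Euclidean distances between query and key vectors so that frames with large feature magnitudes no longer dominate the softmax, can be sketched as below; the temperature tau is an illustrative assumption, and the local-window restriction is omitted.

```python
import torch
import torch.nn.functional as F

def euclidean_attention(q, k, v, tau=1.0):
    """Attention weights from negative squared Euclidean distances instead of scaled dot products."""
    # q, k, v: (B, T, C) per-frame features
    dist2 = torch.cdist(q, k, p=2) ** 2          # pairwise squared distances, shape (B, T, T)
    weights = F.softmax(-dist2 / tau, dim=-1)    # closer frames receive larger weights,
    return weights @ v                           # so large-magnitude frames cannot dominate

# usage: ctx = euclidean_attention(*(torch.randn(2, 120, 512) for _ in range(3)))
```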
Figures:
Figure 1. Most current video spatial representation methods for cSLR extract features with a pre-trained CNN backbone (left in the figure). Although this extracts high-level semantic information, it lacks perception of details such as mouth shape and gaze, which are important for understanding sign language. The proposed multi-scale feature fusion method based on a self-attention mechanism (right in the figure) enables more comprehensive extraction of semantic information.
Figure 2. Overall model architecture. The proposed MSF-ET model consists of three main components: a spatial encoder, a feature fusion module, and a temporal encoder. The spatial encoder is composed of multiple 2D convolutional layers followed by max-pooling to downsample feature maps with different receptive fields. The feature fusion module uses a self-attention mechanism to fuse the multi-scaled features of the frames. The temporal encoder is composed of an encoder based on Euclidean-distance self-attention and a local Transformer layer; it learns the contextual information of the video and the local features for gloss alignment. Finally, connectionist temporal classification (CTC) is used to train the model and decode the gloss sequences.
Figure 3. Multi-scaled feature integration and fusion. The spatial encoder outputs feature maps of sizes 3 and 7. These feature maps are first flattened into 1D vectors, a special [cls] token is added to the head of each vector (similar to ViT), the flattened vectors are added to trainable position embeddings, and a Transformer encoder obtains the global context of both feature maps. Finally, the two [cls] tokens are concatenated to achieve multi-scale feature fusion.
Figure 4. Demo of the attention map of a vanilla Transformer. The heatmap denotes the attention scores, and the bar above it shows the magnitudes of the key vectors. The distribution of attention weights is overly concentrated in regions where the key vectors have larger magnitudes, drowning out information from other positions and hindering the Transformer's ability to comprehend the global information in the sequence.
Figure 5. Detail of self-attention with Euclidean distance and a local window. Assuming a window size of 5, every frame interacts with the others through the attention mechanism within the window centered on itself.
Figure 6. CAM visualization of attention weights corresponding to feature maps at different scales, applied to videos of different sign language performers to demonstrate the generalizability of the results. (A) is from '01April_2010_Thursday_heute_default-1' and (B) from '03November_2010_Wednesday_tagesschau_default-7' in the PHOENIX-2014 validation set.
Figure 7. Example of attention weight visualization for a Transformer using the Euclidean distance-based metric (same sample data as Figure 4). The use of Euclidean distance significantly alleviates attention sparsity.
Figure 8. Relationship between inference time and video sequence length during model inference.