Search Results (541)

Search Parameters: Keywords = dilated Convolution

21 pages, 3921 KiB  
Article
CFF-Net: Cross-Hierarchy Feature Fusion Network Based on Composite Dual-Channel Encoder for Surface Defect Segmentation
by Ke’er Qian, Xiaokang Ding, Xiaoliang Jiang, Yingyu Ji and Ling Dong
Electronics 2024, 13(23), 4714; https://doi.org/10.3390/electronics13234714 - 28 Nov 2024
Abstract
In industries spanning manufacturing to software development, defect segmentation is essential for maintaining high standards of product quality and reliability. However, traditional segmentation methods often struggle to accurately identify defects due to challenges like noise interference, occlusion, and feature overlap. To solve these problems, we propose a cross-hierarchy feature fusion network based on a composite dual-channel encoder for surface defect segmentation, called CFF-Net. Specifically, in the encoder of CFF-Net, we design a composite dual-channel module (CDCM), which combines standard convolution with dilated convolution and adopts a dual-path parallel structure to enhance the model’s capability in feature extraction. Then, a dilated residual pyramid module (DRPM) is integrated at the junction of the encoder and decoder, which utilizes the expansion convolution of different expansion rates to effectively capture multi-scale context information. In the final output phase, we introduce a cross-hierarchy feature fusion strategy (CFFS) that combines outputs from different layers or stages, thereby improving the robustness and generalization of the network. Finally, we conducted comparative experiments to evaluate CFF-Net against several mainstream segmentation networks across three distinct datasets: a publicly available Crack500 dataset, a self-built Bearing dataset, and another publicly available SD-saliency-900 dataset. The results demonstrated that CFF-Net consistently outperformed competing methods in segmentation tasks. Specifically, in the Crack500 dataset, CFF-Net achieved notable performance metrics, including an Mcc of 73.36%, Dice coefficient of 74.34%, and Jaccard index of 59.53%. For the Bearing dataset, it recorded an Mcc of 76.97%, Dice coefficient of 77.04%, and Jaccard index of 63.28%. Similarly, in the SD-saliency-900 dataset, CFF-Net achieved an Mcc of 84.08%, Dice coefficient of 85.82%, and Jaccard index of 75.67%. These results underscore CFF-Net’s effectiveness and reliability in handling diverse segmentation challenges across different datasets. Full article
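As a rough illustration of the dilated residual pyramid idea mentioned in this abstract (parallel dilated convolutions at several rates, fused and added back to the input), a minimal PyTorch sketch might look like the following; the module name, channel widths, and dilation rates are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DilatedPyramid(nn.Module):
    """Parallel dilated convolutions at several rates, fused and added
    back to the input (residual). A generic sketch, not CFF-Net's DRPM."""
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding = rate keeps the spatial size for a 3x3 kernel
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(y)   # residual connection

x = torch.randn(1, 64, 32, 32)
print(DilatedPyramid(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```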
19 pages, 780 KiB  
Article
Transformer Dil-DenseUnet: An Advanced Architecture for Stroke Segmentation
by Nesrine Jazzar, Besma Mabrouk and Ali Douik
J. Imaging 2024, 10(12), 304; https://doi.org/10.3390/jimaging10120304 - 25 Nov 2024
Viewed by 310
Abstract
We propose a novel architecture, Transformer Dil-DenseUNet, designed to address the challenges of accurately segmenting stroke lesions in MRI images. Precise segmentation is essential for diagnosing and treating stroke patients, as it provides critical spatial insights into the affected brain regions and the extent of damage. Traditional manual segmentation is labor-intensive and error-prone, highlighting the need for automated solutions. Our Transformer Dil-DenseUNet combines DenseNet, dilated convolutions, and Transformer blocks, each contributing unique strengths to enhance segmentation accuracy. The DenseNet component captures fine-grained details and global features by leveraging dense connections, improving both precision and feature reuse. The dilated convolutional blocks, placed before each DenseNet module, expand the receptive field, capturing broader contextual information essential for accurate segmentation. Additionally, the Transformer blocks within our architecture address CNN limitations in capturing long-range dependencies by modeling complex spatial relationships through multi-head self-attention mechanisms. We assess our model’s performance on the Ischemic Stroke Lesion Segmentation Challenge 2015 (SISS 2015) and ISLES 2022 datasets. In the testing phase, the model achieves a Dice coefficient of 0.80 ± 0.30 on SISS 2015 and 0.81 ± 0.33 on ISLES 2022, surpassing the current state-of-the-art results on these datasets. Full article
(This article belongs to the Special Issue Advances in Medical Imaging and Machine Learning)
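A minimal sketch of the general pattern described here, a dilated convolution placed in front of a DenseNet-style block to widen its receptive field, is shown below; the layer count, growth rate, and dilation rate are illustrative assumptions rather than the Transformer Dil-DenseUNet configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Tiny DenseNet-style block: each layer sees all previous feature maps."""
    def __init__(self, in_ch, growth=16, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, 3, padding=1, bias=False)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class DilatedDenseStage(nn.Module):
    """A dilated conv in front of a dense block widens the receptive field."""
    def __init__(self, in_ch, dilation=2):
        super().__init__()
        self.dilated = nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation)
        self.dense = DenseBlock(in_ch)

    def forward(self, x):
        return self.dense(torch.relu(self.dilated(x)))

x = torch.randn(1, 32, 64, 64)
print(DilatedDenseStage(32)(x).shape)  # torch.Size([1, 80, 64, 64])
```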
30 pages, 2346 KiB  
Article
A Novel Method for 3D Lung Tumor Reconstruction Using Generative Models
by Hamidreza Najafi, Kimia Savoji, Marzieh Mirzaeibonehkhater, Seyed Vahid Moravvej, Roohallah Alizadehsani and Siamak Pedrammehr
Diagnostics 2024, 14(22), 2604; https://doi.org/10.3390/diagnostics14222604 - 20 Nov 2024
Viewed by 545
Abstract
Background: Lung cancer remains a significant health concern, and the effectiveness of early detection significantly enhances patient survival rates. Identifying lung tumors with high precision is a challenge due to the complex nature of tumor structures and the surrounding lung tissues. Methods: To address these hurdles, this paper presents an innovative three-step approach that leverages Generative Adversarial Networks (GAN), Long Short-Term Memory (LSTM), and VGG16 algorithms for the accurate reconstruction of three-dimensional (3D) lung tumor images. The first challenge we address is the accurate segmentation of lung tissues from CT images, a task complicated by the overwhelming presence of non-lung pixels, which can lead to classifier imbalance. Our solution employs a GAN model trained with a reinforcement learning (RL)-based algorithm to mitigate this imbalance and enhance segmentation accuracy. The second challenge involves precisely detecting tumors within the segmented lung regions. We introduce a second GAN model with a novel loss function that significantly improves tumor detection accuracy. Following successful segmentation and tumor detection, the VGG16 algorithm is utilized for feature extraction, preparing the data for the final 3D reconstruction. These features are then processed through an LSTM network and converted into a format suitable for the reconstructive GAN. This GAN, equipped with dilated convolution layers in its discriminator, captures extensive contextual information, enabling the accurate reconstruction of the tumor’s 3D structure. Results: The effectiveness of our method is demonstrated through rigorous evaluation against established techniques using the LIDC-IDRI dataset and standard performance metrics, showcasing its superior performance and potential for enhancing early lung cancer detection. Conclusions: This study highlights the benefits of combining GANs, LSTM, and VGG16 into a unified framework. This approach significantly improves the accuracy of detecting and reconstructing lung tumors, promising to enhance diagnostic methods and patient results in lung cancer treatment. Full article
(This article belongs to the Special Issue AI and Digital Health for Disease Diagnosis and Monitoring)
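As a hedged sketch of one element of this pipeline (the discriminator equipped with dilated convolution layers), the toy PyTorch module below widens the context behind each patch score with dilation rates 2 and 4; depth, channel widths, and rates are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class DilatedDiscriminator(nn.Module):
    """Generic GAN discriminator whose middle layers use dilated convolutions
    so each real/fake score depends on a larger context. Illustrative only."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),   # patch-level real/fake scores
        )

    def forward(self, x):
        return self.net(x)

print(DilatedDiscriminator()(torch.randn(2, 1, 64, 64)).shape)  # torch.Size([2, 1, 16, 16])
```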
Figure 1: Overview of the proposed model: In step 1, two lungs are segmented in CT images using the GAN-based model. In step 2, the tumor is detected using the second GAN-based model. After the features are extracted by VGG16, a 3D model of the tumor is reconstructed using the third GAN in step 3.
Figure 2: Architecture of the U-Net-based generator network used for lung segmentation, illustrating the flow from input CT scan through the encoder and decoder stages to the final mask output.
Figure 3: Comparative visualization of original CT scans and segmented lung regions by the proposed model.
Figure 4: (a) Optimization of the λ parameter for lung segmentation model performance; (b) learning trajectory of the reward optimization of the agent over 100 episodes.
Figure 5: Comparative analysis of tumor detection. The top row shows the ground truth, while the bottom row presents the model predictions. The black square in every sample shows the tumor extracted by the proposed model.
Figure 6: Comparison of original and reconstructed 3D tumor shapes highlighting the effectiveness of the proposed reconstruction method in preserving boundary smoothness.
Figure 7: Distribution of decision-making times for 3D reconstruction in RTB environments.
Figure 8: Loss trends in (a) lung segmentation, (b) tumor detection, and (c) 3D reconstruction models over 250 epochs.
Figure 9: HD metric trends in lung segmentation, tumor detection, and 3D reconstruction over 250 epochs.
19 pages, 18572 KiB  
Article
MSG-YOLO: A Lightweight Detection Algorithm for Clubbing Finger Detection
by Zhijie Wang, Qiao Meng, Feng Tang, Yuelin Qi, Bingyu Li, Xin Liu, Siyuan Kong and Xin Li
Electronics 2024, 13(22), 4549; https://doi.org/10.3390/electronics13224549 - 19 Nov 2024
Viewed by 458
Abstract
Clubbing finger is a significant clinical indicator, and its early detection is essential for the diagnosis and treatment of associated diseases. However, traditional diagnostic methods rely heavily on the clinician’s subjective assessment, which can be prone to biases and may lack standardized tools. Unlike other diagnostic challenges, the characteristic changes of clubbing finger are subtle and localized, necessitating high-precision feature extraction. Existing models often fail to capture these delicate changes accurately, potentially missing crucial diagnostic features or generating false positives. Furthermore, these models are often not suited for accurate clinical diagnosis in resource-constrained settings. To address these challenges, we propose MSG-YOLO, a lightweight clubbing finger detection model based on YOLOv8n, designed to enhance both detection accuracy and efficiency. The model first employs a multi-scale dilated residual module, which expands the receptive field using dilated convolutions and residual connections, thereby improving the model’s ability to capture features across various scales. Additionally, we introduce a Selective Feature Fusion Pyramid Network (SFFPN) that dynamically selects and enhances critical features, optimizing the flow of information while minimizing redundancy. To further refine the architecture, we reconstruct the YOLOv8 detection head with group normalization and shared-parameter convolutions, significantly reducing the model’s parameter count and increasing computational efficiency. Experimental results indicate that the model maintains high detection accuracy with reduced parameter and computational requirements. Compared to YOLOv8n, MSG-YOLO achieves a 48.74% reduction in parameter count and a 24.17% reduction in computational load, while improving the mAP0.5 score by 2.86%, reaching 93.64%. This algorithm strikes a balance between accuracy and lightweight design, offering efficient and reliable clubbing finger detection even in resource-constrained environments. Full article
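The multi-scale dilated residual module is described further in the Figure 2 caption below (three convolutions with dilation rates 1, 3, and 5). A lightweight PyTorch sketch of that general idea, using depthwise dilated branches and a pointwise fusion, might look as follows; the depthwise design and channel widths are assumptions, not the MSG-YOLO implementation.

```python
import torch
import torch.nn as nn

class MDRBlock(nn.Module):
    """Multi-scale dilated residual sketch: depthwise 3x3 convolutions with
    dilation rates 1, 3, and 5, fused by a pointwise convolution and added
    back to the input. Illustrative only."""
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                      groups=channels, bias=False)      # depthwise, keeps H x W
            for r in rates
        ])
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * len(rates), channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(y)   # residual connection

print(MDRBlock(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```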
Figure 1: The overall architecture of the improved YOLO model, comprising three primary components: Backbone, Neck, and Head. The Backbone module extracts multi-level features from the input image, the Neck module merges and processes multi-scale features, and the Head module produces the target detection results. The SPPF module aggregates multi-scale information through repeated MaxPool2d operations and concatenation, and the Conv module consists of Conv2d, BatchNorm2d, and the SiLU activation function.
Figure 2: Structure diagram of the C2f_MDR module. (a) The overall structure of the C2f_MDR module. The input feature map undergoes convolution, segmentation, and multiple MDR module processes, followed by concatenation and convolution to restore the channel number and generate the output feature map. (b) The detailed structure of the MDR module, which consists of three convolutional layers with different dilation rates (dilation rates of 1, 3, and 5) to capture contextual information at different scales.
Figure 3: Schematic diagram of the SFFPN module. In the Feature Selection module, channel attention (CA) and element-wise multiplication (⊗) are used to adaptively adjust the weights of the input features, followed by processing with a convolution layer (kernel size k = 1). The Feature Selection Fusion module then performs upsampling of features at different scales through convolution transpose (ConvTranspose), followed by concatenation (Concat) to fuse multi-scale features. The fused feature map is subsequently passed into the C2f module for further optimization, providing higher-quality features for object detection.
Figure 4: Structure diagram of the GNSCD module. Feature maps at different scales are first processed using group normalization convolution (Conv_GN), followed by shared convolution layers (with kernel sizes k = 1 and k = 5) to extract multi-scale features, thereby enhancing the model’s ability to detect objects at different scales.
Figure 5: Dataset example: image (a) displays a clubbed finger, while image (b) shows a normal finger. Characteristics of clubbed fingers include nail-fold angles greater than 180°, whereas normal fingers generally have nail-fold angles less than 180°.
Figure 6: Comparison of detection results between MSG-YOLO and YOLOv8n models. (a) Performance of the MSG-YOLO model in the clubbed finger detection task. (b) Results from the YOLOv8n model on the same task.
Figure 7: Detection results of MSG-YOLO on different samples. The left side (a) shows the bounding boxes for normal and clubbed finger samples, with each box labeling the detected category and confidence score. The right side (b) presents the corresponding heatmaps, highlighting the areas of focus for the model during detection. The high-intensity red and yellow regions indicate significant features that the model has identified in these areas.
17 pages, 12206 KiB  
Article
Smart Monitoring Method for Land-Based Sources of Marine Outfalls Based on an Improved YOLOv8 Model
by Shicheng Zhao, Haolan Zhou and Haiyan Yang
Water 2024, 16(22), 3285; https://doi.org/10.3390/w16223285 - 15 Nov 2024
Viewed by 436
Abstract
Land-based sources of marine outfalls are a major source of marine pollution. The monitoring of land-based sources of marine outfalls is an important means for marine environmental protection and governance. Traditional on-site manual monitoring methods are inefficient, expensive, and constrained by geographic conditions. Satellite remote sensing spectral analysis methods can only identify pollutant plumes and are affected by discharge timing and cloud/fog interference. Therefore, we propose a smart monitoring method for land-based sources of marine outfalls based on an improved YOLOv8 model, using unmanned aerial vehicles (UAVs). This method can accurately identify and classify marine outfalls, offering high practical application value. Inspired by the sparse sampling method in compressed sensing, we incorporated a multi-scale dilated attention mechanism into the model and integrated dynamic snake convolutions into the C2f module. This approach enhanced the model’s detection capability for occluded and complex-feature targets while constraining the increase in computational load. Additionally, we proposed a new loss calculation method by combining Inner-IoU (Intersection over Union) and MPDIoU (IoU with Minimum Points Distance), which further improved the model’s regression speed and its ability to predict multi-scale targets. The final experimental results show that the improved model achieved an mAP50 (mean Average Precision at 50) of 87.0%, representing a 3.4% increase from the original model, effectively enabling the smart monitoring of land-based marine discharge outlets. Full article
(This article belongs to the Section Oceans and Coastal Zones)
Figure 1: Zhanjiang city outlets point map. (a) “gully”, (b) “weir”, (c) “pipe”, (d) “culvert”, (e) “gully”, (f) “weir”, (g) “pipe”, (h) “culvert”.
Figure 2: YOLOv8 model structure.
Figure 3: MSDA mechanism structure. The red points represent the key positions of the convolutional kernel, the yellow area shows the dilation of the kernel at r = 1, the blue area shows the dilation at r = 2, and the green area shows the dilation at r = 3.
Figure 4: C2f module structure.
Figure 5: DSConv selectable receptive fields. The blue line represents the continuous shift of the convolutional kernel in the horizontal direction, while the red line represents the continuous shift of the convolutional kernel in the vertical direction.
Figure 6: Inner-MPDIoU diagram.
Figure 7: (a) Anchor box category number statistics; (b) anchor box position statistics. The color of each anchor box in (b) belongs to the same category as in (a).
Figure 8: (a) Normalized confusion matrix for the YOLOv8 model; (b) normalized confusion matrix for the YOLOv8+MSDA model.
Figure 9: (a) YOLOv8 model’s predicted results; (b) our model’s predicted results.
Figure 10: (a) P–R curve of the improved model; (b) P–R curve of the improved model after transfer learning.
Figure 11: Model training process.
20 pages, 11655 KiB  
Article
Variational Color Shift and Auto-Encoder Based on Large Separable Kernel Attention for Enhanced Text CAPTCHA Vulnerability Assessment
by Xing Wan, Juliana Johari and Fazlina Ahmat Ruslan
Information 2024, 15(11), 717; https://doi.org/10.3390/info15110717 - 7 Nov 2024
Viewed by 472
Abstract
Text CAPTCHAs are crucial security measures deployed on global websites to deter unauthorized intrusions. The presence of anti-attack features incorporated into text CAPTCHAs limits the effectiveness of evaluating them, despite CAPTCHA recognition being an effective method for assessing their security. This study introduces a novel color augmentation technique called Variational Color Shift (VCS) to boost the recognition accuracy of different networks. VCS generates a color shift of every input image and then resamples the image within that range to generate a new image, thus expanding the number of samples of the original dataset to improve training effectiveness. In contrast to Random Color Shift (RCS), which treats the color offsets as hyperparameters, VCS estimates color shifts by reparametrizing the points sampled from the uniform distribution using predicted offsets according to every image, which makes the color shifts learnable. To better balance the computation and performance, we also propose two variants of VCS: Sim-VCS and Dilated-VCS. In addition, to solve the overfitting problem caused by disturbances in text CAPTCHAs, we propose an Auto-Encoder (AE) based on Large Separable Kernel Attention (AE-LSKA) to replace the convolutional module with large kernels in the text CAPTCHA recognizer. This new module employs an AE to compress the interference while expanding the receptive field using Large Separable Kernel Attention (LSKA), reducing the impact of local interference on the model training and improving the overall perception of characters. The experimental results show that the recognition accuracy of the model after integrating the AE-LSKA module is improved by at least 15 percentage points on both M-CAPTCHA and P-CAPTCHA datasets. In addition, experimental results demonstrate that color augmentation using VCS is more effective in enhancing recognition, which has higher accuracy compared to RCS and PCA Color Shift (PCA-CS). Full article
(This article belongs to the Special Issue Computer Vision for Security Applications)
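To make the reparameterisation idea behind VCS concrete, the sketch below predicts a per-channel shift range from each image and reparameterises a uniform sample into that range so the shift stays differentiable; the head architecture, clamp, and shift magnitude are assumptions, and the Sim-VCS and Dilated-VCS variants are not modelled.

```python
import torch
import torch.nn as nn

class LearnableColorShift(nn.Module):
    """Sketch of a learnable per-image colour shift: a tiny head predicts a
    per-channel shift range, and a uniform sample is reparameterised into that
    range before being added to the image. Shows the reparameterisation idea
    only, not the paper's VCS design."""
    def __init__(self, max_shift=0.1):
        super().__init__()
        self.max_shift = max_shift
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(3, 6),        # predicts (low, high) per RGB channel
            nn.Tanh(),
        )

    def forward(self, img):                      # img: (B, 3, H, W) in [0, 1]
        lo, hi = self.head(img).chunk(2, dim=1)  # each (B, 3)
        lo, hi = lo * self.max_shift, hi * self.max_shift
        u = torch.rand_like(lo)                  # uniform noise
        shift = lo + u * (hi - lo)               # differentiable w.r.t. lo and hi
        return (img + shift[:, :, None, None]).clamp(0, 1)

aug = LearnableColorShift()
print(aug(torch.rand(4, 3, 64, 64)).shape)  # torch.Size([4, 3, 64, 64])
```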
Graphical abstract
Figure 1: The structure and training process of Deep-CAPTCHA.
Figure 2: Adaptive-CAPTCHA improved on Deep-CAPTCHA.
Figure 3: Samples of P-CAPTCHA.
Figure 4: Samples of M-CAPTCHA.
Figure 5: The flowchart of VCS.
Figure 6: The difference between VCS and Sim-VCS.
Figure 7: The dilated convolution-based sampling of Dilated-VCS.
Figure 8: The AE-LSKAs with different dilated kernels and receptive fields.
Figure 9: AASR comparison between VCS and RCS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 10: AASR comparison between VCS and RCS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 11: AASR comparison between VCS and PCA-CS using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 12: AASR comparison between VCS and PCA-CS using Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 13: Learning process of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 14: Validation accuracy comparison of AE-LSKAs with different dilated kernels integrated into Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 15: Validation accuracy comparison of AE-LSKAs with different kernels integrated into Deep-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
Figure 16: Individual character accuracy comparison using Adaptive-CAPTCHA: (a) M-CAPTCHA; (b) P-CAPTCHA.
20 pages, 7344 KiB  
Article
Research on a Joint Extraction Method of Track Circuit Entities and Relations Integrating Global Pointer and Tensor Learning
by Yanrui Chen, Guangwu Chen and Peng Li
Sensors 2024, 24(22), 7128; https://doi.org/10.3390/s24227128 - 6 Nov 2024
Viewed by 415
Abstract
To address the issue of efficiently reusing the massive amount of unstructured knowledge generated during the handling of track circuit equipment faults and to automate the construction of knowledge graphs in the railway maintenance domain, it is crucial to leverage knowledge extraction techniques to efficiently extract relational triplets from fault maintenance text data. Given the current lag in joint extraction technology within the railway domain and the inefficiency in resource utilization, this paper proposes a joint extraction model for track circuit entities and relations, integrating Global Pointer and tensor learning. Taking into account the associative characteristics of semantic relations, the nesting of domain-specific terms in the railway sector, and semantic diversity, this research views the relation extraction task as a tensor learning process and the entity recognition task as a span-based Global Pointer search process. First, a multi-layer dilate gated convolutional neural network with residual connections is used to extract key features and fuse the weighted information from the 12 different semantic layers of the RoBERTa-wwm-ext model, fully exploiting the performance of each encoding layer. Next, the Tucker decomposition method is utilized to capture the semantic correlations between relations, and an Efficient Global Pointer is employed to globally predict the start and end positions of subject and object entities, incorporating relative position information through rotary position embedding (RoPE). Finally, comparative experiments with existing mainstream joint extraction models were conducted, and the proposed model’s excellent performance was validated on the English public datasets NYT and WebNLG, the Chinese public dataset DuIE, and a private track circuit dataset. The F1 scores on the NYT, WebNLG, and DuIE public datasets reached 92.1%, 92.7%, and 78.2%, respectively. Full article
(This article belongs to the Section Sensor Networks)
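A minimal sketch of one residual dilate gated convolution layer of the kind referred to above (a dilated Conv1d producing values and a gate that mixes them with the unchanged input) is given below; the hidden size, kernel size, and dilation schedule are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedGatedConv1d(nn.Module):
    """One residual dilated gated convolution layer over a token sequence.
    Stacking layers with growing dilation rates widens the context window."""
    def __init__(self, hidden, dilation):
        super().__init__()
        self.conv = nn.Conv1d(hidden, 2 * hidden, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                      # x: (batch, hidden, seq_len)
        value, gate = self.conv(x).chunk(2, dim=1)
        gate = torch.sigmoid(gate)
        return x * (1 - gate) + value * gate   # gated residual mix

encoder = nn.Sequential(*[DilatedGatedConv1d(256, d) for d in (1, 2, 4, 8)])
print(encoder(torch.randn(2, 256, 50)).shape)  # torch.Size([2, 256, 50])
```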
Figure 1: Example of overlapping relations.
Figure 2: The structure of the joint extraction model for track circuit entities and relations integrating Global Pointer and tensor learning.
Figure 3: The structure of a multi-layer dilate gated convolutional neural network.
Figure 4: Example of how to construct a three-dimensional word relation tensor from word tables.
Figure 5: Knowledge association structure diagram.
Figure 6: The results of different methods on the track circuit validation set.
Figure 7: The experimental results using different upstream models on the track circuit test set.
Figure 8: The triplet extraction performance under different dimensions of the core tensor 𝒢.
Figure 9: The parameter-tuning experiment for α.
Figure 10: The parameter-tuning experiment for γ.
Figure 11: The model extracts entity types from case sentences.
Figure 12: The model extracts relation types from case sentences.
22 pages, 10749 KiB  
Article
Research on Fault Diagnosis of Rotating Parts Based on Transformer Deep Learning Model
by Zilin Zhang, Yaohua Deng, Xiali Liu and Jige Liao
Appl. Sci. 2024, 14(22), 10095; https://doi.org/10.3390/app142210095 - 5 Nov 2024
Viewed by 481
Abstract
The rotating parts of large and complex equipment are key components that ensure the normal operation of the equipment. Accurate fault diagnosis is crucial for the safe operation of these systems. To simultaneously extract both local and global valuable fault feature information from key components of complex equipment, this study proposes a fault diagnosis network model, named MultiDilatedFormer, which is based on the fusion of transformer and multi-head dilated convolution. The newly designed multi-head dilated convolution module is sequentially integrated into the transformer-encoder architecture, constructing a feature extraction module where the complementary advantages of both components enhance overall performance. Firstly, the sample is expanded into a two-dimensional feature map and then input into the newly designed feature extraction module. Finally, the diagnostic output is performed by the designed patch feature fusion module and classifier module. Additionally, interpretability research is conducted on the proposed model, aiming to understand the decision-making mechanism of the model through visual analysis of the entire decision process. The experimental results on three different datasets indicate that the proposed model achieved high accuracy in fault diagnosis with relatively short data windows. The highest accuracy reached 97.95%, which was up to 10.97% higher than other models. Furthermore, the feasibility of the model is also verified in the actual dataset of the rotating parts of the injection molding machine. The excellent performance of the model on different datasets demonstrates its effectiveness in extracting comprehensive fault feature information and also proves its great potential in practical industrial applications. Full article
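As a rough sketch of the multi-head dilated convolution idea (not the released MultiDilatedFormer code), the module below splits the embedding channels into heads and gives each head a 1-D convolution with its own dilation rate before re-projecting; the head count and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadDilatedConv(nn.Module):
    """Channels are split into heads, each head applies a dilated Conv1d over
    the patch sequence, and the heads are concatenated and projected."""
    def __init__(self, dim=512, rates=(1, 2, 4, 8)):
        super().__init__()
        assert dim % len(rates) == 0
        head_dim = dim // len(rates)
        self.heads = nn.ModuleList([
            nn.Conv1d(head_dim, head_dim, 3, padding=r, dilation=r) for r in rates])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, n_patches, dim)
        x = x.transpose(1, 2)                   # -> (batch, dim, n_patches)
        chunks = x.chunk(len(self.heads), dim=1)
        y = torch.cat([h(c) for h, c in zip(self.heads, chunks)], dim=1)
        return self.proj(y.transpose(1, 2))     # back to (batch, n_patches, dim)

print(MultiHeadDilatedConv()(torch.randn(2, 25, 512)).shape)  # torch.Size([2, 25, 512])
```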
Figure 1: MultiDilatedFormer model framework.
Figure 2: Schematic diagram of data sliding window sampling.
Figure 3: Multi-head self-attention layer.
Figure 4: Multi-head dilated convolutional layer.
Figure 5: Global average pooling layer.
Figure 6: The accuracy curves of the XJTU-SY dataset.
Figure 7: Normalized confusion matrix of the XJTU-SY dataset: (a) WDCNN; (b) DRCNN; (c) DialetedNN; (d) Vision Transformer; (e) MultiDilatedFormer.
Figure 8: The t-SNE cluster diagram of the XJTU-SY dataset: (a) WDCNN; (b) DRCNN; (c) DialetedNN; (d) Vision Transformer; (e) MultiDilatedFormer (the red circle in the figure represents the mixed parts).
Figure 9: Accuracy boxplot of the CWRU dataset.
Figure 10: Normalized confusion matrix of the CWRU dataset: (a) WDCNN; (b) DRCNN; (c) DialetedNN; (d) Vision Transformer; (e) MultiDilatedFormer.
Figure 11: The t-SNE cluster diagram of the CWRU dataset: (a) WDCNN; (b) DRCNN; (c) DialetedNN; (d) Vision Transformer; (e) MultiDilatedFormer.
Figure 12: Model decision-making process visualization: (a) input sample, size (1, 105); (b) after expand, size (105, 105); (c) input embedding, size (25, 512); (d) positional encoding, size (25, 512); (e) multi-attention, size (25, 512); (f) Multi-DilatedConv, size (25, 512); (g) patch fusion, size (1, 512); (h) output, size (1, 10). The left is a bar chart, and the right is a heat map.
Figure 13: Visualization of multi-head mechanism: (a) query of MSL; (b) key of MSL; (c) value of MSL; (d) D1 after MDL.
Figure 14: Three damaged small components (check valve, check ring): (a) check valve; (b) check ring.
Figure 15: Visualization of fault data.
Figure 16: The t-SNE cluster diagram of the actual scene dataset.
16 pages, 2958 KiB  
Article
Heart Rate Estimation Algorithm Integrating Long and Short-Term Temporal Features
by Jie Sun, Zhanwang Zhang, Jiaqi Liu, Lijian Zhou and Songtao Hu
Mathematics 2024, 12(21), 3444; https://doi.org/10.3390/math12213444 - 4 Nov 2024
Viewed by 744
Abstract
Non-contact heart rate monitoring from facial videos utilizing remote photoplethysmography (rPPG) has gained significant traction in remote health monitoring. Given that rPPG captures the dynamic blood flow within the human body and constitutes a time-series signal characterized by periodic properties, this study introduced a three-dimensional convolutional neural network (3D CNN) designed to simultaneously address long-term periodic and short-term temporal characteristics for effective rPPG signal extraction. Firstly, differential operations are employed to preprocess video data, enhancing the face’s dynamic features. Secondly, building upon the 3D CNN framework, multi-scale dilated convolutions and self-attention mechanisms were integrated to enhance the model’s temporal modeling capabilities further. Finally, interpolation techniques are applied to refine the heart rate calculation methodology. The experiments conducted on the UBFC-rPPG dataset indicate that, compared with the existing optimal algorithm, the average absolute error (MAE) and the root mean square error (RMSE) realized significant enhancements of approximately 28% and 35%. Additionally, through comprehensive analyses such as cross-dataset experiments and complexity analyses, the validity and stability of the proposed algorithm in the task of heart rate estimation were manifested. Full article
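A minimal sketch of the multi-scale temporal idea, two parallel 3-D convolutions where one is dilated only along the time axis, is shown below; the kernel sizes and the temporal dilation rate are assumptions, not the LSTA-rPPGNet design.

```python
import torch
import torch.nn as nn

class TemporalDilated3D(nn.Module):
    """Two parallel 3-D convolutions over a (T, H, W) clip: dilation 1 for
    short-term changes and a time-axis dilation for longer-range periodicity."""
    def __init__(self, in_ch, out_ch, t_dilation=4):
        super().__init__()
        self.short = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.long = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              padding=(t_dilation, 1, 1),
                              dilation=(t_dilation, 1, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                # x: (batch, C, T, H, W)
        return self.act(self.short(x) + self.long(x))

clip = torch.randn(1, 3, 64, 32, 32)     # 64 frames of 32x32 face crops
print(TemporalDilated3D(3, 16)(clip).shape)  # torch.Size([1, 16, 64, 32, 32])
```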
Figure 1: The overall framework of the algorithm.
Figure 2: The model diagram of the Long-Short Term Attention rPPG Network (LSTA-rPPGNet), where Conv3D Block represents a sequence of 3D convolutional blocks (3D convolution, batch normalization, and ReLU activation function). Maxpool-3D represents 3D max pooling, LST-SAM represents the Long-Short Term Self-Attention Module, and ConvTrans3D represents 3D transposed convolution.
Figure 3: Long Short-Term Self-Attention Module (LST-SAM).
Figure 4: Video length comparison experiment. (The downward arrows next to MAE and RMSE in the legend indicate that lower values are better, meaning that smaller error values represent improved performance.)
Figure 5: Facial region preprocessing.
Figure 6: Comparison of resolution size. (The downward arrows next to MAE and RMSE in the legend indicate that lower values are better, meaning that smaller error values represent improved performance.)
Figure 7: Comparison of predicted HRs with ground-truth HRs for testing data. The x-axis represents the predicted heart rate (Predicted HR (bpm)), and the y-axis represents the actual heart rate (Ground truth HR (bpm)).
Figure 8: Comparison of predicted rPPG signal (blue) and ground truth signal (red): (a) obtained from subject 4; (b) obtained from subject 10.
15 pages, 13201 KiB  
Article
Quantifying Shape and Texture Biases for Enhancing Transfer Learning in Convolutional Neural Networks
by Akinori Iwata and Masahiro Okuda
Signals 2024, 5(4), 721-735; https://doi.org/10.3390/signals5040040 - 4 Nov 2024
Viewed by 497
Abstract
Neural networks have inductive biases owing to the assumptions associated with the selected learning algorithm, datasets, and network structure. Specifically, convolutional neural networks (CNNs) are known for their tendency to exhibit textural biases. This bias is closely related to image classification accuracy. Aligning the model’s bias with the dataset’s bias can significantly enhance performance in transfer learning, leading to more efficient learning. This study aims to quantitatively demonstrate that increasing shape bias within the network by varying kernel sizes and dilation rates improves accuracy on shape-dominant data and enables efficient learning with less data. Furthermore, we propose a novel method for quantitatively evaluating the balance between texture bias and shape bias. This method enables efficient learning by aligning the biases of the transfer learning dataset with those of the model. Systematically adjusting these biases allows CNNs to better fit data with specific biases. Compared to the original model, an accuracy improvement of up to 9.9% was observed. Our findings underscore the critical role of bias adjustment in CNN design, contributing to developing more efficient and effective image classification models. Full article
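The effect the paper varies, how kernel size and dilation rate change a stack's receptive field, can be checked with the short helper below; it uses the standard receptive-field recursion and is a generic calculation, not code from the paper.

```python
def effective_kernel(k, dilation):
    """Effective kernel extent of a dilated convolution: k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) conv layers,
    using the recursion rf += (k_eff - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, stride, dilation in layers:
        rf += (effective_kernel(k, dilation) - 1) * jump
        jump *= stride
    return rf

# Same three 3x3 layers, stride 1: raising the dilation rate from 1 to 3
# roughly triples the receptive field, shifting features toward shape.
print(receptive_field([(3, 1, 1)] * 3))   # 7
print(receptive_field([(3, 1, 3)] * 3))   # 19
```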
Figure 1: The combined shape/texture images used to calculate the bias metric (on the left side) included a shape-dominant image in the upper part and a texture-dominant image in the lower part. This combined image is used for transfer learning in a two-class classification. Subsequently, these test-combined images are input into the model, and the shape/texture bias is calculated through gradient-weighted class activation mapping (Grad-CAM).
Figure 2: Results of Grad-CAM visualization used with the proposed shape/texture bias metric. In the case of the original ResNeXt (a texture-biased model), the heat map of the lower image (which is texture-dominant) turns red, indicating that the model is focusing on it. On the other hand, for the ResNeXt with a dilation rate of three (a shape-biased model), the heat map of the upper image (which is shape-dominant) turns red, indicating its focus. This demonstrates that simply increasing the dilation rate results in a stronger bias towards shapes in the model. (a) The visualization images were obtained by applying Grad-CAM to the original ResNeXt (texture-biased model). (b) The visualization images were obtained by applying Grad-CAM to ResNeXt with a dilation of three (shape-biased model).
Figure 3: Results of limiting the training data for each dataset. The accuracy rate is defined as the accuracy achieved with limited training data divided by the accuracy achieved with the entire dataset. The data ratio represents the proportion of the data used for training. (a) Results of reducing the amount of training data in the Logo dataset. (b) Results of reducing the amount of training data in the Cartoon dataset. (c) Results of reducing the amount of training data in the Sketch dataset.
16 pages, 8285 KiB  
Technical Note
A Feature-Driven Inception Dilated Network for Infrared Image Super-Resolution Reconstruction
by Jiaxin Huang, Huicong Wang, Yuhan Li and Shijian Liu
Remote Sens. 2024, 16(21), 4033; https://doi.org/10.3390/rs16214033 - 30 Oct 2024
Viewed by 458
Abstract
Image super-resolution (SR) algorithms based on deep learning yield good visual performances on visible images. Due to the blurred edges and low contrast of infrared (IR) images, methods transferred directly from visible images to IR images have a poor performance and ignore the demands of downstream detection tasks. Therefore, an Inception Dilated Super-Resolution (IDSR) network with multiple branches is proposed. A dilated convolutional branch captures high-frequency information to reconstruct edge details, while a non-local operation branch captures long-range dependencies between any two positions to maintain the global structure. Furthermore, deformable convolution is utilized to fuse features extracted from different branches, enabling adaptation to targets of various shapes. To enhance the detection performance of low-resolution (LR) images, we crop the images into patches based on target labels before feeding them to the network. This allows the network to focus on learning the reconstruction of the target areas only, reducing the interference of background areas in the target areas’ reconstruction. Additionally, a feature-driven module is cascaded at the end of the IDSR network to guide the high-resolution (HR) image reconstruction with feature prior information from a detection backbone. This method has been tested on the FLIR Thermal Dataset and the M3FD Dataset and compared with five mainstream SR algorithms. The final results demonstrate that our method effectively maintains image texture details. More importantly, our method achieves 80.55% mAP, outperforming other methods on FLIR Dataset detection accuracy, and with 74.7% mAP outperforms other methods on M3FD Dataset detection accuracy. Full article
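As a simplified, hedged sketch of the two-branch mixer described here, the module below pairs a dilated 3×3 branch with a cheap global-context branch and fuses them with a 1×1 convolution; the paper's non-local branch and deformable-convolution fusion are replaced by simpler stand-ins, so this shows the layout only.

```python
import torch
import torch.nn as nn

class InceptionDilatedMixer(nn.Module):
    """Two-branch mixer sketch: a dilated 3x3 branch for local high-frequency
    detail and a pooled global-context branch (standing in for the non-local
    operation); a 1x1 convolution fuses the branches."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        local = self.local(x)
        ctx = x * self.global_ctx(x)          # globally informed reweighting
        return x + self.fuse(torch.cat([local, ctx], dim=1))

print(InceptionDilatedMixer(64)(torch.randn(1, 64, 48, 48)).shape)  # (1, 64, 48, 48)
```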
Figure 1: The overall structure of the proposed method, which mainly consists of three parts: a data preprocessing method to crop the images into patches, an SR reconstruction network to generate SR images, and a feature-driven module to improve the detection accuracy.
Figure 2: The architecture of the proposed IDSR for image super-resolution.
Figure 3: The details of the Inception Dilated Mixer (IDM).
Figure 4: Frequency magnitude from 8 output channels of the high-frequency extractor and low-frequency extractor.
Figure 5: Super-resolution reconstruction results for LR images from the FLIR dataset (200 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-08989 and FLIR-08951.
Figure 6: The analysis of loss weight (λ, μ) selection in our method.
Figure 7: Super-resolution reconstruction results for LR images from the FLIR dataset by feature-driven IDSR (our method, 300 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-08989 and FLIR-08951.
Figure 8: Object detection (YOLOv7) results for SR images from the FLIR dataset by feature-driven IDSR (our method, 300 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-09401 and FLIR-09572.
17 pages, 8754 KiB  
Article
Dq-YOLOF: An Effective Improvement with Deformable Convolution and Sample Quality Optimization Based on the YOLOF Detector
by Xiaoxia Qi, Md Gapar Md Johar, Ali Khatibi, Jacquline Tham and Long Cheng
Electronics 2024, 13(21), 4204; https://doi.org/10.3390/electronics13214204 - 27 Oct 2024
Viewed by 615
Abstract
Single-stage detectors have drawbacks of insufficient accuracy and poor coverage capability. YOLOF (You Only Look One-level Feature) has achieved better performance in this regard, but there is still room for improvement. To enhance the coverage capability for objects of different scales, we propose an improved single-stage object detector: Dq-YOLOF. We have designed an output encoder that employs a series of modules utilizing deformable convolution and SimAM (Simple Attention Module). This module replaces the dilated convolution in YOLOF. This design significantly improves the ability to express details. Simultaneously, we have redefined the sample selection strategy, which optimizes the quality of positive samples based on SimOTA. It can dynamically allocate positive samples according to their quality, reducing computational load and making it more suitable for small objects. Experiments conducted on the COCO 2017 dataset also verify the effectiveness of our method. Dq-YOLOF achieved 38.7 AP, 1.5 AP higher than YOLOF. To confirm performance improvements on small objects, our method was tested on urinary sediment and aerial drone datasets for generalization. Notably, it enhances performance while also lowering computational costs. Full article
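A rough sketch of one encoder block of the kind described above, a deformable convolution (via torchvision's DeformConv2d, with offsets predicted by a plain convolution) followed by parameter-free SimAM attention, is given below; the block layout and the SimAM lambda are assumptions, not the Dq-YOLOF release.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention: per-pixel energy turned into a sigmoid weight."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    return x * torch.sigmoid(d / (4 * (v + lam)) + 0.5)

class DeformSimAMBlock(nn.Module):
    """3x3 deformable convolution with learned offsets, followed by SimAM."""
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)   # (dx, dy) per kernel tap
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.deform(x, self.offset(x))
        return simam(torch.relu(y))

print(DeformSimAMBlock(64)(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 20, 20])
```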
Figure 1: (a) The receptive field coverage of YOLOF’s original method, which can cover features at multiple scales; (b) the improved method, which further expands on YOLOF’s original coverage and improves on both small and large objects.
Figure 2: The figure shows the main process of our method, where we focused on improving the encoder and decoder.
Figure 3: Diagrammatic representation of the dilated encoder structure. C5 is the input feature, and P5 is the output feature. The label 3 × 3 represents the dilated convolution, SimAM is the parameter-free attention, and DCN is the deformable convolution.
Figure 4: The top shows the boxes labeled by the original YOLOF sample selection, and the bottom shows the boxes labeled by the Dq-YOLOF sample selection.
Figure 5: Number of labels per category in the dataset.
Figure 6: Number of small, medium, and large objects in the Urised11 dataset and COCO2017 dataset: the left is the Urised11 dataset, and the right is the COCO2017 dataset.
Figure 7: Five structures compared with the original method, where (a) is the original method, Conv stands for dilated convolution, DCN stands for deformable convolution, and SimAM stands for parameter-free attention. (b) employs parallel dilated convolution and deformable convolution, followed by a SimAM module in series after each convolution. (c) replaces the dilated convolution in (b) with deformable convolution. (d) replaces the dilated convolution in (a) with deformable convolution. (e) combines both (c) and (d).
Figure 8: Loss curves for YOLOF and Dq-YOLOF. The left is the regression loss and the right is the classification loss.
17 pages, 18662 KiB  
Article
Symmetric Connected U-Net with Multi-Head Self Attention (MHSA) and WGAN for Image Inpainting
by Yanyang Hou, Xiaopeng Ma, Junjun Zhang and Chenxian Guo
Symmetry 2024, 16(11), 1423; https://doi.org/10.3390/sym16111423 - 25 Oct 2024
Viewed by 656
Abstract
This study presents a new image inpainting model based on U-Net and incorporating the Wasserstein Generative Adversarial Network (WGAN). The model uses skip connections to connect every encoder block to the corresponding decoder block, resulting in a strictly symmetrical architecture referred to as Symmetric Connected U-Net (SC-Unet). By combining SC-Unet with a GAN, the study aims to reconstruct images more effectively and seamlessly. The traditional discriminators only differentiate the entire image as true or false. In this study, the discriminator calculated the probability of each pixel belonging to the hole and non-hole regions, which provided the generator with more gradient loss information for image inpainting. Additionally, every block of SC-Unet incorporated a Dilated Convolutional Neural Network (DCNN) to increase the receptive field of the convolutional layers. Our model also integrated Multi-Head Self-Attention (MHSA) into selected blocks to enable it to efficiently search the entire image for suitable content to fill the missing areas. This study adopts the publicly available datasets CelebA-HQ and ImageNet for evaluation. Our proposed algorithm demonstrates a 10% improvement in PSNR and a 2.94% improvement in SSIM compared to existing representative image inpainting methods in the experiment. Full article
(This article belongs to the Section Computer)
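To illustrate the per-pixel discriminator idea (a probability for every pixel rather than one real/fake score per image), a minimal sketch follows; the dilated widths and depth are placeholders, and the WGAN training loss discussed in the abstract is omitted.

```python
import torch
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    """Discriminator that keeps full resolution and emits one probability per
    pixel (hole vs. non-hole region) instead of a single real/fake score."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))     # (B, 1, H, W) per-pixel probabilities

scores = PixelDiscriminator()(torch.randn(2, 3, 256, 256))
print(scores.shape)  # torch.Size([2, 1, 256, 256])
```

Training such a discriminator against the inpainting mask gives the generator a dense, spatially resolved gradient signal rather than a single scalar.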
Figure 1: SC-Unet architecture, with a 3 × 256 × 256 image as input.
Figure 2: (Left) The original convolutional block of U-Net; (right) the Dilated Convolutional Neural Network in the convolutional block of SC-Unet.
Figure 3: Multi-Head Self-Attention in the convolutional block of SC-Unet.
Figure 4: The main framework of our image inpainting model.
Figure 5: Sample of results with the CelebA-HQ dataset. (a) Ground truth image. (b) Input image. (c) Reconstructed image. (d) Mask image. (e) Predicted image.
Figure 6: Sample of results with the ImageNet dataset. (a) Ground truth image. (b) Input image. (c) Reconstructed image. (d) Mask image. (e) Predicted image.
Figure 7: Comparison with GLCIC and CA models on ImageNet.
Figure 8: Comparison with GLCIC and CA models on CelebA-HQ.
Figure 9: Ablation study with different model modules on ImageNet.
Figure 10: Ablation study with different model modules on CelebA-HQ.
Figure A1: Ablation study with different model modules on ImageNet.
Figure A2: Ablation study with different model modules on ImageNet.
Figure A3: Ablation study with different model modules on CelebA-HQ.
Figure A4: Ablation study with different model modules on CelebA-HQ.
23 pages, 5508 KiB  
Article
YOLO-DroneMS: Multi-Scale Object Detection Network for Unmanned Aerial Vehicle (UAV) Images
by Xueqiang Zhao and Yangbo Chen
Drones 2024, 8(11), 609; https://doi.org/10.3390/drones8110609 - 24 Oct 2024
Viewed by 1067
Abstract
In recent years, research on Unmanned Aerial Vehicles (UAVs) has developed rapidly. Compared to traditional remote-sensing images, UAV images exhibit complex backgrounds, high resolution, and large differences in object scales. Therefore, UAV object detection is an essential yet challenging task. This paper proposes a multi-scale object detection network, namely YOLO-DroneMS (You Only Look Once for Drone Multi-Scale Object), for UAV images. Targeting the pivotal connection between the backbone and neck, the Large Separable Kernel Attention (LSKA) mechanism is adopted with the Spatial Pyramid Pooling Factor (SPPF), where weighted processing of multi-scale feature maps is performed to focus more on features. And Attentional Scale Sequence Fusion DySample (ASF-DySample) is introduced to perform attention scale sequence fusion and dynamic upsampling to conserve resources. Then, the faster cross-stage partial network bottleneck with two convolutions (named C2f) in the backbone is optimized using the Inverted Residual Mobile Block and Dilated Reparam Block (iRMB-DRB), which balances the advantages of dynamic global modeling and static local information fusion. This optimization effectively increases the model’s receptive field, enhancing its capability for downstream tasks. By replacing the original CIoU with WIoUv3, the model prioritizes anchoring boxes of superior quality, dynamically adjusting weights to enhance detection performance for small objects. Experimental findings on the VisDrone2019 dataset demonstrate that at an Intersection over Union (IoU) of 0.5, YOLO-DroneMS achieves a 3.6% increase in mAP@50 compared to the YOLOv8n model. Moreover, YOLO-DroneMS exhibits improved detection speed, increasing the number of frames per second (FPS) from 78.7 to 83.3. The enhanced model supports diverse target scales and achieves high recognition rates, making it well-suited for drone-based object detection tasks, particularly in scenarios involving multiple object clusters. Full article
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)
Figure 1: YOLOv8 network structure.
Figure 2: Model architecture of YOLO-DroneMS. SPPF-LSKA is adopted to perform weighted processing on multi-scale feature maps, while C2f-iRMB-DRB is used for balancing the advantages of dynamic global modeling and local information fusion. Additionally, ASF-DySample is utilized to dynamically upsample the neck section.
Figure 3: Structure of LSKA and SPPF-LSKA. The LSKA mechanism is applied after SPPF to perform weighted processing on multi-scale feature maps.
Figure 4: Point sampling of DySample. DySample dynamically adjusts sampling points by summing offsets with original grid positions.
Figure 5: Architecture of the ASF-DySample, in which the DynamicScalSeq module merges the features from shallow feature maps by the Add operation.
Figure 6: Structure of the iRMB network. iRMB fuses lightweight CNN architectures and attention-based model structures.
Figure 7: Structure of the iRMB-DRB network.
Figure 8: Structure of the C2f-iRMB-DRB submodule. The iRMB-DRB network enhances the model’s capability to handle features at various scales.
Figure 9: Total number of object instances in the VisDrone2019 dataset.
Figure 10: Comparison of mAP@50 for different models on the VisDrone2019 dataset.
Figure 11: Comparison of confusion matrices of YOLOv8 and YOLO-DroneMS. It is seen that the “pedestrian” class has an increase of 9%, and the “people” and “car” categories have improved by 8% and 7%, respectively.
Figure 12: Visualization of the comparison of different models.
Figure 13: Statistical plot of the classes in the RiverInspect-2024 dataset.
Figure 14: Examples of RiverInspect-2024. It is used for a river drone-inspection project.
14 pages, 2370 KiB  
Article
AMW-YOLOv8n: Road Scene Object Detection Based on an Improved YOLOv8
by Donghao Wu, Chao Fang, Xiaogang Zheng, Jue Liu, Shengchun Wang and Xinyu Huang
Electronics 2024, 13(20), 4121; https://doi.org/10.3390/electronics13204121 - 19 Oct 2024
Viewed by 797
Abstract
This study introduces an improved YOLOv8 model tailored for detecting objects in road scenes. To overcome the limitations of standard convolution operations in adapting to varying targets, we introduce Adaptive Kernel Convolution (AKconv). AKconv dynamically adjusts the convolution kernel’s shape and size, enhancing the backbone network’s feature extraction capabilities and improving feature representation across different scales. Additionally, we employ a Multi-Scale Dilated Attention (MSDA) mechanism to focus on key target features, further enhancing feature representation. To address the challenge posed by YOLOv8’s large down sampling factor, which limits the learning of small target features in deeper feature maps, we add a small target detection layer. Finally, to improve model training efficiency, we introduce a regression loss function with a Wise-IoU dynamic non-monotonic focusing mechanism. With these enhancements, our improved YOLOv8 model excels in road scene object detection tasks, achieving a 5.6 percentage point improvement in average precision over the original YOLOv8n on real road datasets. Full article
Figure 1: AMW-YOLOv8n network diagram.
Figure 2: Schematic diagram of AKconv.
Figure 3: Diagram of the Multi-Scale Dilated Attention (MSDA) structure. The red square in the figure represents the query patch, while the surrounding colored patches are the representative patches selected in the Sliding Window Dilated Attention operation.
Figure 4: Comparison of training results between YOLOv8n and AMW-YOLOv8n.
Figure 5: Detection effect comparison. (a) Detection results of the YOLOv8n algorithm. (b) Detection results of AMW-YOLOv8n.
Figure 6: Comparative results of mean average precision (mAP) on different targets.