Search Results (414)

Search Parameters:
Keywords = Swin-Transformer

20 pages, 6303 KiB  
Article
Progressive Transmission Line Image Transmission and Recovery Algorithm Based on Hybrid Attention and Feature Fusion for Signal-Free Regions of Transmission Lines
by Xiu Ji, Xiao Yang, Zheyu Yue, Hongliu Yang and Haiyang Guo
Electronics 2024, 13(23), 4605; https://doi.org/10.3390/electronics13234605 - 22 Nov 2024
Viewed by 120
Abstract
In this paper, a progressive image transmission and recovery algorithm based on a hybrid attention mechanism and feature fusion is proposed, aiming to solve the challenge of monitoring signal-free regions of transmission lines. The method combines the wavelet transform, the Swin Transformer, and a hybrid attention module with a Pixel Shuffle upsampling mechanism to balance image transmission quality and efficiency in low-bandwidth environments. An initial preview is obtained by prioritizing the transmission of low-frequency subbands from the wavelet transform; the weight allocation of key features is then dynamically optimized using hybrid attention and a local-window multi-scale self-attention mechanism, and the resolution of the decoded image is further enhanced through Pixel Shuffle upsampling. Experimental results show that the algorithm significantly outperforms existing methods in image quality (PSNR, SSIM), transmission efficiency, and bandwidth utilization, demonstrating its adaptability and effectiveness for surveillance in signal-free regions.
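The abstract pairs Swin Transformer features with Pixel Shuffle upsampling to raise the resolution of the decoded image. As a rough illustration of the sub-pixel upsampling step only (not the authors' exact module; the channel counts and the 3×3 convolution are assumptions), a PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn

class PixelShuffleUpsample(nn.Module):
    """Minimal sub-pixel upsampling block: expand channels with a conv,
    then fold them into spatial resolution with nn.PixelShuffle.
    Channel counts are illustrative, not values from the paper."""
    def __init__(self, in_channels: int, scale: int = 2):
        super().__init__()
        # Conv produces scale**2 times the channels that PixelShuffle folds into space.
        self.conv = nn.Conv2d(in_channels, in_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (B, C*r^2, H, W) -> (B, C, H*r, W*r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

# Example: upsample a 64-channel feature map from 32x32 to 64x64.
feat = torch.randn(1, 64, 32, 32)
up = PixelShuffleUpsample(64, scale=2)
print(up(feat).shape)  # torch.Size([1, 64, 64, 64])
```

Pixel Shuffle trades channels for resolution, which is one common reason it is preferred over transposed convolutions in image-recovery decoders.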
Figure 1. Hybrid attention module.
Figure 2. Swin Transformer module: (a) Swin Transformer structure; (b) Feed Forward structure.
Figure 3. Pixel Shuffle upsampling module.
Figure 4. Network structure.
Figure 5. Images taken by the group in a signal-free area in Jilin.
Figure 6. Comparison of feature extraction between the traditional algorithm and this paper's algorithm.
Figure 7. Visualization of image recovery by different methods.
Figure 8. The dataset of images taken by the group in the unsignalized area of Jilin Province was recovered by different methods and then input into YOLOv10 for detection.
Figure 9. Visualization of image recovery by different methods: (a–d) Progressive Disentangling; (e–h) MPRNet; (i–l) the method of this paper.
12 pages, 6649 KiB  
Article
Masked Image Modeling Meets Self-Distillation: A Transformer-Based Prostate Gland Segmentation Framework for Pathology Slides
by Haoyue Zhang, Sushant Patkar, Rosina Lis, Maria J. Merino, Peter A. Pinto, Peter L. Choyke, Baris Turkbey and Stephanie Harmon
Cancers 2024, 16(23), 3897; https://doi.org/10.3390/cancers16233897 - 21 Nov 2024
Viewed by 220
Abstract
Detailed evaluation of prostate cancer glands is an essential yet labor-intensive step in grading prostate cancer. Gland segmentation can serve as a valuable preliminary step for machine-learning-based downstream tasks, such as Gleason grading, patient classification, cancer biomarker building, and survival analysis. Despite its importance, there is currently a lack of a reliable gland segmentation model for prostate cancer. Without accurate gland segmentation, researchers rely on cell-level or human-annotated regions of interest for pathomic and deep feature extraction. This approach is sub-optimal, as the extracted features are not explicitly tailored to gland information. Although foundational segmentation models have gained considerable interest, we demonstrate the limitations of that approach. This work proposes a prostate gland segmentation framework that utilizes a dual-path Swin Transformer UNet structure and leverages Masked Image Modeling for large-scale self-supervised pretraining. A tumor-guided self-distillation step further fuses the binary tumor labels of each patch into the encoder to ensure the encoders are suitable for the gland segmentation step. We united heterogeneous data sources for self-supervised training, including biopsy and surgical specimens, to reflect the diversity of benign and cancerous pathology features. We evaluated the segmentation performance on two publicly available prostate cancer datasets, achieving state-of-the-art performance with a test mDice of 0.947 on the PANDA dataset and a test mDice of 0.664 on the SICAPv2 dataset.
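The framework relies on Masked Image Modeling for self-supervised pretraining. A minimal sketch of the random patch-masking step such a pipeline typically starts from is shown below; the 16-pixel patch size and 0.6 mask ratio are assumptions, and the paper's actual masking and reconstruction targets may differ:

```python
import torch

def random_patch_mask(images: torch.Tensor, patch: int = 16, mask_ratio: float = 0.6):
    """Randomly mask a fraction of non-overlapping patches, as in masked image modeling.
    images: (B, C, H, W) with H and W divisible by `patch`.
    Patch size and mask ratio are assumed defaults, not the paper's settings."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_mask = int(mask_ratio * n_patches)

    # Random rank per patch (double argsort turns noise into a permutation rank),
    # then mark the n_mask lowest-ranked patches as masked.
    ranks = torch.rand(b, n_patches, device=images.device).argsort(dim=1).argsort(dim=1)
    mask = ranks < n_mask

    # Zero out masked patches (a learnable mask token could be used instead).
    mask_2d = mask.view(b, 1, gh, gw).repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    masked = images.masked_fill(mask_2d, 0.0)
    return masked, mask  # mask marks which patches the decoder must reconstruct

masked, mask = random_patch_mask(torch.randn(2, 3, 224, 224))
print(masked.shape, mask.float().mean().item())  # roughly 0.6 of patches masked
```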
(This article belongs to the Section Methods and Technologies Development)
Figure 1. Sample slides from the three data cohorts. The top slide is from SICAPv2. Note that the SICAPv2 dataset is provided in a patch form, so the sample shown in this figure was stitched back based on the given coordinates. The bottom-left slide is from the PANDA cohort. The bottom-right slide is a whole-mount slide from our in-house dataset NCI.
Figure 2. Overview of the proposed model for prostate gland segmentation. Section (A) shows the architecture of our proposed dual-path segmentation architecture. Section (B) shows our preprocessing, self-supervised learning, and self-distillation schema for the self-supervised learning step.
Figure 3. Sample segmentation results for different Gleason grade glands across different methods. Compared with other methods, many small spots were removed by the tumor classification head in our network, which yielded a better visual representation without any post-processing smoothing methods.
21 pages, 12271 KiB  
Article
Detection of Marine Oil Spill from PlanetScope Images Using CNN and Transformer Models
by Jonggu Kang, Chansu Yang, Jonghyuk Yi and Yangwon Lee
J. Mar. Sci. Eng. 2024, 12(11), 2095; https://doi.org/10.3390/jmse12112095 - 19 Nov 2024
Viewed by 357
Abstract
The contamination of marine ecosystems by oil spills poses a significant threat to the marine environment, necessitating the prompt and effective implementation of measures to mitigate the associated damage. Satellites offer a spatial and temporal advantage over aircraft and unmanned aerial vehicles (UAVs) in oil spill detection due to their wide-area monitoring capabilities. While oil spill detection has traditionally relied on synthetic aperture radar (SAR) images, the combined use of optical satellite sensors alongside SAR can significantly enhance monitoring capabilities, providing improved spatial and temporal coverage. The advent of deep learning methodologies, particularly convolutional neural networks (CNNs) and Transformer models, has generated considerable interest in their potential for oil spill detection. In this study, we conducted a comprehensive and objective comparison to evaluate the suitability of CNN and Transformer models for marine oil spill detection. High-resolution optical satellite images were used to optimize DeepLabV3+, a widely utilized CNN model; Swin-UPerNet, a representative Transformer model; and Mask2Former, which employs a Transformer-based architecture for both encoding and decoding. Cross-validation yielded a mean Intersection over Union (mIoU) of 0.740, 0.840, and 0.804 for the three models, respectively, indicating their potential for detecting oil spills in the ocean. Additionally, we performed a histogram analysis on the predicted oil spill pixels, which allowed us to classify the types of oil. These findings highlight the considerable promise of Swin Transformer models for oil spill detection in the context of future marine disaster monitoring.
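The reported mIoU values are per-class IoU scores averaged over the classes present. A small sketch of how mIoU can be computed from predicted and reference class maps (a generic implementation, not the authors' evaluation code) is given below:

```python
import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """Compute mean Intersection over Union from arrays of class indices.
    Generic metric sketch, not the paper's evaluation code."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, label == c).sum()
        union = np.logical_or(pred == c, label == c).sum()
        if union > 0:                      # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example with two classes (0 = sea, 1 = oil spill).
label = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
pred  = np.array([[0, 0, 1, 0], [0, 1, 1, 1]])
print(mean_iou(pred, label, num_classes=2))  # 0.775
```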
(This article belongs to the Special Issue Remote Sensing Applications in Marine Environmental Monitoring)
Figure 1. Examples of image processing steps: (a) original satellite images, (b) images after gamma correction and histogram adjustment, and (c) labeled images.
Figure 2. Flowchart of this study, illustrating the processes of labeling, modeling, optimization, and evaluation using the DeepLabV3+, Swin-UPerNet, and Mask2Former models [23,24,25].
Figure 3. Concept of the 5-fold cross-validation in this study.
Figure 4. Examples of image data augmentation using the Albumentations library, including random 90-degree rotation, horizontal flip, vertical flip, optical distortion, grid distortion, RGB shift, and random brightness/contrast adjustment.
Figures 5–9. Randomly selected examples from folds 1–5, including PlanetScope RGB images, segmentation labels, and predictions from DeepLabV3+ (DL), Swin-UPerNet (Swin), and Mask2Former (M2F).
Figures 10–12. Histogram distribution graphs and box plots of oil spill pixels extracted from the labels, DeepLabV3+, Swin-UPerNet, and Mask2Former, for thick oil layers with a dark black tone (Figure 10), thin oil layers with a bright silver tone (Figure 11), and thin oil layers with a bright rainbow tone (Figure 12). The x-axis values represent the digital numbers (DNs) from PlanetScope images. (a) Oil mask, (b) histogram, and (c) box plot.
15 pages, 10336 KiB  
Technical Note
Multi-Scenario Remote Sensing Image Forgery Detection Based on Transformer and Model Fusion
by Jinmiao Zhao, Zelin Shi, Chuang Yu and Yunpeng Liu
Remote Sens. 2024, 16(22), 4311; https://doi.org/10.3390/rs16224311 - 19 Nov 2024
Viewed by 258
Abstract
Recently, remote sensing image forgery detection has received widespread attention. To improve detection accuracy, we build a novel scheme based on Transformers and model fusion. Specifically, we model this task as a binary classification task that focuses on global information. First, we explore the performance of various strong feature extraction networks on this task under a unified classification framework. On this basis, we select three high-performance Transformer-based networks that focus on global information, namely Swin Transformer V1, Swin Transformer V2, and Twins, as the backbone networks and fuse them. Second, considering the small number of samples, we use the public ImageNet-1K dataset to pre-train the networks to learn more stable feature representations. At the same time, a circular data-divide strategy is proposed, which fully utilizes all samples to improve accuracy in the competition. Finally, to promote network optimization, we explore multiple loss functions and select label smoothing loss, which reduces the model's excessive dependence on the training data, and we construct a combined learning-rate optimization strategy that first uses step decay and then cosine annealing, which reduces the risk of the network falling into local optima. Extensive experiments show that the proposed scheme performs well. This scheme won seventh place in the "Forgery Detection in Multi-scenario Remote Sensing Images of Typical Objects" track of the 2024 ISPRS TC I contest on Intelligent Interpretation for Multi-modal Remote Sensing Application.
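Two of the training choices described here, the label smoothing loss and the step-decay-then-cosine-annealing learning-rate schedule, map directly onto standard PyTorch components. The sketch below is a hedged illustration; the epoch counts, decay factors, and optimizer are assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 2)                                   # placeholder classifier head
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)        # label smoothing loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Step decay for the first 30 epochs, then cosine annealing for the remaining 70.
# All schedule hyperparameters here are assumptions, not the competition settings.
step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=70)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[step, cosine], milestones=[30])

for epoch in range(100):
    # ... forward pass, criterion(logits, targets), backward, optimizer.step() ...
    scheduler.step()   # one scheduler step per epoch
```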
(This article belongs to the Special Issue Geospatial Artificial Intelligence (GeoAI) in Remote Sensing)
Figure 1. Overall structure of the proposed scheme.
Figure 2. High-performance forgery detection network architecture.
Figure 3. Circular data divide strategy.
Figure 4. Some samples from the dataset. Samples on the left side of the dotted line are labeled 0, representing real images; samples on the right side are labeled 1, representing fake images.
Figure 5. Detailed classification comparison between the model prediction results and true labels.
Figure 6. Some detection results. Top left: the label is true, but the prediction is false. Top right: the label is true, and the prediction is true. Bottom left: the label is false, but the prediction is true. Bottom right: the label is false, and the prediction is false.
21 pages, 10435 KiB  
Article
SG-LPR: Semantic-Guided LiDAR-Based Place Recognition
by Weizhong Jiang, Hanzhang Xue, Shubin Si, Chen Min, Liang Xiao, Yiming Nie and Bin Dai
Electronics 2024, 13(22), 4532; https://doi.org/10.3390/electronics13224532 - 18 Nov 2024
Viewed by 272
Abstract
Place recognition plays a crucial role in tasks such as loop closure detection and re-localization in robotic navigation. As a high-level representation within scenes, semantics enables models to effectively distinguish geometrically similar places, thereby enhancing robustness to environmental changes. Unlike most existing semantic-based LiDAR place recognition (LPR) methods, which adopt a multi-stage and relatively segregated data-processing and storage pipeline, we propose a novel end-to-end LPR model guided by semantic information, SG-LPR. This model introduces a semantic segmentation auxiliary task to guide the model in autonomously capturing high-level semantic information from the scene, implicitly integrating these features into the main LPR task and thus providing a unified "segmentation-while-describing" framework that avoids additional intermediate data-processing and storage steps. Moreover, the semantic segmentation auxiliary task operates only during model training, so it adds no time overhead during the testing phase. The model also combines the advantages of the Swin Transformer and U-Net to address the shortcomings of current semantic-based LPR methods in capturing global contextual information and extracting fine-grained features. Extensive experiments conducted on multiple sequences from the KITTI and NCLT datasets validate the effectiveness, robustness, and generalization ability of our proposed method, which achieves notable performance improvements over state-of-the-art methods.
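The "segmentation-while-describing" idea amounts to adding an auxiliary segmentation loss that is only computed during training. A schematic training step under assumed components (a triplet loss for the place descriptor, cross-entropy for segmentation, and a weighting factor lambda_seg) might look like this; the actual SG-LPR losses and batch structure may differ:

```python
import torch.nn.functional as F

def training_step(backbone, lpr_head, seg_head, batch, lambda_seg=0.5):
    """Joint 'segmentation-while-describing' step: the segmentation head is
    only used here, at training time; inference calls backbone + lpr_head alone.
    The batch layout, loss choices, and lambda_seg are assumptions."""
    bev, pos_desc, neg_desc, seg_labels = batch       # hypothetical batch layout
    feats = backbone(bev)                             # shared feature extractor

    desc = lpr_head(feats)                            # global place descriptor
    lpr_loss = F.triplet_margin_loss(desc, pos_desc, neg_desc)

    seg_logits = seg_head(feats)                      # auxiliary branch (training only)
    seg_loss = F.cross_entropy(seg_logits, seg_labels)

    return lpr_loss + lambda_seg * seg_loss
```

At test time only the backbone and LPR head are executed, which is why the auxiliary branch adds no inference cost.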
(This article belongs to the Collection Advance Technologies of Navigation for Intelligent Vehicles)
Figure 1. Comparison of system frameworks for semantic-based LPR methods. (a) The prevalent "segmentation-then-describing" framework employed by most existing semantic-based LPR methods, comprising multiple distinct stages. (b) Our proposed "segmentation-while-describing" framework, which implicitly provides high-level semantic features to the primary LPR task through an auxiliary semantic segmentation task (indicated by the yellow dashed box, active only during model training).
Figure 2. Overview of the proposed SG-LPR architecture. It consists of a shared feature extractor (blue area), followed by two parallel branches: one for the LPR task (yellow area) and another for the semantic segmentation task (gray area). These branches are jointly trained to implement the "segmentation-while-describing" framework. Notably, the semantic segmentation branch is active only during training and incurs no additional computational cost during testing.
Figure 3. Architecture of the Feature Extraction Module. We construct this module based on Swin-Unet [47], with the semantic segmentation task branch guiding it to extract feature tensors that are rich in high-level semantic information from raw BEV images.
Figure 4. Architecture of the LPR task branch.
Figure 5. The Precision–Recall curves on multiple sequences of the KITTI dataset.
Figure 6. Qualitative performance at top-1 retrieval of SG-LPR on multiple KITTI sequences along the trajectory. Red: true positives; black: false negatives; blue: true negatives.
Figure 7. Qualitative performance of the auxiliary semantic segmentation task in SG-LPR. (a) The original BEV image generated from the 3D LiDAR point cloud; (b) the ground-truth semantic map constructed from Semantic-KITTI [46]; (c) the predicted semantic map produced by SG-LPR, guided by the auxiliary semantic segmentation task during training.
Figure 8. Feature heatmap comparison of SG-LPR outputs with and without the semantic segmentation auxiliary task. SwinUnetVLAD is the SG-LPR variant without the auxiliary branch. The heatmaps illustrate differences in feature activation patterns, highlighting the influence of the auxiliary task on the model's ability to capture regions with high-level semantic features.
Figure 9. Ablation study on the number of semantic categories used for training the SG-LPR model.
22 pages, 7431 KiB  
Article
EDH-STNet: An Evaporation Duct Height Spatiotemporal Prediction Model Based on Swin-Unet Integrating Multiple Environmental Information Sources
by Hanjie Ji, Lixin Guo, Jinpeng Zhang, Yiwen Wei, Xiangming Guo and Yusheng Zhang
Remote Sens. 2024, 16(22), 4227; https://doi.org/10.3390/rs16224227 - 13 Nov 2024
Viewed by 512
Abstract
Given the significant spatial non-uniformity of marine evaporation ducts, accurately predicting the regional distribution of evaporation duct height (EDH) is crucial for ensuring the stable operation of radio systems. While machine-learning-based EDH prediction models have been extensively developed, they fail to provide the EDH distribution over large-scale regions in practical applications. To address this limitation, we have developed a novel spatiotemporal prediction model for EDH that integrates multiple environmental information sources, termed the EDH Spatiotemporal Network (EDH-STNet). This model is based on the Swin-Unet architecture, employing an Encoder–Decoder framework that utilizes consecutive Swin-Transformers. This design effectively captures complex spatial correlations and temporal characteristics. The EDH-STNet model also incorporates nonlinear relationships between various hydrometeorological parameters (HMPs) and EDH. In contrast to existing models, it introduces multiple HMPs to enhance these relationships. By adopting a data-driven approach that integrates these HMPs as prior information, the accuracy and reliability of spatiotemporal predictions are significantly improved. Comprehensive testing and evaluation demonstrate that the EDH-STNet model, which merges an advanced deep learning algorithm with multiple HMPs, yields accurate predictions of EDH for both immediate and future timeframes. This development offers a novel solution to ensure the stable operation of radio systems.
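Feeding multiple hydrometeorological parameters (HMPs) into the network as prior information can be pictured as stacking the gridded fields into input channels. The snippet below is only a sketch of that data-preparation step, with random placeholder grids and an assumed per-channel normalization:

```python
import numpy as np

# Hypothetical gridded hydrometeorological parameters (HMPs), each of shape (H, W):
# air temperature (AT), air pressure (AP), sea surface temperature (SST),
# wind speed (WS), and relative humidity (RH). Values here are random placeholders.
H, W = 128, 256
at, ap, sst, ws, rh = (np.random.rand(H, W).astype(np.float32) for _ in range(5))

# Stack the HMPs as input channels and normalize each channel independently,
# so the network receives a (C, H, W) tensor per time step.
x = np.stack([at, ap, sst, ws, rh], axis=0)
x = (x - x.mean(axis=(1, 2), keepdims=True)) / (x.std(axis=(1, 2), keepdims=True) + 1e-6)
print(x.shape)  # (5, 128, 256)
```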
(This article belongs to the Section Atmospheric Remote Sensing)
Figure 1. M-profile obtained by the NPS model.
Figure 2. Spatial distributions of (a) AT, (b) AP, (c) SST, (d) WS, and (e) RH in June 2023.
Figure 3. Spatial distributions of calculated EDH in June 2023.
Figure 4. Schematic of the EDH-STNet model.
Figure 5. Partial EDH distribution in (a) Test2022 and prediction results for the models: (b) Unet, (c) Swin-Transformer, (d) Swin-Unet, (e) SwinUnet-5, and (f) EDH-STNet.
Figure 6. Partial EDH distribution in (a) Test2023 and prediction results for the models: (b) Unet, (c) Swin-Transformer, (d) Swin-Unet, (e) SwinUnet-5, and (f) EDH-STNet.
Figure 7. Partial absolute prediction errors of the (a) Unet, (b) Swin-Transformer, (c) Swin-Unet, (d) SwinUnet-5, and (e) EDH-STNet models on Test2022.
Figure 8. Partial absolute prediction errors of the (a) Unet, (b) Swin-Transformer, (c) Swin-Unet, (d) SwinUnet-5, and (e) EDH-STNet models on Test2023.
Figure 9. Predictions of all models for measured EDH.
18 pages, 4997 KiB  
Article
Robotic Grasping Detection Algorithm Based on 3D Vision Dual-Stream Encoding Strategy
by Minglin Lei, Pandong Wang, Hua Lei, Jieyun Ma, Wei Wu and Yongtao Hao
Electronics 2024, 13(22), 4432; https://doi.org/10.3390/electronics13224432 - 12 Nov 2024
Viewed by 420
Abstract
The automatic generation of stable robotic grasping postures is crucial for the application of computer vision algorithms in real-world settings. This task becomes especially challenging in complex environments, where accurately identifying the geometric shapes and spatial relationships between objects is essential. To enhance the capture of object pose information in 3D visual scenes, we propose a planar robotic grasping detection algorithm named SU-Grasp, which simultaneously focuses on local regions and long-distance relationships. Built upon a U-shaped network, SU-Grasp introduces a novel dual-stream encoding strategy using the Swin Transformer combined with spatial semantic enhancement. Compared to existing baseline methods, our algorithm achieves superior performance across public datasets, simulation tests, and real-world scenarios, highlighting its robust understanding of complex spatial environments.
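A dual-stream encoding strategy with cross-modal fusion can be sketched as two parallel encoders whose features are concatenated and fused. The toy module below uses plain convolutional stems as stand-ins for the Swin Transformer stages, so it illustrates the wiring rather than SU-Grasp itself; the channel sizes and the depth/normal-angle channel split are assumptions:

```python
import torch
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Sketch of a dual-stream encoder: one branch for the RGB image, one for the
    depth/normal-vector-angle image, fused along the channel dimension.
    Generic conv blocks stand in for the Swin Transformer stages used in the paper."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_branch = stem(3)       # color modality
        self.spatial_branch = stem(4)   # assumed: depth (1) + normal vector angle image (3)
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)  # cross-modal fusion

    def forward(self, rgb, spatial):
        f = torch.cat([self.rgb_branch(rgb), self.spatial_branch(spatial)], dim=1)
        return self.fuse(f)

enc = DualStreamEncoder()
out = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 4, 224, 224))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```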
(This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications)
Figure 1. Structural diagram of SU-Grasp. We encode the color modality (RGB image) and the spatial modality (depth image and normal vector angle image, abbreviated as DI image) separately, followed by cross-modal fusion. The decoder outputs the grasp confidence and grasp parameters for each pixel. Red boxes in the figure indicate the top three results with the highest confidence.
Figure 2. Example of each channel in the normal vector angle image.
Figure 3. Overview of the SU-Grasp framework. First, a normal vector angle image is derived from the depth image. A dual-stream encoder based on the Swin Transformer is then employed to separately extract multi-scale features for color and spatial information. Finally, a U-shaped structure decodes these features into three pixel-level prediction heatmaps for the grasping parameters, specifically representing grasp confidence, grasp width, and rotation angle.
Figure 4. (a) A simplified diagram of the original Swin-Unet. (b) The modified structure featuring a dual-stream cross-modal fusion encoder. This modification can be understood as halving the first part of the original encoder.
Figure 5. Training loss variation curves of three models (SU-Grasp, Swin-Unet, and GRCNN) on the Cornell and Jacquard datasets.
Figure 6. Visualization of SU-Grasp's prediction results and output heatmaps. The blue boxes in (a) represent the predicted poses with the highest success probability. The heatmaps in (b) represent the predicted values of success probability, grasp width, and rotation angle.
Figure 7. Changes in total loss values in the ablation experiment of the normal vector angle image.
Figure 8. Changes in total loss values in the ablation experiment of the normal vector angle image.
Figure 9. Changes in total loss values in the ablation experiment of the dual-stream encoder structure.
Figure 10. Simulation test results of SU-Grasp.
Figure 11. Illustration of grasping tests in real-world scenarios.
15 pages, 4396 KiB  
Article
Breast Cancer Classification Using Fine-Tuned SWIN Transformer Model on Mammographic Images
by Oluwatosin Tanimola, Olamilekan Shobayo, Olusogo Popoola and Obinna Okoyeigbo
Analytics 2024, 3(4), 461-475; https://doi.org/10.3390/analytics3040026 - 11 Nov 2024
Viewed by 546
Abstract
Breast cancer is the most prevalent cancer among women and has become one of the foremost causes of death among women globally. Early detection plays a significant role in administering personalized treatment and improving patient outcomes. Mammography is often used to detect early-stage cancer cells, but while valuable, it has limitations, including the potential for false positives and negatives, patient discomfort, and radiation exposure. More accurate techniques for detecting breast cancer are therefore needed, which has led to exploring the potential of machine learning for classifying diagnostic images due to its efficiency and accuracy. This study conducted a comparative analysis of pre-trained CNNs (ResNet50 and VGG16) and vision transformers (ViT-base and the SWIN transformer), together with a ViT-base model trained from scratch, to classify mammographic breast cancer images into benign and malignant cases. The SWIN transformer exhibits superior performance, with 99.9% accuracy and a precision of 99.8%. These findings demonstrate the efficiency of deep learning in accurately classifying mammographic breast cancer images for the diagnosis of breast cancer, leading to improvements in patient outcomes.
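Fine-tuning a pretrained SWIN transformer for the benign/malignant task boils down to swapping the classification head and training on the mammogram dataset. A minimal sketch using torchvision's Swin-T weights as a stand-in (the paper's exact backbone variant, input size, and layer-freezing policy are not specified here and are assumptions) could be:

```python
import torch
import torch.nn as nn
from torchvision import models

# torchvision's Swin-T is used as a stand-in for the paper's SWIN backbone.
model = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
model.head = nn.Linear(model.head.in_features, 2)   # benign vs. malignant

# Optionally freeze the backbone and fine-tune only the new head at first (an assumption).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

logits = model(torch.randn(4, 3, 224, 224))   # (batch, 2) class scores
print(logits.shape)
```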
Figure 1. SWIN transformer architecture with equation [29].
Figure 2. Experimental development framework.
Figure 3. Plot of class distribution of training dataset.
Figure 4. Plot of class distribution of training dataset.
Figure 5. Random samples of mammogram image dataset.
Figure 6. Performance plot of training loss and accuracy for ViT-base pre-trained model.
Figure 7. Performance plot of training loss and accuracy for SWIN transformer pre-trained model.
Figure 8. Performance plot of training loss and accuracy for ResNet50 pre-trained model.
Figure 9. Performance plot of training loss and accuracy for VGG16 pre-trained model.
Figure 10. Confusion matrix of VGG16.
Figure 11. Confusion matrix of ResNet50.
Figure 12. Confusion matrix of ViT-base.
Figure 13. Confusion matrix of SWIN transformer.
22 pages, 12107 KiB  
Article
Deep Learning-Based Classification of Macrofungi: Comparative Analysis of Advanced Models for Accurate Fungi Identification
by Sifa Ozsari, Eda Kumru, Fatih Ekinci, Ilgaz Akata, Mehmet Serdar Guzel, Koray Acici, Eray Ozcan and Tunc Asuroglu
Sensors 2024, 24(22), 7189; https://doi.org/10.3390/s24227189 - 9 Nov 2024
Viewed by 630
Abstract
This study focuses on the classification of six macrofungi species using advanced deep learning techniques. Fungi species such as Amanita pantherina, Boletus edulis, Cantharellus cibarius, Lactarius deliciosus, Pleurotus ostreatus, and Tricholoma terreum were chosen based on their ecological importance and distinct morphological characteristics. The research employed 5 machine learning techniques and 12 deep learning models, including DenseNet121, MobileNetV2, ConvNeXt, EfficientNet, and Swin Transformers, to evaluate their performance in identifying fungi from images. The DenseNet121 model demonstrated the highest accuracy (92%) and AUC score (95%), making it the most effective at distinguishing between species. The study also revealed that Transformer-based models, particularly the Swin Transformer, were less effective, suggesting room for improvement in their application to this task. Further advancements in macrofungi classification could be achieved by expanding datasets, incorporating additional data types such as biochemical, electron microscopy, and RNA/DNA sequence data, and using ensemble methods to enhance model performance. The findings contribute valuable insights into both the use of deep learning for biodiversity research and the ecological conservation of macrofungi species.
Figure 1. Overview of datasets utilized for training AI algorithms, presented from a macroscopic perspective.
Figure 2. Validation accuracy.
Figure 3. ROC curve.
Figure 4. Images without Grad-CAM visualization.
Figure 5. ConvNeXt Grad-CAM visualization.
Figure 6. EfficientNet Grad-CAM visualization.
Figure 7. DenseNet121, InceptionV3, and InceptionResNetV2 Grad-CAM visualization.
Figure 8. MobileNetV2, ResNet152, and Xception Grad-CAM visualization.
Figure 9. Different levels of Gaussian white noise [40].
Figure 10. DenseNet121 and MobileNetV2 Grad-CAM visualization on SNR-10 noisy images.
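Figures 9 and 10 refer to robustness tests with Gaussian white noise at different SNR levels. A small sketch of adding zero-mean Gaussian noise at a target SNR, assuming the signal power is estimated from the image itself, is shown below:

```python
import numpy as np

def add_gaussian_noise_snr(image: np.ndarray, snr_db: float) -> np.ndarray:
    """Add zero-mean Gaussian white noise so the result has the requested SNR in dB.
    `image` is a float array; signal power is estimated from the image itself
    (an assumption about how the noise level was defined)."""
    signal_power = np.mean(image.astype(np.float64) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=image.shape)
    return image + noise

img = np.random.rand(224, 224, 3)                 # placeholder image in [0, 1]
noisy = add_gaussian_noise_snr(img, snr_db=10)    # e.g., the SNR-10 setting in Figure 10
```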
12 pages, 1905 KiB  
Article
An Algorithmic Study of Transformer-Based Road Scene Segmentation in Autonomous Driving
by Hao Cui and Juyang Lei
World Electr. Veh. J. 2024, 15(11), 516; https://doi.org/10.3390/wevj15110516 - 8 Nov 2024
Viewed by 481
Abstract
Applications such as autonomous driving require high-precision semantic image segmentation to identify and understand the content of each pixel in an image. Compared with traditional deep convolutional neural networks, the Transformer model is based on pure attention mechanisms, without convolutional or recurrent neural network layers. In this paper, we propose a new network structure called SwinLab, which is an improvement upon the Swin Transformer. Experimental results demonstrate that the improved SwinLab model achieves segmentation accuracy comparable to that of deep convolutional neural network models in applications such as autonomous driving, with an mIoU of 77.61. Comparative experiments on the Cityscapes dataset further validate the effectiveness and generalization of this structure. In conclusion, by refining the Swin Transformer, this paper simplifies the model structure, improves training and inference speed, and maintains high accuracy, providing a more reliable semantic image segmentation solution for applications such as autonomous driving.
Figure 1. Label map in P mode.
Figure 2. Overall model structure of the network.
Figure 3. Diagram of the encoder structure.
Figure 4. Diagram of the MLP structure.
Figure 5. Diagram of the decoder structure.
Figure 6. Pascal VOC2012 dataset.
Figure 7. Cityscapes dataset.
24 pages, 10567 KiB  
Article
Dual-Modal Fusion PRI-SWT Model for Eddy Current Detection of Cracks, Delamination, and Impact Damage in Carbon Fiber-Reinforced Plastic Materials
by Rongyan Wen, Chongcong Tao, Hongli Ji and Jinhao Qiu
Appl. Sci. 2024, 14(22), 10282; https://doi.org/10.3390/app142210282 - 8 Nov 2024
Viewed by 564
Abstract
Carbon fiber-reinforced plastic (CFRP) composites are prone to damage during both manufacturing and operation, making the classification and identification of defects critical for maintaining structural integrity. This paper presents a novel dual-modal feature classification approach for the eddy current detection of CFRP defects, utilizing a Parallel Real–Imaginary/Swin Transformer (PRI-SWT) model. Built on the Transformer architecture, the PRI-SWT model effectively integrates the real and imaginary components of sinusoidal voltage signals, demonstrating a significant performance improvement over classification methods such as the Support Vector Machine (SVM) and Vision Transformer (ViT). The proposed model achieved a classification accuracy exceeding 95%, highlighting its superior capability in addressing the complexities of defect detection. Furthermore, the influence of key factors, including the real–imaginary fusion layer, the number of layers, the window shift size, and the model's scale, on the classification performance of the PRI-SWT model was systematically evaluated.
Figure 1. Eddy current testing (ECT) results for the detection of cracks, delamination, and low-velocity impact damage in carbon fiber-reinforced polymer (CFRP) materials, obtained using a nine-array ECT probe. The sinusoidal voltage signal generated by the output coil of the probe is processed through a lock-in amplifier, yielding two output signals; subsequent data processing of these signals produces amplitude and phase images of the scanning process. (a) Crack defect; (b) impact defect; (c) delamination defect.
Figure 2. Basic framework of multi-modal learning, where data from various modalities, such as images, text, and speech, are encoded into tokens and input into the multi-modal learning model. The output tokens are then converted back into their respective natural data types.
Figure 3. One specific type of cross-modal alignment is explicit expanded modal alignment. The figure shows a schematic diagram illustrating the correspondence between the image modality and the text data modality.
Figure 4. Different types of multi-modal fusion. (a) Data-level fusion; (b) feature-level fusion; (c) decision-level fusion.
Figure 5. SVM processing of linearly non-separable datasets, with the high-dimensional hyperplane projected as a curve in the original space.
Figure 6. The basic framework of ViT, obtaining the final classification result through steps such as linear projection and multi-head attention computation. The figure also illustrates the CFRP eddy current nondestructive testing (NDT) system, which primarily consists of key components such as a function signal generator, an eddy current probe, a lock-in amplifier, a low-voltage DC power supply, a motion controller, a data acquisition card, and an industrial control computer. Additionally, the figure shows several CFRP test specimens with defects, including CFRP samples with delamination and crack defects. The upper-left figure shows the Chinese software interface of the CFRP eddy current testing system.
Figure 7. The mathematical computation process used for self-attention. Here, X represents the input image tensor, derived by segmenting the input image into patches and concatenating these patches along the feature channel dimension; the width and height dimensions of the three-dimensional feature tensor are flattened into one dimension, yielding a feature matrix X of size N × d_in. After a positional encoding matrix of the same dimensions is added, the query (Q), key (K), and value (V) matrices are computed. The query and key matrices give the self-attention score matrix, which is multiplied by the value matrix to produce the output feature map. The dimensions of Q, K, and V are typically N × d_s (and can be set to N × d_in depending on the application); the linear mapping matrices W^Q, W^K, and W^V are all of size d_in × d_s; the attention score matrix has dimensions N × N, so multiplying it with the value matrix (N × d_s) yields an output feature map of size N × d_s. If d_s = d_in, the output feature map's dimensions match those of the input feature map.
Figure 8. The left side presents a schematic of the PRI-SWT model, detailing its structures and operational processes, including the separation layer, the patch embedding layer, the patch merging layer, the Swin Transformer module, and the fusion layer, together with the dimensions of the feature map tensors at the outputs of the different modules. As the layer depth increases, the resolution of the feature maps gradually decreases while the number of feature channels steadily increases. The right side illustrates the internal details of the Swin Transformer module, which comprises the window multi-head self-attention module, the shifted-window multi-head self-attention module, the normalization layer, and the linear mapping MLP layer.
Figure 9. The working mechanism of the Sliding Window Transformer. The image is segmented into numbered windows based on varying offset values and window sizes. These windows are then shifted and padded horizontally and vertically, forming structured regions for self-attention computation.
Figure 10. Detailed confusion matrices for several classification models, together with the ROC curves for the typical ViT and PRI-SWT models; the AUC values for the classification of the three CFRP defect types are indicated in the ROC curve images. Labels 0, 1, and 2 correspond to crack, delamination, and impact damage defects, respectively. The results demonstrate the relative superiority of the PRI-SWT model.
Figure 11. Eddy current testing results for three types of CFRP defect: (a) crack of small size; (b) crack of medium size; (c) low-velocity impact of large size. The three defect types vary in shape and scale.
Figure 12. Accuracy curves of the ViT and PRI-SWT models for the first 50 training epochs.
Figure 13. Model accuracy curves and confusion matrix when using different fusion layers. (a) Confusion matrix; (b) training curves. The performance of the linear fusion layer surpasses that of the CNN fusion layer; however, increasing the number of fusion layers tends to decrease the model's overall performance.
Figure 14. Model accuracy curves and confusion matrix obtained using different shift values. (a) Confusion matrix; (b) accuracy curves. A smaller window size results in slightly higher final classification performance compared to the other cases.
Figure 15. Model accuracy curves and confusion matrix using different window sizes. (a) Confusion matrix; (b) training curves.
Figure 16. Model confusion matrix using different depths of PRI-SWT. Increasing the network depth can enhance the performance of the PRI-SWT model; however, it also leads to higher computational costs, requiring a balance between the two.
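The Figure 7 caption describes the self-attention computation in prose. Written out, with the conventional scaling by the square root of the key dimension (an assumption, since the caption does not state the scaling factor), the computation is:

```latex
\[
Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V},
\qquad W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_{\mathrm{in}} \times d_{s}},
\]
\[
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{s}}}\right) V
  \in \mathbb{R}^{N \times d_{s}}.
\]
```

With d_s = d_in, the output has the same shape as the input feature matrix X, matching the caption's remark.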
17 pages, 5121 KiB  
Article
Study on the Evolutionary Characteristics of Post-Fire Forest Recovery Using Unmanned Aerial Vehicle Imagery and Deep Learning: A Case Study of Jinyun Mountain in Chongqing, China
by Deli Zhu and Peiji Yang
Sustainability 2024, 16(22), 9717; https://doi.org/10.3390/su16229717 - 7 Nov 2024
Viewed by 491
Abstract
Forest fires pose a significant threat to forest ecosystems, with severe impacts on both the environment and human society. Understanding post-fire forest recovery processes is crucial for developing strategies for species diversity conservation and ecological restoration and for preventing further damage. The present study proposes the EAswin-Mask2former model, a deep-learning semantic segmentation approach applied to visible-band imagery, to better monitor the evolution of burned areas in forests after fires. This model improves upon the classical semantic segmentation model Mask2former and is better adapted to the complex environment of burned forest areas. It employs the Swin Transformer as the backbone for feature extraction, which is particularly advantageous for processing high-resolution images. It also includes the Contextual Transformer (CoT) Block to better capture contextual information and incorporates the Efficient Multi-Scale Attention (EMA) Block into the Efficiently Adaptive (EA) Block to enhance the model's ability to learn key features and long-range dependencies. The experimental results demonstrate that the EAswin-Mask2former model achieves a mean Intersection-over-Union (mIoU) of 76.35% in segmenting complex forest burn areas across different seasons, improvements of 3.26 and 0.58 percentage points over the Mask2former models using ResNet and Swin Transformer backbones, respectively. Moreover, this method surpasses the DeepLabV3+ and Segformer models by 4.04 and 1.75 percentage points, respectively. The proposed model offers excellent segmentation performance for both forest and burned areas and can effectively track the evolution of burned forests when combined with unmanned aerial vehicle (UAV) remote sensing images.
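Tracking the evolution of a burn area from the segmentation output reduces to counting class pixels per image and converting them to areas. The sketch below assumes hypothetical class indices and a 0.5 m ground sampling distance; both are placeholders, not values from the study:

```python
import numpy as np

# Hypothetical class indices in the predicted segmentation mask (not from the paper).
FOREST, BURNED = 1, 2

def area_fractions(mask: np.ndarray, pixel_area_m2: float = 1.0):
    """Turn an (H, W) class mask into burned/forest areas and proportions,
    the quantities tracked over time in the recovery analysis."""
    total = mask.size
    burned_px = int((mask == BURNED).sum())
    forest_px = int((mask == FOREST).sum())
    return {
        "burned_area_m2": burned_px * pixel_area_m2,
        "burned_fraction": burned_px / total,
        "forest_fraction": forest_px / total,
    }

mask = np.random.randint(0, 3, size=(512, 512))   # placeholder prediction
print(area_fractions(mask, pixel_area_m2=0.25))   # assumed 0.5 m ground sampling distance
```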
(This article belongs to the Section Sustainable Forestry)
Figure 1. Location of the study area. Sources: https://www.google.com.hk/maps/, accessed on 13 October 2024; http://www.bigemap.com/, accessed on 13 October 2024.
Figure 2. Image data of the same forest area.
Figure 3. Overall architecture of Mask2former. The Pixel Decoder obtains the outputs of all stages in the feature extraction network and converts them into pixel-level prediction results, obtaining output features with multiple sizes. The largest output feature is used to calculate the mask, while the smaller output features are used as inputs to the Transformer Decoder.
Figure 4. (a) Swin Transformer network architecture; (b) Swin Transformer Block structure. The right figure shows two Swin Transformer Blocks connected in series; in the network architecture, this structure appears in pairs, with at least two blocks grouped together. W-MSA denotes window multi-head self-attention, and SW-MSA denotes shifted-window multi-head self-attention.
Figure 5. (a) Window Multi-Head Self-Attention (W-MSA) and (b) Shifted Window Multi-Head Self-Attention (SW-MSA).
Figure 6. The approximate calculation process of the adaptive EA Block. A "+" inside a circle indicates that the inputs to that node are added together; a "*" inside a circle indicates that the inputs are multiplied together.
Figure 7. Structure of EAswin-Mask2former.
Figure 8. Some segmentation results of EAswin-Mask2former and other models.
Figure 9. Comparison of mIoU between EAswin-Mask2former and other models. DLV3+, SEG, R-M, and S-M represent DeepLabV3+, Segformer, Resnet50-Mask2former, and Swin-Mask2former, respectively.
Figure 10. Satellite remote sensing images of the forest area from 2022 to 2024.
Figure 11. Unmanned aerial vehicle images of Region A at different times and their segmentation results. The corresponding shooting times from top to bottom are October 2022, March 2023, March 2023, and February 2024.
Figure 12. The trend over time of the burned and damaged area and the proportion of forest area in Region A.
22 pages, 5584 KiB  
Article
Enhanced Magnetic Resonance Imaging-Based Brain Tumor Classification with a Hybrid Swin Transformer and ResNet50V2 Model
by Abeer Fayez Al Bataineh, Khalid M. O. Nahar, Hayel Khafajeh, Ghassan Samara, Raed Alazaidah, Ahmad Nasayreh, Ayah Bashkami, Hasan Gharaibeh and Waed Dawaghreh
Appl. Sci. 2024, 14(22), 10154; https://doi.org/10.3390/app142210154 - 6 Nov 2024
Viewed by 535
Abstract
Brain tumors can be serious; consequently, rapid and accurate detection is crucial. Nevertheless, a variety of obstacles, such as poor imaging resolution, doubts over the accuracy of data, a lack of diverse tumor classes and stages, and the possibility of misinterpretation, present challenges to achieving an accurate and final diagnosis. Effective brain cancer detection is crucial for patients' safety and health, and deep learning systems can assist radiologists in quickly and accurately detecting diagnoses. This study presents an innovative deep learning approach that integrates the Swin Transformer with the pretrained deep learning model ResNet50V2, called SwT+Resnet50V2. The objective of this combination is to decrease memory utilization, enhance classification accuracy, and reduce training complexity. The self-attention mechanism of the Swin Transformer identifies distant relationships and captures the overall context, while ResNet50V2 improves both accuracy and training speed by extracting adaptive features from the Swin Transformer's dependencies. We evaluate the proposed framework using two publicly accessible brain magnetic resonance imaging (MRI) datasets, containing two and four classes, respectively. Data augmentation and transfer learning techniques enhance model performance, leading to more dependable and cost-effective training. The suggested model achieves an impressive accuracy of 99.9% on the binary-labeled dataset and 96.8% on the four-labeled dataset, outperforming the VGG16, MobileNetV2, ResNet50V2, EfficientNetV2B3, ConvNeXtTiny, and convolutional neural network (CNN) algorithms used for comparison. This demonstrates that the Swin Transformer, when combined with ResNet50V2, is capable of accurately diagnosing brain tumors, and the SwT+Resnet50V2 combination yields an innovative diagnostic tool with which radiologists can accelerate and improve the detection of brain tumors, leading to improved patient outcomes and reduced risks.
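One plausible reading of the SwT+Resnet50V2 hybrid is a late fusion of the two backbones' pooled features followed by a small classification head. The sketch below is an assumption about that wiring, with torchvision's resnet50 standing in for ResNet50V2 and arbitrary head sizes; it is not the authors' implementation:

```python
import torch
import torch.nn as nn
from torchvision import models

class SwinResNetHybrid(nn.Module):
    """Sketch of one way to combine Swin Transformer and ResNet features for MRI
    classification: run both backbones, concatenate their pooled embeddings, and
    classify with a small head. torchvision's resnet50 stands in for ResNet50V2."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.swin = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
        swin_dim = self.swin.head.in_features
        self.swin.head = nn.Identity()            # expose 768-d pooled features

        self.resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        res_dim = self.resnet.fc.in_features
        self.resnet.fc = nn.Identity()            # expose 2048-d pooled features

        # Head sizes and dropout are arbitrary choices for illustration.
        self.classifier = nn.Sequential(
            nn.Linear(swin_dim + res_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes))

    def forward(self, x):
        feats = torch.cat([self.swin(x), self.resnet(x)], dim=1)
        return self.classifier(feats)

model = SwinResNetHybrid(num_classes=4)
print(model(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 4])
```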
(This article belongs to the Special Issue Advances in Bioinformatics and Biomedical Engineering)
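As a concrete illustration of how such a hybrid could be wired, the sketch below concatenates pooled global features from a pretrained Swin-T and a pretrained ResNet backbone and feeds them to a small classification head in PyTorch/torchvision. This is a minimal sketch, not the authors' SwT+Resnet50V2 implementation: torchvision does not ship ResNet50V2, so plain ResNet50 stands in for it, and the fusion head, dropout rate, and binary class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, resnet50, Swin_T_Weights, ResNet50_Weights

class HybridSwinResNet(nn.Module):
    """Toy hybrid classifier: pooled Swin-T and ResNet features are concatenated
    and classified. The paper's actual SwT+Resnet50V2 fusion may differ."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Pretrained backbones with their original classification heads removed.
        self.swin = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
        swin_dim = self.swin.head.in_features            # 768 for Swin-T
        self.swin.head = nn.Identity()

        self.resnet = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        res_dim = self.resnet.fc.in_features              # 2048 for ResNet50
        self.resnet.fc = nn.Identity()

        # Small fusion head over the concatenated global feature vectors.
        self.classifier = nn.Sequential(
            nn.Linear(swin_dim + res_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.swin(x), self.resnet(x)], dim=1)
        return self.classifier(feats)

# Binary tumor / no-tumor setup, as for the two-class dataset described in the abstract.
model = HybridSwinResNet(num_classes=2)
logits = model(torch.randn(4, 3, 224, 224))               # -> shape (4, 2)
```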
Show Figures

Figure 1: Workflow diagram of the proposed brain tumor detection method.
Figure 2: Architecture of Swin Transformer.
Figure 3: Architecture of ResNet50V2.
Figure 4: Instances of the types of brain tumor in MRI images.
Figure 5: Performance evaluation on the Bra35H dataset.
Figure 6: Training and validation metrics (accuracy and loss) for (SwT+Resnet50V2) on the Bra35H dataset.
Figure 7: Comparison of confusion matrices for all models using the Bra35H dataset.
Figure 8: Performance evaluation on the Kaggle dataset.
Figure 9: Training and validation metrics (accuracy and loss) for (SwT+Resnet50V2) on the Kaggle dataset.
Figure 10: Comparison of confusion matrices for all models using the Kaggle dataset.
23 pages, 5919 KiB  
Article
Research on Soybean Seedling Stage Recognition Based on Swin Transformer
by Kai Ma, Jinkai Qiu, Ye Kang, Liqiang Qi, Wei Zhang, Song Wang and Xiuying Xu
Agronomy 2024, 14(11), 2614; https://doi.org/10.3390/agronomy14112614 - 6 Nov 2024
Viewed by 642
Abstract
Accurate identification of the second and third compound leaf stages of soybean seedlings is a prerequisite for ensuring that post-seedling chemical weeding is carried out at the optimal application time. However, accurate identification of the soybean seedling stage is affected by natural light and complex field backgrounds. A transfer learning-based Swin-T (Swin Transformer) network is proposed to recognize the different growth stages of soybean seedlings. A drone was used to collect images of soybeans at the true leaf stage, the first compound leaf stage, the second compound leaf stage, and the third compound leaf stage, and data augmentation methods such as image rotation and brightness enhancement were used to expand the dataset, simulate image collection at different shooting angles and under different weather conditions, and enhance the adaptability of the model. Because the field environment and shooting equipment directly affect the quality of the captured images, the Gaussian blur method was used to blur the test-set images to different degrees in order to test the anti-interference ability of the different models. The Swin-T model was optimized by introducing transfer learning combined with hyperparameter combination and optimizer selection experiments. The performance of the optimized Swin-T model was compared with the MobileNetV2, ResNet50, AlexNet, GoogleNet, and VGG16Net models. The results show that the optimized Swin-T model has an average accuracy of 98.38% on the test set, an improvement of 11.25%, 12.62%, 10.75%, 1.00%, and 0.63% over the MobileNetV2, ResNet50, AlexNet, GoogleNet, and VGG16Net models, respectively. The optimized Swin-T model also performs best in terms of recall and F1 score. In the motion blur degradation test, the maximum degradation accuracy, overall degradation index, and average degradation index of the optimized Swin-T model were 87.77%, 6.54%, and 2.18%, respectively; the maximum degradation accuracy was 7.02%, 7.48%, 10.15%, 3.56%, and 2.5% higher than that of the MobileNetV2, ResNet50, AlexNet, GoogleNet, and VGG16Net models, respectively. In the Gaussian blur degradation test, the maximum degradation accuracy, overall degradation index, and average degradation index of the optimized Swin-T model were 94.3%, 3.85%, and 1.285%, respectively; the maximum degradation accuracy was 12.13%, 15.98%, 16.7%, 2.2%, and 1.5% higher than that of the MobileNetV2, ResNet50, AlexNet, GoogleNet, and VGG16Net models, respectively. Taking the various degradation indicators into account, the optimized Swin-T model maintains high recognition accuracy and good anti-interference ability even when the input images are blurred by shooting interference. It can recognize the different growth stages of soybean seedlings in complex environments, providing a basis for post-seedling chemical weed control during the second and third compound leaf stages of soybeans. Full article
(This article belongs to the Section Precision and Digital Agriculture)
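A rough sketch of the two ingredients described above, transfer learning on a pretrained Swin-T and a blur-based robustness sweep, is given below in PyTorch. It is illustrative only: the authors' training pipeline, transforms, and degradation indices are not reproduced here, and the `make_test_loader` helper is a hypothetical stand-in for building a DataLoader over the test images.

```python
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.models import swin_t, Swin_T_Weights
from PIL import ImageFilter

# Transfer learning: reuse ImageNet weights, replace the head with a 4-class
# classifier (true leaf, first/second/third compound leaf stages).
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
model.head = nn.Linear(model.head.in_features, 4)

def blur_transform(radius: int) -> transforms.Compose:
    """Preprocessing that first Gaussian-blurs each PIL image with the given radius."""
    return transforms.Compose([
        transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius))),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

@torch.no_grad()
def accuracy(net: nn.Module, loader) -> float:
    net.eval()
    correct = total = 0
    for images, labels in loader:
        preds = net(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Hypothetical degradation sweep over blur radii r = 0 (clean) to r = 4,
# mirroring the radii mentioned in the abstract. `make_test_loader(t)` is an
# assumed helper that wraps the test images in a DataLoader using transform t.
# baseline = accuracy(model, make_test_loader(blur_transform(0)))
# for r in range(1, 5):
#     acc = accuracy(model, make_test_loader(blur_transform(r)))
#     print(f"radius={r}: accuracy={acc:.4f}, drop={baseline - acc:.4f}")
```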
Show Figures

Figure 1: Partially acquired visible spectral images of soybean seedlings. (a) Image of the true leaf period of soybeans. (b) Image of the first compound leaf stage of soybeans. (c) Image of the second compound leaf stage of soybeans. (d) Image of the third compound leaf stage of soybeans.
Figure 2: Image samples of soybean seedlings at different growth stages. (a) Sample of the true leaf stage of soybeans. (b) Sample of the first compound leaf stage of soybeans. (c) Sample of the second compound leaf stage of soybeans. (d) Sample of the third compound leaf stage of soybeans.
Figure 3: Example of data augmentation for the second compound leaf period of soybeans. (a) Original figure. (b) HSV data enhancement. (c) Rotated image. (d) Contrast enhancement. (e) Brightness adjustment.
Figure 4: Processing effects of different blur radii r: (a) original figure; (b) r = 1; (c) r = 2; (d) r = 3; and (e) r = 4.
Figure 5: Motion blur processing effects at different blur levels: (a) original figure; (b) f = 1; (c) f = 2; (d) f = 3; and (e) f = 4.
Figure 6: General framework diagram of the Swin-T network.
Figure 7: The implementation process of the patch merging layer in the Swin-T architecture.
Figure 8: Swin-T Block structure in the Swin-T architecture.
Figure 9: MSA and W-MSA window partitioning mechanisms. (a) MSA. (b) W-MSA.
Figure 10: Description of the shifted window process. (a) SW-MSA. (b) Cyclic shift. (c) Masked MSA. (d) Reverse cyclic shift.
Figure 11: Comparison of the accuracy and loss values of different optimizers. (a) Optimizer accuracy. (b) Optimizer loss.
Figure 12: Comparison of accuracy and loss values on the training set for different models. (a) Training set accuracy. (b) Training set loss.
Figure 13: Confusion matrices of the recognition results from different classification models. (a) MobileNetV2; (b) ResNet50; (c) AlexNet; (d) GoogleNet; (e) VGG16Net; and (f) optimized Swin-T.
Figure 14: The influence of different levels of motion blur on the classification accuracy of different models.
Figure 15: Classification accuracy of different models under different blur radii.
Figure 16: Comparison of the recognition results of the Swin-T model trained on different data. (a) Recognition results of the Swin-T model trained without data augmentation. (b) Recognition results of the Swin-T model trained with data augmentation.
Figure 17: Comparison of the heat maps of different models at different soybean seedling stages.
20 pages, 3531 KiB  
Article
Sea Surface Temperature Prediction Using ConvLSTM-Based Model with Deformable Attention
by Benyun Shi, Conghui Ge, Hongwang Lin, Yanpeng Xu, Qi Tan, Yue Peng and Hailun He
Remote Sens. 2024, 16(22), 4126; https://doi.org/10.3390/rs16224126 - 5 Nov 2024
Viewed by 614
Abstract
Sea surface temperature (SST) prediction has received increasing attention in recent years due to its paramount importance in various fields of oceanography. Existing studies have shown that neural networks are particularly effective in making accurate SST predictions by efficiently capturing spatiotemporal dependencies in SST data. Among various models, the ConvLSTM framework is notably prominent. This model skillfully combines convolutional neural networks (CNNs) with recurrent neural networks (RNNs), enabling it to simultaneously capture spatiotemporal dependencies within a single computational framework. To overcome the limitation that CNNs primarily capture local spatial information, in this paper we propose a novel model named DatLSTM that integrates a deformable attention transformer (DAT) module into the ConvLSTM framework, thereby enhancing its ability to process more complex spatial relationships effectively. Specifically, the DAT module adaptively focuses on salient features in space, while ConvLSTM further captures the temporal dependencies of spatial correlations in the SST data. In this way, DatLSTM can adaptively capture complex spatiotemporal dependencies between the preceding and current states within ConvLSTM. To evaluate the performance of the DatLSTM model, we conducted short-term SST forecasts in the Bohai Sea region with forecast lead times ranging from 1 to 10 days and compared its efficacy against several benchmark models, including ConvLSTM, PredRNN, TCTN, and SwinLSTM. Our experimental results show that the proposed model outperforms all of these models on multiple evaluation metrics for short-term SST prediction. The proposed model offers a new predictive learning method for improving the accuracy of spatiotemporal predictions in various domains, including meteorology, oceanography, and climate science. Full article
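Since DatLSTM builds on the ConvLSTM framework, a minimal PyTorch sketch of a generic ConvLSTM cell is shown below: the LSTM gates are computed with 2-D convolutions over the concatenated input and hidden state, so spatial structure is preserved across time steps, and predictions can be fed back autoregressively as in the warm-up/prediction procedure described for this model. This is the standard ConvLSTM formulation, not the paper's DatLSTM; the deformable attention transformer module that DatLSTM inserts is omitted, and the channel sizes, grid size, and the `to_frame` output projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: LSTM gates computed with 2-D convolutions."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state                                      # hidden and cell states
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)
        h_next = o * torch.tanh(c_next)
        return h_next, c_next

# Warm-up on observed SST frames, then autoregressive prediction.
cell = ConvLSTMCell(in_channels=1, hidden_channels=32)
to_frame = nn.Conv2d(32, 1, kernel_size=1)                # hidden state -> SST frame
obs = torch.randn(8, 5, 1, 64, 64)                        # (batch, time, C, H, W)
h = torch.zeros(8, 32, 64, 64)
c = torch.zeros(8, 32, 64, 64)
for t in range(obs.shape[1]):                             # warm-up stage: feed observations
    h, c = cell(obs[:, t], (h, c))
pred = to_frame(h)                                        # first predicted frame
preds = [pred]
for _ in range(2):                                        # prediction stage: feed outputs back
    h, c = cell(pred, (h, c))
    pred = to_frame(h)
    preds.append(pred)
forecast = torch.stack(preds, dim=1)                      # (8, 3, 1, 64, 64)
```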
Show Figures

Figure 1: The procedure of spatiotemporal SST forecasting using the ConvLSTM-based model with deformable attention (DatLSTM). The procedure consists of two stages: the warm-up stage and the prediction stage. During the warm-up stage, the sequence of SST observations is taken as input to the model, while during the prediction stage the output of the model at the previous time step is used as the input to the model at the current time step.
Figure 2: Detailed structure of the DatLSTM cell; DAT denotes the deformable attention transformer module and LP denotes linear projection.
Figure 3: Illustration of the deformable attention module. In the left part, a set of reference points (four colored points for illustration) is uniformly distributed across the study area, with their offsets learned from the queries through the offset network (shown in the right part). Subsequently, the deformed keys and values are projected from the sampled positions based on these deformed points. Additionally, a relative position bias is calculated using the deformed points to enhance the multihead attention mechanism, which then outputs the transformed features. Figure adapted from Xia et al. [34]. (A simplified code sketch of this mechanism follows the figure list below.)
Figure 4: SST forecast snapshots from 28 March 2013 based on different neural network models. The lead time ranges from 1 to 10 days. The color bar represents the temperature values.
Figure 5: Spatial distribution of R² averaged over all testing samples for different lead times with respect to the benchmark models, namely, ConvLSTM, PredRNN, TCTN, SwinLSTM, and the proposed DatLSTM.
Figure 6: Spatial distribution of MAE (units: °C) averaged over all testing samples for different lead times with respect to the benchmark models, namely, ConvLSTM, PredRNN, TCTN, SwinLSTM, and the proposed DatLSTM.
Figure 7: Spatial distribution of RMSE (units: °C) averaged over all testing samples for different lead times with respect to the benchmark models, namely, ConvLSTM, PredRNN, TCTN, SwinLSTM, and the proposed DatLSTM.
Figure 8: Effective lead time according to RMSE ≤ 1.0 °C and R² ≥ 0.986. The numbers shown in the figure indicate the number of lead-time days for which DatLSTM can accurately make predictions.
Figure 9: Two-dimensional scatter plot of the observed and predicted SSTs in all grid cells over all testing samples with lead times of 1 day, 5 days, and 10 days. The colors in the plot indicate the number of data points in each bin. The root mean square error (RMSE; units: °C) and coefficient of determination (R²) between the predicted and observed values are shown in the upper left corner.
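As referenced in the Figure 3 caption above, a heavily simplified, single-head sketch of deformable attention is given below in PyTorch: a small offset network deforms a uniform grid of reference points, keys and values are sampled at the deformed locations with `grid_sample`, and ordinary attention follows. It omits the relative position bias, the multi-head split, and the exact offset scaling of the DAT module described by Xia et al., so it should be read as an assumption-laden illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head sketch of deformable attention: an offset network deforms a
    uniform grid of reference points, keys/values are sampled at the deformed
    locations, and standard attention follows. Relative position bias and the
    multi-head split of the original DAT module are omitted."""

    def __init__(self, dim: int, n_ref: int = 8):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_kv = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        self.offset_net = nn.Sequential(                  # offsets predicted from queries
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv2d(dim, 2, kernel_size=1), nn.Tanh(),  # offsets bounded in [-1, 1]
        )
        self.n_ref = n_ref
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.to_q(x)                                              # (B, C, H, W)
        # Uniform grid of n_ref x n_ref reference points in [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, self.n_ref, device=x.device)
        xs = torch.linspace(-1, 1, self.n_ref, device=x.device)
        ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).flip(-1)
        # Offsets learned from the queries, pooled to the reference resolution.
        off = F.adaptive_avg_pool2d(self.offset_net(q), self.n_ref)   # (B, 2, n, n)
        pos = ref.unsqueeze(0) + off.permute(0, 2, 3, 1) / self.n_ref
        # Deformed keys and values sampled at the shifted reference points.
        k, v = F.grid_sample(self.to_kv(x), pos, align_corners=True).chunk(2, dim=1)
        q = q.flatten(2).transpose(1, 2)                              # (B, HW, C)
        attn = torch.softmax(q @ k.flatten(2) * self.scale, dim=-1)   # (B, HW, n*n)
        out = attn @ v.flatten(2).transpose(1, 2)                     # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)

# One pass over a toy feature map.
deform_attn = SimpleDeformableAttention(dim=32)
y = deform_attn(torch.randn(2, 32, 16, 16))                           # -> (2, 32, 16, 16)
```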