Search Results (56)

Search Parameters:
Keywords = Swin UNet

22 pages, 7431 KiB  
Article
EDH-STNet: An Evaporation Duct Height Spatiotemporal Prediction Model Based on Swin-Unet Integrating Multiple Environmental Information Sources
by Hanjie Ji, Lixin Guo, Jinpeng Zhang, Yiwen Wei, Xiangming Guo and Yusheng Zhang
Remote Sens. 2024, 16(22), 4227; https://doi.org/10.3390/rs16224227 - 13 Nov 2024
Viewed by 382
Abstract
Given the significant spatial non-uniformity of marine evaporation ducts, accurately predicting the regional distribution of evaporation duct height (EDH) is crucial for ensuring the stable operation of radio systems. While machine-learning-based EDH prediction models have been extensively developed, they fail to provide the EDH distribution over large-scale regions in practical applications. To address this limitation, we have developed a novel spatiotemporal prediction model for EDH that integrates multiple environmental information sources, termed the EDH Spatiotemporal Network (EDH-STNet). This model is based on the Swin-Unet architecture, employing an Encoder–Decoder framework that utilizes consecutive Swin-Transformers. This design effectively captures complex spatial correlations and temporal characteristics. The EDH-STNet model also incorporates nonlinear relationships between various hydrometeorological parameters (HMPs) and EDH. In contrast to existing models, it introduces multiple HMPs to enhance these relationships. By adopting a data-driven approach that integrates these HMPs as prior information, the accuracy and reliability of spatiotemporal predictions are significantly improved. Comprehensive testing and evaluation demonstrate that the EDH-STNet model, which merges an advanced deep learning algorithm with multiple HMPs, yields accurate predictions of EDH for both immediate and future timeframes. This development offers a novel solution to ensure the stable operation of radio systems. Full article
(This article belongs to the Section Atmospheric Remote Sensing)
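The abstract describes feeding several hydrometeorological parameters (HMPs) alongside EDH into a U-shaped encoder–decoder. As a rough illustration of that input arrangement only, the sketch below stacks five HMP grids plus an EDH grid as channels of a tiny U-shaped network; the layer sizes, channel counts, and the plain-convolution blocks are placeholders, not the Swin-Transformer blocks or configuration of the actual EDH-STNet.

```python
# Hypothetical sketch: stacking hydrometeorological parameter (HMP) grids as input
# channels of a U-shaped encoder-decoder; sizes are illustrative, not the authors' setup.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=6, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 2, stride=2), nn.ReLU())
        self.head = nn.Conv2d(base * 2, 1, 1)  # 1-channel predicted EDH map

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2)
        return self.head(torch.cat([d1, e1], dim=1))  # skip connection

# Five HMP fields (AT, AP, SST, WS, RH) plus the current EDH field as channels.
hmp_stack = torch.randn(1, 6, 64, 64)
pred_edh = TinyUNet()(hmp_stack)
print(pred_edh.shape)  # torch.Size([1, 1, 64, 64])
```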
Figures: (1) M-profile obtained by the NPS model; (2) spatial distributions of AT, AP, SST, WS, and RH in June 2023; (3) spatial distribution of calculated EDH in June 2023; (4) schematic of the EDH-STNet model; (5, 6) partial EDH distributions in Test2022 and Test2023 with prediction results for the Unet, Swin-Transformer, Swin-Unet, SwinUnet-5, and EDH-STNet models; (7, 8) partial absolute prediction errors of these models on Test2022 and Test2023; (9) predictions of all models for measured EDH.
20 pages, 23966 KiB  
Article
FCSwinU: Fourier Convolutions and Swin Transformer UNet for Hyperspectral and Multispectral Image Fusion
by Rumei Li, Liyan Zhang, Zun Wang and Xiaojuan Li
Sensors 2024, 24(21), 7023; https://doi.org/10.3390/s24217023 - 31 Oct 2024
Viewed by 496
Abstract
The fusion of low-resolution hyperspectral images (LR-HSI) with high-resolution multispectral images (HR-MSI) provides a cost-effective approach to obtaining high-resolution hyperspectral images (HR-HSI). Existing methods primarily based on convolutional neural networks (CNNs) struggle to capture global features and do not adequately address the significant scale and spectral resolution differences between LR-HSI and HR-MSI. To tackle these challenges, our novel FCSwinU network leverages the spectral fast Fourier convolution (SFFC) module for spectral feature extraction and utilizes the Swin Transformer’s self-attention mechanism for multi-scale global feature fusion. FCSwinU employs a UNet-like encoder–decoder framework to effectively merge spatiospectral features. The encoder integrates the Swin Transformer feature abstraction module (SwinTFAM) to encode pixel correlations and perform multi-scale transformations, facilitating the adaptive fusion of hyperspectral and multispectral data. The decoder then employs the Swin Transformer feature reconstruction module (SwinTFRM) to reconstruct the fused features, restoring the original image dimensions and ensuring the precise recovery of spatial and spectral details. Experimental results from three benchmark datasets and a real-world dataset robustly validate the superior performance of our method in both visual representation and quantitative assessment compared to existing fusion methods. Full article
(This article belongs to the Section Remote Sensors)
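The SFFC module is described as a spectral fast Fourier convolution for spectral feature extraction. The sketch below shows one generic way such a Fourier-domain operation can be written with torch.fft: transform the feature map, mix channels with a learnable complex weight, and transform back. The weight shape, normalization, and channel counts are illustrative assumptions, not the published SFFC design.

```python
# Hypothetical sketch of a Fourier-domain convolution: frequency transform, learnable
# complex channel mixing, inverse transform. Channel counts are illustrative.
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one complex weight per channel pair, stored as real/imag parts
        self.w = nn.Parameter(torch.randn(channels, channels, 2) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        xf = torch.fft.rfft2(x, norm="ortho")            # (b, c, h, w//2+1), complex
        weight = torch.view_as_complex(self.w)           # (c, c), complex
        yf = torch.einsum("bchw,co->bohw", xf, weight)   # channel mixing in frequency space
        return torch.fft.irfft2(yf, s=(h, w), norm="ortho")

feat = torch.randn(1, 8, 32, 32)
print(SpectralConv2d(8)(feat).shape)  # torch.Size([1, 8, 32, 32])
```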
Figures: (1) overall architecture of the proposed FCSwinU network; (2) architecture of the SFFC module; (3) Swin Transformer feature abstraction and reconstruction modules; (4, 5) fusion results for the "fake and real apple" and "watercolors" images from the CAVE dataset, with false-color composites, SAM maps, and error images; (6, 7) fusion results for Images 1 and 2 from the WDCM test set; (8, 9) fusion results for Images 1 and 2 from the PU test set; (10) average PSNR comparison of fusion algorithms across spectral bands on the CAVE, WDCM, and PU datasets; (11) qualitative result on the ZY-1E dataset, with MSI and HSI composites reconstructed by seven comparison methods; (12) spectral contrast of different objects in the ZY-1E dataset.
21 pages, 4755 KiB  
Article
MIMO-Uformer: A Transformer-Based Image Deblurring Network for Vehicle Surveillance Scenarios
by Jian Zhang, Baoping Cheng, Tengying Zhang, Yongsheng Zhao, Tao Fu, Zijian Wu and Xiaoming Tao
J. Imaging 2024, 10(11), 274; https://doi.org/10.3390/jimaging10110274 - 31 Oct 2024
Viewed by 487
Abstract
Motion blur is a common problem in the field of surveillance scenarios, and it obstructs the acquisition of valuable information. Thanks to the success of deep learning, a sequence of CNN-based architecture has been designed for image deblurring and has made great progress. As another type of neural network, transformers have exhibited powerful deep representation learning and impressive performance based on high-level vision tasks. Transformer-based networks leverage self-attention to capture the long-range dependencies in the data, yet the computational complexity is quadratic to the spatial resolution, which makes transformers infeasible for the restoration of high-resolution images. In this article, we propose an efficient transformer-based deblurring network, named MIMO-Uformer, for vehicle-surveillance scenarios. The distinct feature of the MIMO-Uformer is that the basic-window-based multi-head self-attention (W-MSA) of the Swin transformer is employed to reduce the computational complexity and then incorporated into a multi-input and multi-output U-shaped network (MIMO-UNet). The performance can benefit from the operation of multi-scale images by MIMO-UNet. However, most deblurring networks are designed for global blur, while local blur is more common under vehicle-surveillance scenarios since the motion blur is primarily caused by local moving vehicles. Based on this observation, we further propose an Intersection over Patch (IoP) factor and a supervised morphological loss to improve the performance based on local blur. Extensive experiments on a public and a self-established dataset are carried out to verify the effectiveness. As a result, the deblurring behavior based on PSNR is improved at least 0.21 dB based on GOPRO and 0.74 dB based on the self-established datasets compared to the existing benchmarks. Full article
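The abstract attributes the efficiency gain to window-based multi-head self-attention (W-MSA), where attention is computed inside non-overlapping windows instead of over the whole image. A minimal sketch of that idea is given below, with an assumed 8 × 8 window and nn.MultiheadAttention standing in for the Swin attention block (no relative position bias or shifted windows).

```python
# Hypothetical sketch of window-based multi-head self-attention (W-MSA): split the
# feature map into non-overlapping windows and run self-attention inside each window,
# so the cost scales with window size rather than full image resolution.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

ws, c = 8, 32
feat = torch.randn(2, 64, 64, c)               # (B, H, W, C)
windows = window_partition(feat, ws)           # (128, 64, 32)
attn = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)
out, _ = attn(windows, windows, windows)       # attention restricted to each 8x8 window
print(out.shape)                               # torch.Size([128, 64, 32])
```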
Figures: (1) structures of coarse-to-fine networks (DeepDeblur, PSS-NSC, MIMO-UNet); (2) illustration of the Vision Transformer and Transformer encoder; (3) architecture of the proposed deblurring method for vehicle-surveillance scenarios; (4) regular and shifted local window partitions; (5) illustration of the cropped patches; (6) illustration of the Intersection over Patch (IoP); (7) examples of the binarized plates; (8) examples of the sharp and blurry image pairs; (9–11) deblurring results on GOPRO and the self-established dataset, comparing DeblurGan, DeepDeblur, MT-RNN, MPRNet, MIMO-UNet+, MIMO-Uformer, and MIMO-Uformer-Veh; (12, 13) ablation comparisons of MIMO-UNet+ and MIMO-Uformer variants, including license-plate examples.
19 pages, 6837 KiB  
Article
A Classification and Segmentation Model for Diamond Abrasive Grains Based on Improved Swin-Unet-SAM
by Yanfen Lin, Tinghao Fan and Congfu Fang
Electronics 2024, 13(21), 4213; https://doi.org/10.3390/electronics13214213 - 27 Oct 2024
Viewed by 521
Abstract
The detection of abrasive grain images in diamond tools serves as the foundation for assessing the overall condition of the tools, encompassing crucial aspects of diamond abrasive grains like the quantity, size, morphology, and distribution. Given the intricate background textures and reflective characteristics exhibited by diamond images, diamond detection and segmentation pose a significant challenge. Recently, numerous defect detection methods based on machine learning and deep learning have emerged. However, several issues persist, such as detection accuracy and the interference caused by intricate background textures. The present work demonstrates an efficient classification and segmentation network algorithm that combines Swin-Unet with SAM (Segment Anything Model) to alleviate the existing problems. Specifically, four embedding structures were devised to bridge the two models for iterative training. The transformer blocks within the Swin-Unet model were enhanced to facilitate classification and coarse segmentation, and the mask structure in SAM was refined to enable fine segmentation. The experimental results show that under a small sample dataset with complex background textures, the average index values of ACC (accuracy), SE (Sensitivity), and DSC (Dice Similarity Coefficient) for the classification and segmentation of diamond abrasive grains reached 98.7%, 92.5%, and 85.9%, respectively. Compared with the model before improvement, its ACC, SE and DSC increased by 1.2%, 15.9%, and 7.6%, respectively. The test results, based on four different datasets, consistently indicated that this model has excellent segmentation performance and robustness and has great application potential in the industrial field. Full article
(This article belongs to the Special Issue New Insights in 2D and 3D Object Detection and Semantic Segmentation)
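The reported ACC, SE, and DSC values are standard binary-mask metrics. A small sketch of how they are commonly computed is shown below; the 0.5 threshold and the epsilon smoothing are assumptions for illustration rather than details taken from the paper.

```python
# Hypothetical sketch of the reported metrics on binary masks: accuracy (ACC),
# sensitivity (SE), and Dice similarity coefficient (DSC).
import numpy as np

def acc_se_dsc(pred, gt, eps=1e-7):
    pred, gt = (pred > 0.5).astype(np.float64), (gt > 0.5).astype(np.float64)
    tp = (pred * gt).sum()
    tn = ((1 - pred) * (1 - gt)).sum()
    fp = (pred * (1 - gt)).sum()
    fn = ((1 - pred) * gt).sum()
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    se = tp / (tp + fn + eps)                   # recall on the foreground class
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    return acc, se, dsc

pred = np.random.rand(128, 128)
gt = (np.random.rand(128, 128) > 0.5).astype(float)
print(acc_se_dsc(pred, gt))
```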
Figures: (1) the Swin-Unet network structure; (2) the SAM network structure; (3) architecture of the proposed Swin-Unet-SAM model; (4) the single Swin Transformer block in the bottleneck and the meaning of the number of attention heads; (5) the loop-iteration structure; (6) typical abrasive grain categories in the dataset (whole, micro-broken, macro-broken, fall-off); (7) typical annotation results for the different categories; (8) detection results of the proposed method compared with U-Net, TransUnet, Swin-Unet, and other Swin-series methods under complex backgrounds; (9) evaluation of the original Swin-Unet and the proposed variants in terms of ACC, SE, and DSC; (10) prompts of the points, box, and mask.
20 pages, 10555 KiB  
Article
Cloud Detection Using a UNet3+ Model with a Hybrid Swin Transformer and EfficientNet (UNet3+STE) for Very-High-Resolution Satellite Imagery
by Jaewan Choi, Doochun Seo, Jinha Jung, Youkyung Han, Jaehong Oh and Changno Lee
Remote Sens. 2024, 16(20), 3880; https://doi.org/10.3390/rs16203880 - 18 Oct 2024
Viewed by 591
Abstract
It is necessary to extract and recognize the cloud regions presented in imagery to generate satellite imagery as analysis-ready data (ARD). In this manuscript, we proposed a new deep learning model to detect cloud areas in very-high-resolution (VHR) satellite imagery by fusing two deep learning architectures. The proposed UNet3+ model with a hybrid Swin Transformer and EfficientNet (UNet3+STE) was based on the structure of UNet3+, with the encoder sequentially combining EfficientNet based on mobile inverted bottleneck convolution (MBConv) and the Swin Transformer. By sequentially utilizing convolutional neural networks (CNNs) and transformer layers, the proposed algorithm aimed to extract the local and global information of cloud regions effectively. In addition, the decoder used MBConv to restore the spatial information of the feature map extracted by the encoder and adopted the deep supervision strategy of UNet3+ to enhance the model’s performance. The proposed model was trained using the open dataset derived from KOMPSAT-3 and 3A satellite imagery and conducted a comparative evaluation with the state-of-the-art (SOTA) methods on fourteen test datasets at the product level. The experimental results confirmed that the proposed UNet3+STE model outperformed the SOTA methods and demonstrated the most stable precision, recall, and F1 score values with fewer parameters and lower complexity. Full article
(This article belongs to the Special Issue Deep Learning Techniques Applied in Remote Sensing)
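The encoder is described as combining EfficientNet-style mobile inverted bottleneck convolutions (MBConv) with Swin Transformer layers. The sketch below shows a generic MBConv block (1 × 1 expansion, depthwise 3 × 3 convolution, 1 × 1 projection, residual add); the expansion ratio, normalization, and activation are illustrative choices and omit squeeze-and-excitation and other EfficientNet details.

```python
# Hypothetical sketch of a mobile inverted bottleneck convolution (MBConv) block.
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection preserves spatial information

x = torch.randn(1, 32, 56, 56)
print(MBConv(32)(x).shape)  # torch.Size([1, 32, 56, 56])
```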
Figures: (1) examples of satellite images and labeled reference data in the training dataset (black: clear skies; red: thick and thin clouds; green: cloud shadows); (2) test datasets for evaluating the deep learning models; (3) architecture of UNet3+; (4) architecture of the proposed UNet3+STE model, where E = [E1, E2, E3, E4, E5] contains the feature maps of the encoder stages and D = [D1, D2, D3, D4] the feature maps of the decoder stages; (5) structure of the encoder part; (6) structures of the MBConvs in UNet3+STE; (7) structure of the Swin Transformer layer; (8) example structures for calculating D2 in the decoder; (9) deep supervision structures in the decoder; (10) precision, recall, and F1 scores for each class; (11–13) cloud detection results and 2000 × 2000 subset images produced for high-spatial-resolution images at the product level.
12 pages, 7654 KiB  
Article
Memorizing Swin-Transformer Denoising Network for Diffusion Model
by Jindou Chen and Yiqing Shen
Electronics 2024, 13(20), 4050; https://doi.org/10.3390/electronics13204050 - 15 Oct 2024
Viewed by 591
Abstract
Diffusion models have garnered significant attention in the field of image generation. However, existing denoising architectures, such as U-Net, face limitations in capturing the global context, while Vision Transformers (ViTs) may struggle with local receptive fields. To address these challenges, we propose a novel Swin-Transformer-based denoising network architecture that leverages the strengths of both U-Net and ViT. Moreover, our approach integrates the k-Nearest Neighbor (kNN) based memorizing attention module into the Swin-Transformer, enabling it to effectively harness crucial contextual information from feature maps and enhance its representational capacity. Finally, we introduce an innovative hierarchical time stream embedding scheme that optimizes the incorporation of temporal cues during the denoising process. This method surpasses basic approaches like simple addition or concatenation of fixed time embeddings, facilitating a more effective fusion of temporal information. Extensive experiments conducted on four benchmark datasets demonstrate the superior performance of our proposed model compared to U-Net and ViT as denoising networks. Our model outperforms baselines on the CRC-VAL-HE-7K and CelebA datasets, achieving improved FID scores of 14.39 and 4.96, respectively, and even surpassing DiT and UViT under our experiment setting. The Memorizing Swin-Transformer architecture, coupled with the hierarchical time stream embedding, sets a new state-of-the-art in denoising diffusion models for image generation. Full article
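The abstract's kNN-based memorizing attention retrieves relevant cached context for each query. The sketch below shows one generic form of that retrieval: each query picks its k most similar keys from a memory bank and attends only over those entries. Memory size, k, scaling, and dimensions are assumptions for illustration, not the module as published.

```python
# Hypothetical sketch of kNN-based memorizing attention over a cached memory bank.
import torch
import torch.nn.functional as F

def knn_memory_attention(q, mem_k, mem_v, k=8):
    # q: (N, D); mem_k, mem_v: (M, D)
    sims = q @ mem_k.t()                                    # (N, M) similarity scores
    topv, topi = sims.topk(k, dim=-1)                       # keep the k best matches per query
    weights = F.softmax(topv / q.shape[-1] ** 0.5, dim=-1)  # scaled softmax over retrieved keys
    retrieved = mem_v[topi]                                 # (N, k, D)
    return (weights.unsqueeze(-1) * retrieved).sum(dim=1)   # (N, D)

q = torch.randn(16, 64)
mem_k, mem_v = torch.randn(1024, 64), torch.randn(1024, 64)
print(knn_memory_attention(q, mem_k, mem_v).shape)  # torch.Size([16, 64])
```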
Figures: (1) overall architecture of the proposed Memorizing Denoising Swin-Transformer network, whose encoder, bottleneck, and decoder are built from Swin-Transformer blocks, with the kNN-based memorizing attention module and the hierarchical time stream embedding; (2) schematic of the hierarchical time stream embedding on one Swin-Transformer block; (3) representative generated images for the CIFAR-10, Orchid Flowers, CRC-VAL-HE, and CelebA datasets; (4) ablation study of the kNN-based memorizing attention block on convergence speed; (5) interpretability demonstration showing a generated image and the heatmap of one class of cached feature representation.
19 pages, 6172 KiB  
Article
MS-UNet: Multi-Scale Nested UNet for Medical Image Segmentation with Few Training Data Based on an ELoss and Adaptive Denoising Method
by Haoyuan Chen, Yufei Han, Linwei Yao, Xin Wu, Kuan Li and Jianping Yin
Mathematics 2024, 12(19), 2996; https://doi.org/10.3390/math12192996 - 26 Sep 2024
Viewed by 995
Abstract
Traditional U-shape segmentation models can achieve excellent performance with an elegant structure. However, the single-layer decoder structure of U-Net or SwinUnet is too “thin” to exploit enough information, resulting in large semantic differences between the encoder and decoder parts. Things get worse in the field of medical image processing, where annotated data are more difficult to obtain than other tasks. Based on this observation, we propose a U-like model named MS-UNet with a plug-and-play adaptive denoising module and ELoss for the medical image segmentation task in this study. Instead of the single-layer U-Net decoder structure used in Swin-UNet and TransUNet, we specifically designed a multi-scale nested decoder based on the Swin Transformer for U-Net. The proposed multi-scale nested decoder structure allows for the feature mapping between the decoder and encoder to be semantically closer, thus enabling the network to learn more detailed features. In addition, ELoss could improve the attention of the model to the segmentation edges, and the plug-and-play adaptive denoising module could prevent the model from learning the wrong features without losing detailed information. The experimental results show that MS-UNet could effectively improve network performance with more efficient feature learning capability and exhibit more advanced performance, especially in the extreme case with a small amount of training data. Furthermore, the proposed ELoss and denoising module not only significantly enhance the segmentation performance of MS-UNet but can also be applied individually to other models. Full article
(This article belongs to the Special Issue Machine-Learning-Based Process and Analysis of Medical Images)
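ELoss is described as increasing the model's attention to segmentation edges. The sketch below illustrates the general idea with an edge-weighted binary cross-entropy: an edge map derived from the ground-truth mask up-weights boundary pixels. The morphological edge extraction and the weighting factor are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of an edge-weighted loss: boundary pixels of the ground-truth mask
# receive a larger per-pixel weight in the binary cross-entropy.
import torch
import torch.nn.functional as F

def edge_weighted_bce(pred, target, edge_weight=2.0):
    # crude edge map: pixels whose 3x3 neighbourhood is not constant
    pooled = F.max_pool2d(target, 3, stride=1, padding=1)       # dilation
    eroded = -F.max_pool2d(-target, 3, stride=1, padding=1)     # erosion
    edges = (pooled - eroded).clamp(0, 1)
    weights = 1.0 + edge_weight * edges
    return F.binary_cross_entropy_with_logits(pred, target, weight=weights)

pred = torch.randn(1, 1, 64, 64)                     # logits
target = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(edge_weighted_bce(pred, target).item())
```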
Figures: (1) architecture of MS-UNet and the plug-and-play adaptive denoising (PAD) module, highlighting the pure-Transformer design, the multi-scale nested decoder, and the two trainable denoising modules of the PAD; (2) calculation of the edge loss score in ELoss, where the model prediction is multiplied element-wise with the generated edge ground truth; (3) architecture of the PAD module, which combines two noise-reduction maps with the input image to produce the denoised model input; (4, 5) visualizations of segmentation results on the Synapse multi-organ CT dataset for different models and for MS-UNet with and without ELoss and the PAD module; (6) DSC and HD scores for different training-data ratios of the Synapse, ACDC, and JSRT test datasets; (7) MS-UNet with different numbers and locations of the multi-scale nested decoder blocks.
25 pages, 17970 KiB  
Article
A New Subject-Sensitive Hashing Algorithm Based on Multi-PatchDrop and Swin-Unet for the Integrity Authentication of HRRS Image
by Kaimeng Ding, Yingying Wang, Chishe Wang and Ji Ma
ISPRS Int. J. Geo-Inf. 2024, 13(9), 336; https://doi.org/10.3390/ijgi13090336 - 21 Sep 2024
Viewed by 582
Abstract
Transformer-based subject-sensitive hashing algorithms exhibit good integrity authentication performance and have the potential to ensure the authenticity and convenience of high-resolution remote sensing (HRRS) images. However, the robustness of Transformer-based subject-sensitive hashing is still not ideal. In this paper, we propose a Multi-PatchDrop mechanism to improve the performance of Transformer-based subject-sensitive hashing. The Multi-PatchDrop mechanism determines different patch dropout values for different Transformer blocks in ViT models. On the basis of a Multi-PatchDrop, we propose an improved Swin-Unet for implementing subject-sensitive hashing. In this improved Swin-Unet, Multi-PatchDrop has been integrated, and each Swin Transformer block (except the first one) is preceded by a patch dropout layer. Experimental results demonstrate that the robustness of our proposed subject-sensitive hashing algorithm is not only stronger than that of the CNN-based algorithms but also stronger than that of Transformer-based algorithms. The tampering sensitivity is of the same intensity as the AGIM-net and M-net-based algorithms, stronger than other Transformer-based algorithms. Full article
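Multi-PatchDrop assigns different patch-dropout values to different Transformer blocks. The sketch below shows a generic patch-dropout layer that randomly removes a fraction of patch tokens during training; the per-block drop rates listed are invented for illustration and are not the values used in the paper.

```python
# Hypothetical sketch of a patch dropout layer: during training, a fraction of patch
# tokens is randomly removed before a transformer block.
import torch
import torch.nn as nn

class PatchDropout(nn.Module):
    def __init__(self, drop_rate):
        super().__init__()
        self.drop_rate = drop_rate

    def forward(self, tokens):                      # tokens: (B, N, C)
        if not self.training or self.drop_rate == 0:
            return tokens
        b, n, _ = tokens.shape
        keep = max(1, int(n * (1 - self.drop_rate)))
        idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :keep]
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

drops = [0.0, 0.05, 0.10, 0.15]                     # assumed per-block rates, for illustration
tokens = torch.randn(2, 196, 96)
layer = PatchDropout(drops[2])
layer.train()
print(layer(tokens).shape)                          # torch.Size([2, 176, 96])
```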
Figures: (1) instances of HRRS images being tampered with (originals and tampered versions); (2) architecture of the improved Swin-Unet based on Multi-PatchDrop; (3) the proposed subject-sensitive hashing algorithm; (4) integrity-authentication instances: JPEG compression, format conversion, watermark embedding, subject-unrelated tampering, subject-related tampering that adds or deletes a building, and random smear tampering; (5–10) robustness of each algorithm to JPEG compression, invisible watermarking, and 4-pixel modifications on the Inria and WHU datasets; (11–16) tampering-sensitivity tests for 24 × 24-pixel and 16 × 16-pixel subject-related tampering on both datasets, with examples; (17–22) tampering-sensitivity tests for deleted and added buildings on both datasets, with examples.
23 pages, 36403 KiB  
Article
DSC-Net: Enhancing Blind Road Semantic Segmentation with Visual Sensor Using a Dual-Branch Swin-CNN Architecture
by Ying Yuan, Yu Du, Yan Ma and Hejun Lv
Sensors 2024, 24(18), 6075; https://doi.org/10.3390/s24186075 - 20 Sep 2024
Viewed by 674
Abstract
In modern urban environments, visual sensors are crucial for enhancing the functionality of navigation systems, particularly for devices designed for visually impaired individuals. The high-resolution images captured by these sensors form the basis for understanding the surrounding environment and identifying key landmarks. However, the core challenge in the semantic segmentation of blind roads lies in the effective extraction of global context and edge features. Most existing methods rely on Convolutional Neural Networks (CNNs), whose inherent inductive biases limit their ability to capture global context and accurately detect discontinuous features such as gaps and obstructions in blind roads. To overcome these limitations, we introduce Dual-Branch Swin-CNN Net(DSC-Net), a new method that integrates the global modeling capabilities of the Swin-Transformer with the CNN-based U-Net architecture. This combination allows for the hierarchical extraction of both fine and coarse features. First, the Spatial Blending Module (SBM) mitigates blurring of target information caused by object occlusion to enhance accuracy. The hybrid attention module (HAM), embedded within the Inverted Residual Module (IRM), sharpens the detection of blind road boundaries, while the IRM improves the speed of network processing. In tests on a specialized dataset designed for blind road semantic segmentation in real-world scenarios, our method achieved an impressive mIoU of 97.72%. Additionally, it demonstrated exceptional performance on other public datasets. Full article
(This article belongs to the Section Sensing and Imaging)
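The hybrid attention module (HAM) is said to sharpen the detection of blind road boundaries by combining attention mechanisms. The sketch below shows a common channel-plus-spatial attention pattern (global pooling with an MLP for channel weights, a convolution over pooled maps for spatial weights) as one plausible reading; the reduction ratio, kernel size, and exact composition are assumptions rather than the published HAM.

```python
# Hypothetical sketch of a hybrid channel + spatial attention module.
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                 nn.Linear(ch // reduction, ch), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        ca = self.mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)              # channel weights
        x = x * ca
        sa = self.spatial(torch.cat([x.mean(1, keepdim=True),
                                     x.amax(1, keepdim=True)], dim=1))  # spatial weights
        return x * sa

feat = torch.randn(1, 64, 32, 32)
print(HybridAttention(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```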
Figures: (1) motivation: CNN-based methods handle detail but struggle with long-range dependencies, transformer-based methods blur edge information, while DSC-Net combines a CNN-based branch and a transformer-based branch to address both context and edge details; (2) overview of DSC-Net, an encoder–decoder with skip connections whose encoder contains a transformer-based global-context branch and a CNN-based detail branch; (3) structure of the Spatial Blending Module (SBM); (4) structure of the Inverted Residual Module (IRM); (5) structure of the hybrid attention module (HAM); (6–11) comparisons and enlarged views of semantic segmentation results from U-Net, Bisenetv1, SEM_FPN, Deeplabv3, Swin-Transformer, TransUnet, ViT, and DSC-Net on the Cityscapes, Blind Roads and Crosswalks, and Blind Roads datasets.
15 pages, 3249 KiB  
Article
The InterVision Framework: An Enhanced Fine-Tuning Deep Learning Strategy for Auto-Segmentation in Head and Neck
by Byongsu Choi, Chris J. Beltran, Sang Kyun Yoo, Na Hye Kwon, Jin Sung Kim and Justin Chunjoo Park
J. Pers. Med. 2024, 14(9), 979; https://doi.org/10.3390/jpm14090979 - 15 Sep 2024
Viewed by 521
Abstract
Adaptive radiotherapy (ART) workflows are increasingly adopted to achieve dose escalation and tissue sparing under dynamic anatomical conditions. However, recontouring and time constraints hinder the implementation of real-time ART workflows. Various auto-segmentation methods, including deformable image registration, atlas-based segmentation, and deep learning-based segmentation (DLS), have been developed to address these challenges. Despite the potential of DLS methods, clinical implementation remains difficult due to the need for large, high-quality datasets to ensure model generalizability. This study introduces an InterVision framework for segmentation. The InterVision framework can interpolate or create intermediate visuals between existing images to generate specific patient characteristics. The InterVision model is trained in two steps: (1) generating a general model using the dataset, and (2) tuning the general model using the dataset generated from the InterVision framework. The InterVision framework generates intermediate images between existing patient image slides using deformable vectors, effectively capturing unique patient characteristics. By creating a more comprehensive dataset that reflects these individual characteristics, the InterVision model demonstrates the ability to produce more accurate contours compared to general models. Models are evaluated using the volumetric dice similarity coefficient (VDSC) and the Hausdorff distance 95% (HD95%) for 18 structures in 20 test patients. As a result, the Dice score was 0.81 ± 0.05 for the general model, 0.82 ± 0.04 for the general fine-tuning model, and 0.85 ± 0.03 for the InterVision model. The Hausdorff distance was 3.06 ± 1.13 for the general model, 2.81 ± 0.77 for the general fine-tuning model, and 2.52 ± 0.50 for the InterVision model. The InterVision model showed the best performance compared to the general model. The InterVision framework presents a versatile approach adaptable to various tasks where prior information is accessible, such as in ART settings. This capability is particularly valuable for accurately predicting complex organs and targets that pose challenges for traditional deep learning algorithms. Full article
(This article belongs to the Section Methodology, Drug and Device Discovery)
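The InterVision framework generates intermediate images between existing slices using deformable vectors. The sketch below shows the warping half of that idea: given a dense displacement field between two adjacent slices, applying half of it with grid sampling yields an in-between image. The random displacement field here is purely illustrative; the framework obtains it from deformable registration.

```python
# Hypothetical sketch: warp a slice by half of a displacement field to obtain an
# intermediate image between two adjacent slices.
import torch
import torch.nn.functional as F

def warp_half(slice_a, disp):
    # slice_a: (1, 1, H, W); disp: (1, H, W, 2) displacement in normalized coordinates
    h, w = slice_a.shape[-2:]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)      # identity sampling grid
    return F.grid_sample(slice_a, base_grid + 0.5 * disp, align_corners=True)

slice_a = torch.rand(1, 1, 128, 128)
disp = 0.02 * torch.randn(1, 128, 128, 2)   # illustrative field; normally from registration
intermediate = warp_half(slice_a, disp)
print(intermediate.shape)  # torch.Size([1, 1, 128, 128])
```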
Figures: (1) the proposed InterVision framework: training of the general model on the original dataset, fine-tuning of the general model on one fraction of personalized patient data, and the workflow for generating the InterVision dataset; (2) conceptual representation of generating the InterVision dataset, where a deformable vector computed between adjacent slides is used to generate intermediate images, nearly doubling the personalized dataset; (3) calculation of deformation vectors from control points, with the degree of deformation applied to a voxel depending on its distance to the control point; (4) architecture of Swin-Unet, whose encoder, bottleneck, decoder, and skip connections are built from Swin Transformer blocks; (5) overview of the Swin Transformer block structure; (6) visual results for the optic chiasm, left cochlea, and left parotid from the general, general fine-tuning, and InterVision models compared with the manual contours.
13 pages, 5820 KiB  
Article
Optic Nerve Sheath Ultrasound Image Segmentation Based on CBC-YOLOv5s
by Yonghua Chu, Jinyang Xu, Chunshuang Wu, Jianping Ye, Jucheng Zhang, Lei Shen, Huaxia Wang and Yudong Yao
Electronics 2024, 13(18), 3595; https://doi.org/10.3390/electronics13183595 - 10 Sep 2024
Viewed by 512
Abstract
The diameter of the optic nerve sheath is an important indicator for assessing the intracranial pressure in critically ill patients. The methods for measuring the optic nerve sheath diameter are generally divided into invasive and non-invasive methods. Compared to the invasive methods, the non-invasive methods are safer and have thus gained popularity. Among the non-invasive methods, using deep learning to process the ultrasound images of the eyes of critically ill patients and promptly output the diameter of the optic nerve sheath offers significant advantages. This paper proposes a CBC-YOLOv5s optic nerve sheath ultrasound image segmentation method that integrates both local and global features. First, it introduces the CBC-Backbone feature extraction network, which consists of dual-layer C3 Swin-Transformer (C3STR) and dual-layer Bottleneck Transformer (BoT3) modules. The C3STR backbone’s multi-layer convolution and residual connections focus on the local features of the optic nerve sheath, while the Window Transformer Attention (WTA) mechanism in the C3STR module and the Multi-Head Self-Attention (MHSA) in the BoT3 module enhance the model’s understanding of the global features of the optic nerve sheath. The extracted local and global features are fully integrated in the Spatial Pyramid Pooling Fusion (SPPF) module. Additionally, the CBC-Neck feature pyramid is proposed, which includes a single-layer C3STR module and three-layer CReToNeXt (CRTN) module. During upsampling feature fusion, the C3STR module is used to enhance the local and global awareness of the fused features. During downsampling feature fusion, the CRTN module’s multi-level residual design helps the network to better capture the global features of the optic nerve sheath within the fused features. The introduction of these modules achieves the thorough integration of the local and global features, enabling the model to efficiently and accurately identify the optic nerve sheath boundaries, even when the ocular ultrasound images are blurry or the boundaries are unclear. The Z2HOSPITAL-5000 dataset collected from Zhejiang University Second Hospital was used for the experiments. Compared to the widely used YOLOv5s and U-Net algorithms, the proposed method shows improved performance on the blurry test set. Specifically, the proposed method achieves precision, recall, and Intersection over Union (IoU) values that are 4.1%, 2.1%, and 4.5% higher than those of YOLOv5s. When compared to U-Net, the precision, recall, and IoU are improved by 9.2%, 21%, and 19.7%, respectively. Full article
(This article belongs to the Special Issue Deep Learning-Based Object Detection/Classification)
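The comparison against YOLOv5s and U-Net is reported in terms of precision, recall, and IoU on the segmented optic nerve sheath masks. The sketch below computes those three quantities for binary masks; the 0.5 threshold and the epsilon smoothing are assumptions for illustration.

```python
# Hypothetical sketch of the reported segmentation metrics: precision, recall, and IoU.
import numpy as np

def precision_recall_iou(pred, gt, eps=1e-7):
    pred, gt = pred > 0.5, gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, iou

pred = np.random.rand(256, 256)
gt = np.random.rand(256, 256) > 0.5
print(precision_recall_iou(pred, gt))
```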
Figures: (1) the CBC-YOLOv5s optic nerve sheath segmentation algorithm; (2) the C3STR module; (3) the BoT3 module; (4) the CRTN module; (5, 6) visualization and segmentation examples of the different algorithms on normal and blurry images.
17 pages, 8056 KiB  
Article
A Multi-Organ Segmentation Network Based on Densely Connected RL-Unet
by Qirui Zhang, Bing Xu, Hu Liu, Yu Zhang and Zhiqiang Yu
Appl. Sci. 2024, 14(17), 7953; https://doi.org/10.3390/app14177953 - 6 Sep 2024
Viewed by 794
Abstract
The convolutional neural network (CNN) has been widely applied in medical image segmentation due to its outstanding nonlinear expression ability. However, applications of CNN are often limited by the receptive field, preventing it from modeling global dependencies. The recently proposed transformer architecture, which uses a self-attention mechanism to model global context relationships, has achieved promising results. Swin-Unet is a Unet-like simple transformer semantic segmentation network that combines the dominant feature of both the transformer and Unet. Even so, Swin-Unet has some limitations, such as only learning single-scale contextual features, and it lacks inductive bias and effective multi-scale feature selection for processing local information. To solve these problems, the Residual Local induction bias-Unet (RL-Unet) algorithm is proposed in this paper. First, the algorithm introduces a local induction bias module into the RLSwin-Transformer module and changes the multi-layer perceptron (MLP) into a residual multi-layer perceptron (Res-MLP) module to model local and remote dependencies more effectively and reduce feature loss. Second, a new densely connected double up-sampling module is designed, which can further integrate multi-scale features and improve the segmentation accuracy of the target region. Third, a novel loss function is proposed that can significantly enhance the performance of multiple scales segmentation and the segmentation results for small targets. Finally, experiments were conducted using four datasets: Synapse, BraTS2021, ACDC, and BUSI. The results show that the performance of RL-Unet is better than that of Unet, Swin-Unet, R2U-Net, Attention-Unet, and other algorithms. Compared with them, RL-Unet produces significantly a lower Hausdorff Distance at 95% threshold (HD95) and comparable Dice Similarity Coefficient (DSC) results. Additionally, it exhibits higher accuracy in segmenting small targets. Full article
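One of the changes described above replaces the Transformer MLP with a residual multi-layer perceptron (Res-MLP) to reduce feature loss. The sketch below shows a generic residual MLP block of that kind; the expansion ratio, normalization placement, and activation are illustrative assumptions, not the RL-Unet implementation.

```python
# Hypothetical sketch of a residual MLP block: normalize, expand, activate, project,
# and add the input back through a skip connection.
import torch
import torch.nn as nn

class ResMLP(nn.Module):
    def __init__(self, dim, expand=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * expand), nn.GELU(),
                                 nn.Linear(dim * expand, dim))

    def forward(self, tokens):              # tokens: (B, N, C)
        return tokens + self.mlp(self.norm(tokens))  # residual path reduces feature loss

tokens = torch.randn(2, 196, 96)
print(ResMLP(96)(tokens).shape)  # torch.Size([2, 196, 96])
```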
Show Figures

Figure 1. The general structure of Residual Local induction bias-Unet (RL-Unet).
Figure 2. Residual Local induction bias-Unet (RL-Unet) module structure.
Figure 3. Spatial and channel interaction components of semantic information between different stages. (a) The structure of the double up-sampling connection. (b) The structure of the downsampling connection.
Figure 4. Joint channel attention.
Figure 5. Joint spatial attention.
Figure 6. Visualization of segmentation results on the Synapse dataset. (a) Ground Truth; (b) RL-Unet; (c) Attention Unet; (d) U-Net; (e) Swin-Unet.
Figure 7. Visual comparison of the U-Net, Swin-Unet, R2U-Net, Attention Unet, and RL-Unet algorithms. (a) Average DSC scores obtained by training different models. (b) Average HD obtained by training different models. (c) Comparison of the loss curves of the different algorithms.
Figure 8. Visualization of segmentation results on the ACDC and BUSI datasets. (a) Image; (b) Ground Truth; (c) RL-Unet; (d) Attention Unet; (e) Swin-Unet; (f) U-Net.
26 pages, 6173 KiB  
Article
Enhancing Underwater Object Detection and Classification Using Advanced Imaging Techniques: A Novel Approach with Diffusion Models
by Prabhavathy Pachaiyappan, Gopinath Chidambaram, Abu Jahid and Mohammed H. Alsharif
Sustainability 2024, 16(17), 7488; https://doi.org/10.3390/su16177488 - 29 Aug 2024
Viewed by 1417
Abstract
Underwater object detection and classification pose significant challenges due to environmental factors such as water turbidity and variable lighting conditions. This research proposes a novel approach that integrates advanced imaging techniques with diffusion models to address these challenges effectively, aligning with Sustainable Development Goal (SDG) 14: Life Below Water. The methodology leverages the Convolutional Block Attention Module (CBAM), Modified Swin Transformer Block (MSTB), and a diffusion model to enhance the quality of underwater images, thereby improving the accuracy of object detection and classification tasks. This study utilizes the TrashCan dataset, comprising diverse underwater scenes and objects, to validate the proposed method's efficacy. It proposes an advanced-imaging-technique YOLO (you only look once) network (AIT-YOLOv7) for detecting objects in underwater images. This network uses a modified U-Net, which focuses on informative features through convolutional block channel and spatial attention for color correction, and a modified Swin Transformer block for resolution enhancement. A novel diffusion model, built on a modified U-Net with ResNet, learns the intricate structures in images containing underwater objects, which enhances detection capabilities under challenging visual conditions. Thus, the AIT-YOLOv7 net precisely detects and classifies the different classes of objects present in this dataset. These improvements are crucial for applications in marine ecology research, underwater archeology, and environmental monitoring, where precise identification of marine debris, biological organisms, and submerged artifacts is essential. The proposed framework advances underwater imaging technology and supports the sustainable management of marine resources and conservation efforts. The experimental results demonstrate that state-of-the-art object detection methods, namely SSD, YOLOv3, YOLOv4, and YOLOTrashCan, achieve mean average precisions (mAP@0.5) of 57.19%, 58.12%, 59.78%, and 65.01%, respectively, whereas the proposed AIT-YOLOv7 net reaches a mean average precision (mAP@0.5) of 81.4% on the TrashCan dataset, a 16.39% improvement. Through this improvement in the accuracy and efficiency of underwater object detection, this research contributes to broader marine science and technology efforts, promoting a better understanding and management of aquatic ecosystems and helping to prevent and reduce marine pollution, as emphasized in SDG 14. Full article
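A minimal sketch of the channel-plus-spatial attention idea (CBAM) that the pipeline applies ahead of color correction is shown below. The reduction ratio and the 7×7 spatial kernel are common defaults assumed for illustration rather than values taken from the paper.

```python
# Hedged sketch: CBAM-style channel attention followed by spatial attention.
# Defaults (reduction=16, 7x7 spatial kernel) are assumptions.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.channel_mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.channel_mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: 7x7 convolution over channel-wise average/max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(1, 64, 128, 128)
print(CBAM(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```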
Show Figures

Figure 1. Proposed system architecture.
Figure 2. Selection of images, highlighting the dataset's diversity.
Figure 3. Process flow of the CBAM color correction module.
Figure 4. Unprocessed images.
Figure 5. Output image after color correction of the input image.
Figure 6. U-Net architecture.
Figure 7. (a) Modified Swin Transformer Block (MSTB) for enhancing image resolution. (b) Process flow of the MSTB in underwater image resolution enhancement.
Figure 8. Enhanced images.
Figure 9. Process flow of the diffusion model.
Figure 10. Corrupted image.
Figure 11. Image generated using the diffusion model.
Figure 12. MS COCO object detection from [5].
Figure 13. Output image with underwater objects detected and classified using AIT-YOLOv7.
Figure 14. Precision, recall, and mAP metrics for the proposed AIT-YOLOv7.
Figure 15. (a) Input image from the dataset. (b) Convolutional block attention with Modified Swin Transformer Block. (c) Diffusion model. (d) Objects detected and classified with high precision.
26 pages, 11283 KiB  
Article
Infrared Image Super-Resolution Network Utilizing the Enhanced Transformer and U-Net
by Feng Huang, Yunxiang Li, Xiaojing Ye and Jing Wu
Sensors 2024, 24(14), 4686; https://doi.org/10.3390/s24144686 - 19 Jul 2024
Viewed by 925
Abstract
Infrared images hold significant value in applications such as remote sensing and fire safety. However, infrared detectors often face the problem of high hardware costs, which limits their widespread use. Advancements in deep learning have spurred innovative approaches to image super-resolution (SR), but comparatively few efforts have been dedicated to the exploration of infrared images. To address this, we design the Residual Swin Transformer and Average Pooling Block (RSTAB) and propose the SwinAIR, which can effectively extract and fuse the diverse frequency features in infrared images and achieve superior SR reconstruction performance. By further integrating SwinAIR with U-Net, we propose the SwinAIR-GAN for real infrared image SR reconstruction. SwinAIR-GAN extends the degradation space to better simulate the degradation process of real infrared images. Additionally, it incorporates spectral normalization, dropout, and artifact discrimination loss to reduce the potential image artifacts. Qualitative and quantitative evaluations on various datasets confirm the effectiveness of our proposed method in reconstructing realistic textures and details of infrared images. Full article
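The parallel high-/low-frequency split described for the RSTAB/STAL design can be sketched as follows. The high-pass construction (a feature map minus its average-pooled version) and the plain multi-head attention stand-in for the Swin branch are assumptions for illustration only, not the SwinAIR implementation.

```python
# Hedged sketch: split channels into a convolutional high-frequency branch and
# an attention-based low-frequency branch, then fuse with a residual 1x1 conv.
import torch
import torch.nn as nn

class ParallelHighLowBlock(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        half = channels // 2
        self.high = nn.Sequential(nn.Conv2d(half, half, 3, padding=1), nn.GELU())
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        f_h, f_l = torch.chunk(x, 2, dim=1)
        # high-frequency branch: residual detail left after average pooling
        f_h = self.high(f_h - self.pool(f_h)) + f_h
        # low-frequency branch: global self-attention over flattened tokens
        seq = f_l.flatten(2).transpose(1, 2)
        f_l = self.attn(seq, seq, seq)[0].transpose(1, 2).reshape(b, c // 2, h, w) + f_l
        return x + self.fuse(torch.cat([f_h, f_l], dim=1))  # residual fusion

x = torch.randn(1, 64, 48, 48)
print(ParallelHighLowBlock(64)(x).shape)  # torch.Size([1, 64, 48, 48])
```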
Show Figures

Figure 1. Structure of our proposed infrared image super-resolution model based on Swin Transformer and Average Pooling, SwinAIR. Firstly, the input LR infrared image I_LR passes through the shallow feature extraction module H_SF to extract the shallow feature map F_0; then, F_0 is deeply extracted and refined by the deep feature extraction module. We obtain F_n after n RSTABs. After F_n undergoes convolutional operations, it is combined with F_0 through a residual connection to obtain the deep feature map F_DF. Finally, F_DF is input into the upsampling module H_UP to generate the output SR infrared image I_SR. The interpolation operation here is bicubic.
Figure 2. Structure of the Swin Transformer and Average Pooling Layer (STAL). There are two ways to combine a CNN with the transformer: serial or parallel. In the serial approach, each layer can process either low- or high-frequency information, but not both. Therefore, to allow each layer to process both types of information simultaneously, a parallel structure with channel separation is used to integrate the CNN and the transformer. The feature map is first divided into F_h and F_l, which are then separately fed into the High-Frequency Feature Extraction Layer (HFEL) and the Swin Transformer Layer (STL). In the STL, SW-MSA and MLP represent the self-attention module based on the shifted-window mechanism and the multi-layer perceptron module, respectively.
Figure 3. SwinAIR-GAN method. The generator network produces the SR infrared feature map SR1 from the LR infrared image and applies an exponential moving average (EMA) to the parameters to obtain the infrared feature map SR2. Various loss functions are then computed for SR1, SR2, and the HR infrared image. Finally, the corresponding discriminator network judges the authenticity of the generated infrared image.
Figure 4. Structure of the discriminator network. We adopt a U-Net structure based on spectral normalization.
Figure 5. The degradation model. We expand the degradation space and consider first- and second-order degradation processes to emulate the real degradation of infrared images more accurately.
Figure 6. Our infrared data acquisition system. (a) System hardware. (b) Process of dataset construction. For SR tasks with custom degradation, images captured by the Iray camera are used as HR reference images and to calculate metrics. Due to differences between images captured by different cameras, we use no-reference metrics for the quantitative analysis of SR reconstruction quality in real SR tasks. The HR images captured by the Jinglin Chengdu cameras are stitched together to serve only as reference images for real SR tasks, assessing whether the SR reconstructed images are consistent with real textures and details.
Figure 7. Visual comparison results achieved by different methods on the CVC14 dataset [56] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic fence details.
Figure 8. Visual comparison results achieved by different methods on the Iray-384 dataset [58] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic fence details.
Figure 9. Visual comparison results achieved by different methods on the Flir dataset [57] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore closely spaced ground textures.
Figure 10. Visual comparison results achieved by different methods on the Iray-security dataset [61] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic vehicle contours.
Figure 11. Visual comparison results achieved by different methods on the Iray-ship dataset [59] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic ship contours.
Figure 12. Visual comparison results achieved by different methods on the Iray-aerial photography dataset [60] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic door frame lines.
Figure 13. Evaluation of the impact of different module configurations and parameter selections on the performance of SwinAIR, using PSNR (dB), SSIM, and the number of parameters (M) as performance metrics. In the figures, params denotes the number of parameters, and H and L denote the weights of the high-frequency and low-frequency channels, respectively. (a) The impact of the number of RSTABs. (b) The impact of the number of STALs within each RSTAB. (c) The impact of the number of attention heads in the STL. (d) The impact of different weights for the high-frequency and low-frequency channels.
Figure 14. Frequency domain analysis of different datasets. We perform the Fourier transform and spectrum centralization on the images in the datasets. The central part of the spectrogram represents low-frequency information, while the edges represent high-frequency information. The overall range of the spectrum distribution is similar across different datasets.
Figure 15. Visual comparison results achieved by different methods for real infrared images on the self-built dataset at the ×4 scale. GT represents the reference image in the red box of the leftmost image, obtained by stitching images from two HR cameras. We perform SR reconstruction on the corresponding real-scene images captured by the LR camera and observe the texture differences with the GT images to evaluate performance.
Figure 16. Visual comparison results achieved by different methods for real infrared images on the Iray-384 dataset [58] at the ×4 scale. LR represents the original image in the red box of the leftmost image.
Figure 17. Visual comparison results achieved by different methods for real infrared images on the ASL-TID dataset [57] at the ×4 scale. LR represents the original image in the red box of the leftmost image.
15 pages, 8677 KiB  
Article
Multi-Beam Sonar Target Segmentation Algorithm Based on BS-Unet
by Wennuo Zhang, Xuewu Zhang, Yu Zhang, Pengyuan Zeng, Ruikai Wei, Junsong Xu and Yang Chen
Electronics 2024, 13(14), 2841; https://doi.org/10.3390/electronics13142841 - 19 Jul 2024
Viewed by 708
Abstract
Multi-beam sonar imaging detection technology is increasingly becoming the mainstream technology in fields such as hydraulic safety inspection and underwater target detection due to its ability to generate clearer images under low-visibility conditions. However, during multi-beam sonar detection, issues such as low image resolution and blurred imaging edges reduce target segmentation accuracy, and traditional echo-signal filtering methods cannot effectively solve these problems. To address these challenges, this paper introduces, for the first time, a multi-beam sonar dataset built around simulated crack detection for dam safety. The dataset includes simulated cracks detected by multi-beam sonar from various angles, with crack widths ranging from 3 cm to 9 cm and lengths ranging from 0.2 m to 1.5 m. In addition, this paper proposes the BS-UNet semantic segmentation algorithm, which incorporates a dual-layer routing attention mechanism into the Swin-UNet model to enhance the accuracy of sonar image detail segmentation. Furthermore, an online convolutional reparameterization structure is added at the output end of the model to improve its capability to represent image features. Comparisons of the BS-UNet model with commonly used semantic segmentation models on the multi-beam sonar dataset consistently demonstrated its superior performance: it improved semantic segmentation metrics such as Precision and IoU by around 0.03 compared to the Swin-UNet model. In conclusion, BS-UNet can be effectively applied to multi-beam sonar image segmentation tasks. Full article
(This article belongs to the Special Issue AI Used in Mobile Communications and Networks)
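The convolutional reparameterization mentioned at the BS-UNet output can be illustrated with a generic RepVGG-style fusion, in which parallel 3×3 and 1×1 convolutions are algebraically merged into a single 3×3 convolution at inference time. This is a sketch of the general technique, not the authors' online variant; all names and sizes are assumptions.

```python
# Hedged sketch: fuse y = conv3(x) + conv1(x) into one 3x3 convolution by
# padding the 1x1 kernel to 3x3 and summing kernels and biases.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_parallel_convs(conv3: nn.Conv2d, conv1: nn.Conv2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, kernel_size=3, padding=1)
    with torch.no_grad():
        # center the 1x1 kernel inside a 3x3 kernel, then add kernels and biases
        fused.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))
        fused.bias.copy_(conv3.bias + conv1.bias)
    return fused

conv3 = nn.Conv2d(8, 8, kernel_size=3, padding=1)
conv1 = nn.Conv2d(8, 8, kernel_size=1)
x = torch.randn(1, 8, 16, 16)
fused = fuse_parallel_convs(conv3, conv1)
# the single fused convolution reproduces the two-branch output
print(torch.allclose(conv3(x) + conv1(x), fused(x), atol=1e-5))  # True
```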
Show Figures

Figure 1. Swin Transformer compared to ViT. (a) Swin Transformer's window partitioning; (b) ViT's window partitioning.
Figure 2. Swin-UNet structure.
Figure 3. BRA structure operation flow.
Figure 4. Online convolutional re-parameterization. (a) Convolutional input and output under normal circumstances; (b) convolutional input and output after module linearization; (c) convolutional input and output after module fusion.
Figure 5. The BS-UNet network structure, where the BRA attention mechanism sits between the Patch Embedding layer and the input Swin Transformer block, and convolutional reparameterization is applied at the final output end.
Figure 6. Experimental site setup and data collection process.
Figure 7. The horizontal and vertical beam opening angles and detection range of the M1200d.
Figure 8. Assembly diagram of the sonar and gimbal.
Figure 9. A selection of two-dimensional sonar images containing crack defects.
Figure 10. Ablation experiment visualization results. Green boxes mark the areas where cracks are located, and red circles indicate missing or excess content in the segmentation results. (a) Original image; (b) segmentation standard after image annotation; (c) segmentation results of Swin-UNet; (d) segmentation results of Swin-UNet + BRA; (e) segmentation results of the algorithm proposed in this paper.
Figure 11. Comparison of experimental visualization results. Green boxes mark the areas where cracks are located, and red circles indicate missing or excess content in the segmentation results. (a) Original image; (b) segmentation standard after image annotation; (c) segmentation results of Res-UNet; (d) segmentation results of Att-UNet; (e) segmentation results of Trans-UNet; (f) segmentation results of Swin-UNet; (g) segmentation results of the algorithm proposed in this paper.
Figure 12. Loss and Dice curves. (a) Comparison of the loss curves for the Att-UNet, Swin-UNet, and BS-UNet algorithms. (b) Comparison of the Dice curves for the Att-UNet, Swin-UNet, and BS-UNet algorithms.