Semantic Segmentation Network Based on Adaptive Attention and Deep Fusion Utilizing a Multi-Scale Dilated Convolutional Pyramid
Figure 1. Overall DeepLabv3+ architecture.
Figure 2. Spatial Attention module.
Figure 3. SKNet architecture.
Figure 4. SDAMNet architecture.
Figure 5. SDAMNet flowchart.
Figure 6. SCSDM module.
Figure 7. SFFM architecture.
Figure 8. Sample images of the Aerial Drone Image dataset.
Figure 9. Sample images of the UDD6 dataset.
Figure 10. Ablation study on DCASPP on the Aerial Drone Image dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-DCASPP. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 11. Ablation study on DCASPP on the UDD6 dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-DCASPP. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 12. Ablation study on SCSDM on the Aerial Drone Image dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-SCSDM. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 13. Ablation study on SCSDM on the UDD6 dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-SCSDM. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 14. Ablation study on SFFM on the Aerial Drone Image dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-SFFM. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 15. Ablation study on SFFM on the UDD6 dataset: (a) original image, (b) mask, (c) Deeplabv3+, and (d) Deeplabv3+-SFFM. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 16. Results on the Aerial Drone Image dataset: (a) original image, (b) mask, (c) Deeplabv3+, (d) Deeplabv3+ with DCASPP, (e) Deeplabv3+ with DCASPP and SFFM, and (f) SDAMNet. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 17. Results on the UDD6 dataset: (a) original image, (b) mask, (c) Deeplabv3+, (d) Deeplabv3+ with DCASPP, (e) Deeplabv3+ with DCASPP and SFFM, and (f) SDAMNet. The red squares highlight the comparative effectiveness of the module in specific regions.
Figure 18. Results on the UDD6 dataset: (a) original RGB color image, (b) ground truth label, (c) SwinV2, (d) UVid-Net, and (e) SDAMNet. The red squares highlight the comparative effectiveness of the model in specific regions.
Figure 19. Visualization results of different methods on the UDD6 dataset: (a) original RGB color image, (b) ground truth labels, (c) DANet, (d) Segformer, and (e) SDAMNet. The red squares highlight the comparative effectiveness of the model in specific regions.
Abstract
1. Introduction
- A Dilated Convolutional Atrous Spatial Pyramid Pooling (DCASPP) module is constructed from multi-level dilated convolutions and a global average pooling branch. DCASPP extracts contextual information effectively, significantly improving the model's understanding of complex scenes and its segmentation accuracy.
- The Semantic Channel Space Details Module (SCSDM) is introduced, integrating multi-scale receptive-field fusion with spatial attention. SCSDM significantly enhances the model's ability to perceive critical regions, improving both the accuracy and the efficiency of semantic segmentation.
- The Semantic Features Fusion Module (SFFM) is proposed, incorporating feature-weighted fusion, channel attention, and spatial attention. This design mitigates the limited semantic information of low-level features and the reduced resolution of high-level features, markedly improving overall image comprehension and segmentation accuracy and underscoring the SFFM's role in advancing semantic segmentation performance.
- Extensive experiments on two public datasets validate the effectiveness of SDAMNet. Specifically, SDAMNet achieves a Mean Intersection over Union (MIOU) of 67.2% on the Aerial Semantic Segmentation Drone dataset and 75.3% on the UDD6 dataset.
2. Related Work
2.1. Deeplabv3+
2.2. Attention Module
3. Semantic Segmentation Network Based on Adaptive Attention and Deep Fusion with the Multi-Scale Dilated Convolutional Pyramid
3.1. Overall Architecture
- Encoder–Decoder Architecture: SDAMNet is an end-to-end neural network built on an encoder–decoder architecture.
  - Encoder: uses ResNet101 as the backbone to extract multi-scale feature information.
  - Decoder: integrates high-level and low-level features, refines the feature maps with the SFFM module, adjusts channel numbers, processes them with convolutions, upsamples to the original image size, and improves segmentation accuracy and detail capture.
- Key Modules:
  - DCASPP (Dilated Convolutional Atrous Spatial Pyramid Pooling): enhances feature extraction by capturing information at multiple scales and incorporating global context.
  - SCSDM (Semantic Channel Space Details Module): improves the model's ability to focus on important features and accurately capture target boundaries and details.
  - SFFM (Semantic Features Fusion Module): combines high-level and low-level features to exploit the strengths of both.
- Processing Steps (a minimal data-flow sketch follows this list):
  - Feature Extraction: the encoder processes the input image to generate high-level and low-level semantic information.
  - Feature Enhancement: the DCASPP and SCSDM modules refine the feature maps, strengthening the model's perception of key areas and details.
  - Feature Fusion: the SFFM module merges the enhanced feature maps, balancing information from different levels.
  - Upsampling and Prediction: the decoder upsamples the feature maps to the original input size and produces the final prediction.
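To make the data flow concrete, here is a minimal structural sketch in PyTorch, assuming low-level features are taken from an early ResNet101 stage and high-level features from the last stage. The DCASPP, SCSDM, and SFFM blocks are reduced to simple stand-in convolutions (their internals are discussed in Sections 3.2 to 3.4), and all channel widths and upsampling factors are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal structural sketch of the SDAMNet data flow described above; not the
# authors' implementation. Channel widths, feature-map strides, and the stand-in
# blocks for DCASPP/SCSDM/SFFM are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101


class SDAMNetSketch(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        backbone = resnet101(weights=None)
        # Encoder: ResNet101 stages; layer1 supplies low-level, layer4 high-level features.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        # Stand-ins for DCASPP, SCSDM, and SFFM (sketched separately in Secs. 3.2-3.4).
        self.dcaspp = nn.Conv2d(2048, 256, 1)
        self.scsdm = nn.Conv2d(256, 256, 3, padding=1)
        self.low_proj = nn.Conv2d(256, 48, 1)
        self.sffm = nn.Conv2d(256 + 48, 256, 3, padding=1)
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        x = self.stem(x)
        low = self.layer1(x)                                  # low-level features (1/4 scale)
        high = self.layer4(self.layer3(self.layer2(low)))     # high-level features (1/32 scale)
        high = self.scsdm(self.dcaspp(high))                  # feature enhancement
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.sffm(torch.cat([high, self.low_proj(low)], dim=1))  # feature fusion
        out = self.classifier(fused)
        return F.interpolate(out, size=size, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    print(SDAMNetSketch()(torch.randn(1, 3, 512, 512)).shape)  # -> (1, 6, 512, 512)
```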
3.2. Dilated Convolutional Atrous Spatial Pyramid Pooling
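As described in the contributions, DCASPP combines multi-level dilated convolutions with a global average pooling branch. The PyTorch sketch below is one plausible realization of such a block; the dilation rates (1, 6, 12, 18) and the 256-channel width are assumptions borrowed from common ASPP configurations, not the module's published settings.

```python
# Hedged sketch of a DCASPP-style block: parallel dilated (atrous) convolutions at
# several rates plus a global-average-pooling branch, concatenated and projected back
# to a fixed width. Rates and channel widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DCASPPSketch(nn.Module):
    def __init__(self, in_ch: int = 2048, out_ch: int = 256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            k, p = (1, 0) if r == 1 else (3, r)
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=p, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # Image-level context branch: global average pooling, then a 1x1 convolution.
        self.gap = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        g = F.interpolate(self.gap(x), size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))


# Example: a 2048-channel high-level feature map is reduced to 256 channels.
# out = DCASPPSketch()(torch.randn(2, 2048, 32, 32))  # -> (2, 256, 32, 32)
```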
3.3. Semantic Channel Space Details Module
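SCSDM is characterized earlier as multi-scale receptive-field fusion combined with spatial attention, and the figure list includes an SKNet diagram, so the sketch below pairs a selective-kernel-style branch selection with a CBAM-style spatial attention map. The two-branch layout, kernel sizes, and reduction ratio are assumptions for illustration, not the module's published design.

```python
# Hedged sketch of an SCSDM-style block: SK-style selection over branches with
# different receptive fields, followed by a CBAM-style spatial attention map.
import torch
import torch.nn as nn


class SCSDMSketch(nn.Module):
    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        # Two branches with different receptive fields (3x3 and dilated 3x3, i.e. ~5x5).
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False)
        # SK-style channel-wise selection weights over the two branches.
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels * 2, 1))
        # Spatial attention: 7x7 convolution over pooled mean/max maps (CBAM style).
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        u3, u5 = self.branch3(x), self.branch5(x)
        weights = self.fc(u3 + u5).view(b, 2, c, 1, 1).softmax(dim=1)
        fused = weights[:, 0] * u3 + weights[:, 1] * u5           # receptive-field fusion
        pooled = torch.cat([fused.mean(1, keepdim=True), fused.amax(1, keepdim=True)], dim=1)
        return fused * torch.sigmoid(self.spatial(pooled))        # spatial attention


# Example: out = SCSDMSketch()(torch.randn(2, 256, 32, 32))  # -> (2, 256, 32, 32)
```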
3.4. Semantic Features Fusion Module
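SFFM is described as feature-weighted fusion followed by channel and spatial attention for merging low-level and high-level features. The sketch below is one plausible realization under that description; the gating scheme and attention layouts are assumptions, and both inputs are assumed to have been projected and upsampled to a common shape beforehand.

```python
# Hedged sketch of an SFFM-style fusion block: low- and high-level features are
# combined with a learned per-pixel weighting, then refined by channel (SE-style)
# and spatial (CBAM-style) attention. Layout details are illustrative assumptions.
import torch
import torch.nn as nn


class SFFMSketch(nn.Module):
    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        # Per-pixel fusion weight balancing the two feature streams.
        self.gate = nn.Sequential(nn.Conv2d(channels * 2, 1, 1), nn.Sigmoid())
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Assumes both inputs share shape (B, C, H, W).
        w = self.gate(torch.cat([low, high], dim=1))
        fused = w * low + (1.0 - w) * high            # feature-weighted fusion
        fused = fused * self.channel(fused)           # channel attention
        pooled = torch.cat([fused.mean(1, keepdim=True), fused.amax(1, keepdim=True)], dim=1)
        return fused * self.spatial(pooled)           # spatial attention


# Example: out = SFFMSketch()(torch.randn(2, 256, 128, 128), torch.randn(2, 256, 128, 128))
```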
4. Experiments
4.1. Datasets
4.1.1. Aerial Drone Image Dataset
4.1.2. UDD6 Dataset
4.2. Implementation Details and Evaluations
4.2.1. Experimental Setup
4.2.2. Evaluation Indicators
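The reported indicators are MIOU, ACC, and FWIOU (see the Nomenclature). Assuming the standard confusion-matrix definitions in terms of per-class true positives (TP), false positives (FP), and false negatives (FN), they are:

```latex
% Standard definitions assumed; the paper's exact notation may differ.
\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad
\mathrm{MIOU} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{IoU}_i, \qquad
\mathrm{ACC} = \frac{\sum_{i=1}^{k} TP_i}{N}, \qquad
\mathrm{FWIOU} = \sum_{i=1}^{k}\frac{N_i}{N}\,\mathrm{IoU}_i
```

Here k is the number of classes, N_i the number of ground-truth pixels of class i, and N the total pixel count. ACC as written is overall pixel accuracy; some comparison tables instead report mean class accuracy (mAcc).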
4.3. Experimental Results
4.3.1. Ablation Study on DCASPP
4.3.2. Ablation Study on SCSDM
4.3.3. Ablation Study on SFFM
4.3.4. Ablation Study on the Overall SDAMNet Architecture
4.3.5. Evaluation on the Aerial Drone Image Dataset
4.3.6. Comparison on the UDD6 Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Nomenclature
Abbreviation | Full Term |
---|---|
SA | Spatial Attention |
CA | Channel Attention |
Rate | Dilation Rate |
ASPP | Atrous Spatial Pyramid Pooling |
SDAMNet | Semantic Segmentation Network Based on Adaptive Attention and Deep Fusion with the Multi-Scale Dilated Convolutional Pyramid |
DCASPP | Dilated Convolutional Atrous Spatial Pyramid Pooling |
SCSDM | Semantic Channel Space Details Module |
SFFM | Semantic Features Fusion Module |
MIOU | Mean Intersection over Union |
FWIOU | Frequency Weighted Intersection over Union |
ACC | Accuracy |
ReLU | Rectified Linear Unit |
BN | Batch Normalization |
SGD | Stochastic Gradient Descent |
TP | True Positive |
FP | False Positive |
TN | True Negative |
FN | False Negative |
References
- Wang, Y.; Yu, X.; Yang, Y.; Zhang, X.; Zhang, Y.; Zhang, L.; Feng, R.; Xue, J. A multi-branched semantic segmentation network based on twisted information sharing pattern for medical images. Comput. Methods Programs Biomed. 2024, 243, 107914. [Google Scholar] [CrossRef] [PubMed]
- Wang, F.; Wang, H.; Qin, Z.; Tang, J. UAV target detection algorithm based on improved YOLOv8. IEEE Access 2023, 11, 116534–116544. [Google Scholar] [CrossRef]
- Jung, J.; Kim, S.; Jang, W.; Seo, B.; Lee, K.J. An energy-efficient, unified CNN accelerator for real-time multi-object semantic segmentation for autonomous vehicle. IEEE Trans. Circuits Syst. I Regul. Pap. 2024, 71, 2093–2104. [Google Scholar] [CrossRef]
- Otsu, N. A threshold selection method from gray-level histograms. Automatica 1975, 11, 23–27. [Google Scholar] [CrossRef]
- Muthukrishnan, R.; Radha, M. Edge detection techniques for image segmentation. Int. J. Comput. Sci. Inf. Technol. 2011, 3, 259. [Google Scholar] [CrossRef]
- Kaganami, H.G.; Beiji, Z. Region-based segmentation versus edge detection. In Proceedings of the 2009 Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kyoto, Japan, 12–14 September 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1217–1221. [Google Scholar]
- Roberts, L.G. Machine Perception of Three-Dimensional Solids. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1963. [Google Scholar]
- Kanopoulos, N.; Vasanthavada, N.; Baker, R.L. Design of an image edge detection filter using the Sobel operator. IEEE J. Solid-State Circuits 1988, 23, 358–367. [Google Scholar] [CrossRef]
- Prewitt, J.M. Object enhancement and extraction. Pict. Process. Psychopictorics 1970, 10, 15–19. [Google Scholar]
- Gunn, S.R. On the discrete representation of the Laplacian of Gaussian. Pattern Recognit. 1999, 32, 1463–1472. [Google Scholar] [CrossRef]
- Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9. [Google Scholar] [CrossRef]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Ding, X.; Guo, Y.; Ding, G.; Han, J. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920. [Google Scholar]
- Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv 2021, arXiv:1909.11065. [Google Scholar]
- Lyu, Y.; Vosselman, G.; Xia, G.-S.; Yang, M.Y. Bidirectional multi-scale attention networks for semantic segmentation of oblique UAV imagery. arXiv 2021, arXiv:2102.03099. [Google Scholar] [CrossRef]
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
- Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603018. [Google Scholar] [CrossRef]
- Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Girisha, S.; Verma, U.; Pai, M.M.; Pai, R.M. Uvid-net: Enhanced semantic segmentation of uav aerial videos by embedding temporal information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4115–4127. [Google Scholar] [CrossRef]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
- Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Luo, H.; Lu, Y. DeepLabV3-SAM: A novel image segmentation method for rail transportation. In Proceedings of the 2023 3rd International Conference on Electronic Information Engineering and Computer Communication (EIECC), Wuhan, China, 8–10 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Hou, X.; Chen, P.; Gu, H. LM-DeeplabV3+: A Lightweight Image Segmentation Algorithm Based on Multi-Scale Feature Interaction. Appl. Sci. 2024, 14, 1558. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
- Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An efficient pyramid squeeze attention block on convolutional neural network. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 1161–1177. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
- Zhao, S.; Wang, Y.; Tian, K. Using AAEHS-net as an attention-based auxiliary extraction and hybrid subsampled network for semantic segmentation. Comput. Intell. Neurosci. 2022, 2022, 1536976. [Google Scholar] [CrossRef] [PubMed]
- Chen, Y.; Wang, Y.; Lu, P.; Chen, Y.; Wang, G. Large-scale structure from motion with semantic constraints of aerial images. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Guangzhou, China, 23–26 November 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 347–359. [Google Scholar]
- Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
Method | Year | Key Contributions |
---|---|---|
Threshold-Based | 1975 | Classifies pixels into foreground and background by setting a grayscale threshold. Simple but poor on complex images. |
Edge-Based | 2011 | Detects differences in grayscale values between pixels and forms edge contours. Sensitive to noise. |
Region-Based | 2009 | Segments images based on spatial information and pixel similarity features. Sensitive to noise. |
Method | Year | Key Contributions |
---|---|---|
AlexNet | 2012 | Employs multiple convolutions to extract local features, significantly improving segmentation beyond traditional methods. |
BiSeNet | 2018 | Introduces a detail branch for extracting details and boundary information, and a semantic branch for capturing global contextual information, enriching spatial features. |
BiSeNetV2 | 2021 | Enhances BiSeNet with a more powerful detail branch and a lightweight spatial branch, achieving better performance and real-time inference speed. |
DANet | 2019 | Integrates local and global features using spatial and channel attention modules, enhancing feature representation for precise segmentation. |
ACNet | 2019 | Utilizes Asymmetric Convolution Blocks (ACBs) to replace standard convolutions, improving accuracy and robustness against distortions. |
OCRNet | 2021 | Aggregates object-contextual representations to enhance pixel classification by leveraging the relationship between pixels and object regions. |
BiMSANet | 2021 | Tackles scale variation in oblique aerial images using bidirectional multi-scale attention networks for more effective semantic segmentation. |
Twins | 2021 | Revisits spatial attention design, proposing Twins-PCPVT and Twins-SVT for optimized matrix multiplications. |
HMANet | 2021 | Integrates class-augmented and region shuffle attention to enhance VHR aerial image segmentation. |
DDRNet | 2021 | Fuses features through bidirectional and dense connections, introducing a multi-scale feature fusion technique. |
U-Net | 2015 | Known for its encoder–decoder architecture, using skip connections to fuse low-level and deep-level information. |
U2-Net | 2020 | Adopts a two-level nested U-shaped structure with Residual U blocks (RSU) to capture more contextual information, reducing training costs. |
SegFormer | 2021 | Unifies Transformers with MLP decoders, using a hierarchically structured encoder and lightweight decoder for efficient multi-scale feature aggregation. |
UVid-Net | 2021 | Enhances UAV video semantic segmentation by incorporating temporal information and a feature-refiner module for accurate, temporally consistent labeling. |
SETR | 2021 | Uses a pure transformer to encode images into patch sequences, providing global context and powerful segmentation capabilities. |
Swin Transformer | 2021 | Introduces a hierarchical architecture with shifted windows for efficient self-attention, addressing computational challenges of large images. |
Swin V2 | 2022 | Introduces residual-post-norm, log-spaced position bias, and SimMIM self-supervised pre-training to tackle training instability and resolution gaps. |
Swin-UNet | 2022 | Combines hierarchical Swin Transformers as the encoder and symmetric Swin Transformers as the decoder for high-performance image semantic segmentation. |
SegNet | 2017 | Known for memory and computational efficiency, but max-pooling and subsampling may produce coarse segmentation results. |
SegNeXt | 2022 | Introduces Multi-Scale Convolutional Attention (MSCA) structure with depth-wise separable convolutions for enhanced semantic segmentation performance. |
DeepLabV3+ | 2018 | Combines an encoder–decoder structure with an improved Atrous Spatial Pyramid Pooling (ASPP) module for strong semantic understanding capabilities. |
DeepLabV3-SAM | 2023 | Integrates the Segment Anything Model (SAM) with traditional semantic segmentation algorithms for accurate, low-cost, and fully automated image segmentation. |
LM-DeepLabv3+ | 2024 | Lightweight method using MobileNetV2 as the backbone network, with ECA-Net and EPSA-Net attention mechanisms to reduce parameters and computational complexity while enhancing feature representation. |
Parameter | Aerial Drone Image Dataset | UDD6 Dataset |
---|---|---|
Batch size | 8 | 4 |
Optimizer | SGD | SGD |
Learning strategy | CosineAnnealingLR | CosineAnnealingLR |
Epochs | 150 | 150 |
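For reference, a minimal PyTorch sketch of the optimization setup in the table above: SGD with cosine-annealing learning-rate scheduling over 150 epochs. The model, learning rate, momentum, and weight decay below are placeholders, since they are not values reported in the table.

```python
# Minimal sketch of the training configuration; placeholder values are marked.
import torch

model = torch.nn.Conv2d(3, 6, 1)  # placeholder standing in for SDAMNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(150):
    # ... iterate over the training loader here (batch size 8 for the Aerial Drone
    # Image dataset, 4 for UDD6), computing the loss and calling optimizer.step() ...
    scheduler.step()
```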
Method | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|
Methods on the Aerial Drone Image Dataset | |||
Deeplabv3+ | 64.28 | 93.82 | 89.09 |
Deeplabv3+DCASPP | 64.99 | 94.04 | 89.41 |
Methods on the UDD6 Dataset | |||
Deeplabv3+ | 73.14 | 86.91 | 79.26 |
Deeplabv3+DCASPP | 73.67 | 87.41 | 79.47 |
Method | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|
Methods on the Aerial Drone Image Dataset | |||
Deeplabv3+ | 64.28 | 93.82 | 89.09 |
Deeplabv3+SCSDM | 65.49 | 94.22 | 89.69 |
Methods on the UDD6 Dataset | |||
Deeplabv3+ | 73.14 | 86.91 | 79.26 |
Deeplabv3+SCSDM | 73.93 | 87.71 | 79.69 |
Method | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|
Methods on the Aerial Drone Image Dataset | |||
Deeplabv3+ | 64.28 | 93.82 | 89.09 |
Deeplabv3+-SFFM | 65.20 | 94.05 | 89.48 |
Methods on the UDD6 Dataset | |||
Deeplabv3+ | 73.14 | 86.91 | 79.26 |
Deeplabv3+-SFFM | 73.94 | 87.74 | 79.93 |
Fusion Method | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|
concat | 64.28 | 93.82 | 89.09 |
add | 64.20 | 93.82 | 89.07 |
SFFM | 65.20 | 94.05 | 89.48 |
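For clarity, the three fusion strategies compared in the table above reduce to the following minimal forms (tensor shapes and channel counts are illustrative assumptions):

```python
# "concat" stacks channels and relies on a follow-up convolution; "add" sums features
# element-wise; "SFFM" is the learned weighted fusion with channel and spatial
# attention sketched in Section 3.4.
import torch

low = torch.randn(1, 48, 128, 128)    # low-level features (channels assumed)
high = torch.randn(1, 48, 128, 128)   # high-level features after upsampling/projection

concat_fused = torch.cat([low, high], dim=1)  # "concat": (1, 96, 128, 128)
add_fused = low + high                        # "add": channels unchanged
```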
Deeplabv3+ | DCASPP | SFFM | SCSDM | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|---|---|---|
√ |  |  |  | 64.28 | 93.82 | 89.09 |
√ | √ |  |  | 64.99 | 94.04 | 89.41 |
√ | √ | √ |  | 65.48 | 94.19 | 89.68 |
√ | √ | √ | √ | 67.17 | 94.65 | 90.45 |
Deeplabv3+ | DCASPP | SFFM | SCSDM | MIOU (%) | ACC (%) | FWIOU (%) |
---|---|---|---|---|---|---|
√ |  |  |  | 73.14 | 86.91 | 79.26 |
√ | √ |  |  | 73.67 | 87.41 | 79.47 |
√ | √ | √ |  | 74.19 | 87.81 | 80.08 |
√ | √ | √ | √ | 75.27 | 89.02 | 80.83 |
Method | MIOU (%) |
---|---|
Deeplabv3+ | 64.3 |
DANet | 63.8 |
SETR | 58.6 |
Swin | 63.1 |
SwinV2 | 64.2 |
Twins | 65.3 |
BiMSANet | 65.5 |
UVid-Net | 65.8 |
HMANet | 63.5 |
SDAMNet | 67.2 |
Method | MIOU (%) | mAcc (%) |
---|---|---|
Deeplabv3+ | 73.1 | 86.9 |
DANet | 73.7 | 87.5 |
ACNet | 74.1 | 87.8 |
OCRNet | 73.9 | 87.7 |
SETR | 71.9 | 85.5 |
Swin | 72.7 | 86.3 |
Segformer | 74.9 | 88.1 |
SDAMNet | 75.3 | 89.0 |
Method | Inference Time (ms) | Model Parameters (MB) | FLOPs (G) |
---|---|---|---|
Deeplabv3+ | 55.0 | 63 | 1232 |
DANet | 55.6 | 69 | 1836 |
ACNet | 56.4 | 72 | 1938 |
OCRNet | 58.6 | 83 | 2036 |
SETR | 77.8 | 308 | 3018 |
Swin | 65.2 | 234 | 2336 |
Segformer | 60.4 | 168 | 2158 |
SDAMNet | 59.3 | 91 | 2075 |