Encoding Contextual Information by Interlacing Transformer and Convolution for Remote Sensing Imagery Semantic Segmentation
"> Figure 1
<p>Illustration of Swin Transformer Block.</p> "> Figure 2
<p>The Framework of ICTNet.</p> "> Figure 3
<p>The Pipeline of EFA.</p> "> Figure 4
<p>The Pipeline of DFA.</p> "> Figure 5
<p>A Sample of the ISPRS Vaihingen Benchmark.</p> "> Figure 6
<p>A Sample of the ISPRS Potsdam Benchmark.</p> "> Figure 7
<p>A Sample of the DeepGlobe Benchmark.</p> "> Figure 8
<p>Visualizations of predictions of the test set of the ISPRS Vaihingen benchmark. (<b>a</b>) Input image, (<b>b</b>) ground truth, (<b>c</b>) FCN-8s, (<b>d</b>) SegNet, (<b>e</b>) U-Net, (<b>f</b>) DeepLab V3+, (<b>g</b>) CBAM, (<b>h</b>) DANet, (<b>i</b>) ResUNet-a, (<b>j</b>) SCAttNet, (<b>k</b>) HCANet, (<b>l</b>) ICTNet.</p> "> Figure 9
<p>Visualizations of predictions of the test set of the ISPRS Potsdam benchmark. (<b>a</b>) Input image, (<b>b</b>) ground truth, (<b>c</b>) FCN-8s, (<b>d</b>) SegNet, (<b>e</b>) U-Net, (<b>f</b>) DeepLab V3+, (<b>g</b>) CBAM, (<b>h</b>) DANet, (<b>i</b>) ResUNet-a, (<b>j</b>) SCAttNet, (<b>k</b>) HCANet, (<b>l</b>) ICTNet.</p> "> Figure 10
<p>Visualizations of predictions of the test set of the DeepGlobe benchmark. (<b>a</b>) Input image, (<b>b</b>) ground truth, (<b>c</b>) FCN-8s, (<b>d</b>) SegNet, (<b>e</b>) U-Net, (<b>f</b>) DeepLab V3+, (<b>g</b>) CBAM, (<b>h</b>) DANet, (<b>i</b>) ResUNet-a, (<b>j</b>) SCAttNet, (<b>k</b>) HCANet, (<b>l</b>) ICTNet.</p> "> Figure 11
<p>The pipelines of (<b>a</b>) CB-only encoder, (<b>b</b>) STB-only encoder.</p> "> Figure A1
<p>Training loss of ablation study of DFA.</p> "> Figure A2
<p>Training mIoU of ablation study of DFA.</p> ">
Abstract
1. Introduction
- (1) Pure transformers have demonstrated astounding performance on computer vision tasks owing to their robust scalability in establishing long-range dependencies. However, a pure transformer flattens a high-resolution remote sensing image into a one-dimensional token sequence, breaking its inherent spatial structure and discarding countless details. Revisiting convolutions, CNNs learn locality and thereby supply complementary geometric and topological information from low-level features. It is therefore necessary to sufficiently coalesce convolved local patterns with attentive affinities, which enriches both local and distant contextual information and strengthens the distinguishability of the learnt representations (a minimal fusion sketch follows this list).
- (2) Apart from encoding distinguishable pixel-wise representations, the decoder plays a vital role in recovering feature maps while preserving fundamental features. However, transformation loss is inevitable as the network deepens. Existing decoders deploy multiple stages that repeatedly enlarge the spatial resolution of preceding layers by bilinear upsampling. Moreover, the shallow layers' low-level features, which contain valuable clues for prediction, are not aggregated with the relevant decoded ones.
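As a concrete illustration of this complementarity, below is a minimal PyTorch sketch of fusing a convolution branch (local patterns) with a global self-attention branch (long-range dependencies). It is our simplified illustration, not the paper's EFA module; the class name `LocalGlobalFusion` and all hyperparameters are assumptions (`channels` must be divisible by `num_heads`).

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    """Illustrative fusion of convolved local patterns with attentive affinity.
    A sketch of the idea only, not the paper's exact EFA module."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # convolution branch: locality, geometry, low-level cues
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # attention branch: long-range dependencies over all pixels
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        global_, _ = self.attn(tokens, tokens, tokens)   # attentive affinities
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, global_], dim=1))
```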
- (1) To leverage long-range visual dependencies and local patterns simultaneously, an encoder that interlaces convolution and transformer blocks hierarchically is proposed to produce fine-grained features with high representativeness and distinguishability. Gradually integrating convolved and transformed representations lets the network exploit the strengths of convolutions in extracting low-level features and reinforcing locality, together with the strengths of transformers in modeling distant visual dependencies at multiple scales.
- (2) To recover features losslessly and efficiently, the decoder adopts DUP, followed by fusion with the corresponding encoded feature map, as the basic unit for progressively expanding spatial resolution. Instead of multiple convolutions and upsampling operations, DUP performs a one-step matrix projection that preserves details well while enlarging the spatial size by an arbitrary ratio (see the sketch after this list).
- (3) Considering the variety and heterogeneity of aerial and satellite imagery, extensive experiments are conducted on three typical semantic segmentation benchmarks of remote sensing imagery: ISPRS Vaihingen [41], ISPRS Potsdam [42], and DeepGlobe [43]. Quantitative and qualitative evaluations are compared and analyzed to validate the effectiveness and superiority of the proposed method. Furthermore, ablation studies verify the efficacy of incorporating the transformer and the designed decoder.
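Since contribution (2) hinges on DUP's one-step matrix projection, here is a hedged sketch in the spirit of data-dependent upsampling (Tian et al. [40]): a learned 1 × 1 projection per pixel followed by a depth-to-space rearrangement. The class name and the 1 × 1-conv realization are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DUP(nn.Module):
    """Sketch of one-step data-dependent upsampling: a learned linear
    projection per pixel, then a depth-to-space rearrangement by `ratio`."""
    def __init__(self, in_channels: int, out_channels: int, ratio: int):
        super().__init__()
        self.ratio = ratio
        # project each in_channels-dim pixel vector to out_channels * ratio^2 values
        self.proj = nn.Conv2d(in_channels, out_channels * ratio ** 2, kernel_size=1)

    def forward(self, x):
        x = self.proj(x)                        # (B, C_out * r^2, H, W)
        return F.pixel_shuffle(x, self.ratio)   # (B, C_out, H * r, W * r)
```

Because the whole upsampling is a single linear projection, changing the output width of `proj` changes the ratio, so a feature map can be enlarged by an arbitrary ratio in one step rather than through stacked convolution–upsample stages.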
2. Related Works
2.1. Attention-Based Semantic Segmentation for RSI
2.2. Transformers for Semantic Segmentation
2.3. Transformers in Remote Sensing
3. The Proposed Method
3.1. Revisiting Swin Transformer
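For reference, the two core operations of the Swin Transformer block revisited here are window partitioning (for W-MSA) and the cyclic shift (for SW-MSA). A simplified sketch of both, following the standard formulation:

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, H, W, C) features into non-overlapping ws x ws windows,
    yielding (num_windows * B, ws * ws, C) token groups for W-MSA."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def cyclic_shift(x: torch.Tensor, ws: int) -> torch.Tensor:
    """SW-MSA rolls the map by ws // 2 before partitioning, so the next
    block exchanges information across window borders."""
    return torch.roll(x, shifts=(-(ws // 2), -(ws // 2)), dims=(1, 2))
```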
3.2. ICTNet
3.2.1. Overall Framework
3.2.2. Encoded Feature Aggregation
3.2.3. Decoded Feature Aggregation
4. Experiments and Discussion
4.1. Settings
4.1.1. Datasets
1. ISPRS Vaihingen Benchmark
2. ISPRS Potsdam Benchmark
3. DeepGlobe Land Cover Dataset
4.1.2. Implementation Details
4.1.3. Numerical Metrics
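The scores reported below are per-class F1 and overall accuracy (OA). A standard computation from a K × K confusion matrix is sketched here; the benchmarks' exact evaluation protocols (e.g., boundary-eroded ground truth) may differ in detail.

```python
import numpy as np

def f1_and_oa(conf: np.ndarray):
    """Per-class F1 and overall accuracy from a KxK confusion matrix
    (rows: ground truth, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)   # TP / predicted positives
    recall = tp / np.maximum(conf.sum(axis=1), 1)      # TP / actual positives
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / conf.sum()                         # correctly labeled pixels
    return f1, oa
```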
4.2. Compare to State-of-the-Art Methods
4.2.1. Results on Vaihingen Benchmark
4.2.2. Results on Potsdam Benchmark
4.2.3. Results on DeepGlobe Benchmark
4.3. Ablation Study of DFA
4.4. Ablation Study of DUP
4.5. Ablation Study of STB and CB
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Ablation Study of DFA
Appendix B. Feature Maps of Different Encoder Stages of ICTNet
Stages | STB Branch Block | Output Feature | CB Branch Block | Output Feature
---|---|---|---|---
Stage 1 | PP | — | CB1 | —
 | STB1 | — | CB2 | —
 | EFA1 | — | |
Stage 2 | STB2 | — | CB3 | —
 | EFA2 | — | |
Stage 3 | STB3 | — | CB4 | —
 | EFA3 | — | |
Stage 4 | STB4 | — | CB5 | —
 | EFA4 | — | |
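One plausible reading of the table above, written as a hypothetical forward-pass skeleton; the exact wiring of the EFA outputs follows the paper's Figure 2 and may differ from this sketch.

```python
def encoder_forward(x, pp, stbs, cbs, efas):
    """pp: patch partition; stbs: [STB1..STB4]; cbs: [CB1..CB5];
    efas: [EFA1..EFA4] -- all passed in as callables."""
    t = pp(x)        # transformer branch input
    c = cbs[0](x)    # convolution branch input (CB1)
    fused = []
    for i in range(4):               # four stages, as tabulated
        t = stbs[i](t)               # STB_i models long-range dependencies
        c = cbs[i + 1](c)            # CB_{i+1} extracts local patterns
        fused.append(efas[i](t, c))  # EFA_i interlaces the two branches
    return fused                     # stage-wise fused maps for the decoder
```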
References
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
- Caballero, I.; Roca, M.; Santos-Echeandía, J.; Bernárdez, P.; Navarro, G. Use of the Sentinel-2 and Landsat-8 Satellites for Water Quality Monitoring: An Early Warning Tool in the Mar Menor Coastal Lagoon. Remote Sens. 2022, 14, 2744. [Google Scholar] [CrossRef]
- Li, X.; Lyu, X.; Tong, Y.; Li, S.; Liu, D. An object-based river extraction method via optimized transductive support vector machine for multi-spectral remote-sensing images. IEEE Access 2019, 7, 46165–46175. [Google Scholar] [CrossRef]
- Wang, H.; Li, W.; Huang, W.; Nie, K. A Multi-Objective Permanent Basic Farmland Delineation Model Based on Hybrid Particle Swarm Optimization. ISPRS Int. J. Geo-Inf. 2020, 9, 243. [Google Scholar] [CrossRef]
- Di Pilato, A.; Taggio, N.; Pompili, A.; Iacobellis, M.; Di Florio, A.; Passarelli, D.; Samarelli, S. Deep Learning Approaches to Earth Observation Change Detection. Remote Sens. 2021, 13, 4083. [Google Scholar] [CrossRef]
- Wang, H.; Li, W.; Huang, W.; Niu, J.; Nie, K. Research on land use classification of hyperspectral images based on multiscale superpixels. Math. Biosci. Eng. 2020, 17, 5099–5119. [Google Scholar] [CrossRef]
- Trenčanová, B.; Proença, V.; Bernardino, A. Development of Semantic Maps of Vegetation Cover from UAV Images to Support Planning and Management in Fine-Grained Fire-Prone Landscapes. Remote Sens. 2022, 14, 1262. [Google Scholar] [CrossRef]
- Can, G.; Mantegazza, D.; Abbate, G.; Chappuis, S.; Giusti, A. Semantic segmentation on Swiss3DCities: A benchmark study on aerial photogrammetric 3D pointcloud dataset. Pattern Recognit. Lett. 2021, 150, 108–114. [Google Scholar] [CrossRef]
- Liu, C.; Zeng, D.; Akbar, A.; Wu, H.; Jia, S.; Xu, Z.; Yue, H. Context-Aware Network for Semantic Segmentation Towards Large-Scale Point Clouds in Urban Environments. IEEE Trans. Geosci. Remote Sens. 2022; early access. [Google Scholar] [CrossRef]
- Pham, H.N.; Dang, K.B.; Nguyen, T.V.; Tran, N.C.; Ngo, X.Q.; Nguyen, D.A.; Phan, T.T.H.; Nguyen, T.T.; Guo, W.; Ngo, H.H. A new deep learning approach based on bilateral semantic segmentation models for sustainable estuarine wetland ecosystem management. Sci. Total Environ. 2022, 838, 155826. [Google Scholar] [CrossRef]
- Bragagnolo, L.; Rezende, L.; da Silva, R.; Grzybowski, J.M.V. Convolutional neural networks applied to semantic segmentation of landslide scars. Catena 2021, 201, 105189. [Google Scholar] [CrossRef]
- Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321. [Google Scholar] [CrossRef]
- Csurka, G.; Perronnin, F. A Simple High Performance Approach to Semantic Segmentation. In Proceedings of the British Machine Vision Conference (BMVC), Leeds, UK, 20 June 2008; pp. 1–10. [Google Scholar]
- Chai, D.; Newsam, S.; Huang, J. Aerial image semantic segmentation using DCNN predicted distance maps. ISPRS J. Photogramm. Remote Sens. 2020, 161, 309–322. [Google Scholar] [CrossRef]
- Saha, I.; Maulik, U.; Bandyopadhyay, S.; Plewczynski, D. SVMeFC: SVM Ensemble Fuzzy Clustering for Satellite Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2012, 9, 52–55. [Google Scholar] [CrossRef]
- Zheng, C.; Wang, L. Semantic Segmentation of Remote Sensing Imagery Using Object-Based Markov Random Field Model with Regional Penalties. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 1924–1935. [Google Scholar] [CrossRef]
- Smith, A. Image segmentation scale parameter optimization and land cover classification using the Random Forest algorithm. J. Spat. Sci. 2010, 55, 69–79. [Google Scholar] [CrossRef]
- Liu, Y.; Piramanayagam, S.; Monteiro, S.T.; Saber, E. Semantic segmentation of multisensor remote sensing imagery with deep ConvNets and higher-order conditional random fields. J. Appl. Remote Sens. 2019, 13, 016501. [Google Scholar] [CrossRef]
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021; early access. [Google Scholar] [CrossRef]
- Sun, L.; Wu, Z.; Liu, J.; Xiao, L.; Wei, Z. Supervised spectral–spatial hyperspectral image classification with weighted Markov random fields. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1490–1503. [Google Scholar] [CrossRef]
- Sun, L.; Ma, C.; Chen, Y.; Zheng, Y.; Shim, H.J.; Wu, Z.; Jeon, B. Low rank component induced spatial-spectral kernel method for hyperspectral image classification. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3829–3842. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Huang, Z.; Zhang, Q.; Zhang, G. MLCRNet: Multi-Level Context Refinement for Semantic Segmentation in Aerial Images. Remote Sens. 2022, 14, 1498. [Google Scholar] [CrossRef]
- Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images. Remote Sens. 2020, 12, 872. [Google Scholar] [CrossRef]
- Du, S.; Du, S.; Liu, B.; Zhang, X. Mapping large-scale and fine-grained urban functional zones from VHR images using a multi-scale semantic segmentation network and object based approach. Remote Sens. Environ. 2021, 261, 112480. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5999–6009. [Google Scholar]
- Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-Cross Attention for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020; early access. [Google Scholar] [CrossRef]
- Li, X.; Xu, F.; Lyu, X.; Gao, H.; Tong, Y.; Cai, S.; Li, S.; Liu, D. Dual attention deep fusion semantic segmentation networks of large-scale satellite remote-sensing images. Int. J. Remote Sens. 2021, 42, 3583–3610. [Google Scholar] [CrossRef]
- Li, R.; Zheng, S.; Zhang, C.; Su, J.; Atkinson, P.M. Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
- Li, X.; Xu, F.; Xia, R.; Lyu, X.; Gao, H.; Tong, Y. Hybridizing Cross-Level Contextual and Attentive Representations for Remote Sensing Imagery Semantic Segmentation. Remote Sens. 2021, 13, 2986. [Google Scholar] [CrossRef]
- Li, X.; Li, T.; Chen, Z.; Zhang, K.; Xia, R. Attentively Learning Edge Distributions for Semantic Segmentation of Remote Sensing Imagery. Remote Sens. 2022, 14, 102. [Google Scholar] [CrossRef]
- Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
- Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 426–435. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022; early access. [Google Scholar] [CrossRef]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar] [CrossRef]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), online, 6–14 December 2021; pp. 12077–12090. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
- Tian, Z.; He, T.; Shen, C.; Yan, Y. Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3126–3135. [Google Scholar]
- ISPRS Vaihingen 2D Semantic Labeling Dataset. Available online: http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html (accessed on 22 December 2021).
- ISPRS Potsdam 2D Semantic Labeling Dataset. Available online: http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html (accessed on 22 December 2021).
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A challenge to parse the Earth through satellite images. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
- Hu, J.; Shen, L.; Albanie, S. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 31st Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. OCNet: Object Context for Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 2375–2398. [Google Scholar] [CrossRef]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149. [Google Scholar]
- Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Semantic Segmentation on Remotely Sensed Images Using an Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning. Remote Sens. 2019, 11, 83. [Google Scholar] [CrossRef]
- Cui, W.; Wang, F.; He, X.; Zhang, D.; Xu, X.; Yao, M.; Wang, Z.; Huang, J. Multi-Scale Semantic Segmentation and Spatial Relationship Recognition of Remote Sensing Images Based on an Attention Model. Remote Sens. 2019, 11, 1044. [Google Scholar] [CrossRef]
- Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 905–909. [Google Scholar] [CrossRef]
- Yang, X.; Li, S.; Chen, Z.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 177, 238–262. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar] [CrossRef]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers and Distillation through Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar]
- Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
- Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143. [Google Scholar] [CrossRef]
- Lei, S.; Shi, Z.; Mo, W. Transformer-Based Multistage Enhancement for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
- Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar]
Items | Settings
---|---
CB backbone | ResNet-101
STB backbone | Swin-S
Batch size | 16
Learning strategy | Poly decay
Initial learning rate | 0.002
Loss function | Cross-entropy
Optimizer | SGD
Max epoch | 500
Sub-patch size | 256 × 256
Data augmentation | Rotation by 90°, 180°, and 270°; horizontal and vertical flips
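The table names poly decay as the learning strategy. A common realization is sketched below; the exponent `power = 0.9` is the usual default and an assumption here, since the table only names the strategy.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """Poly decay: the learning rate shrinks from base_lr to 0 as training
    progresses, following (1 - t/T)^power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g., with the tabulated initial rate:
# lr = poly_lr(0.002, cur_iter=1000, max_iter=50000)
```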
Per-class F1 score (%), mean F1 (%), and overall accuracy (OA, %) on the ISPRS Vaihingen benchmark:

Methods | Impervious Surfaces | Building | Low Vegetation | Tree | Car | Mean F1 | OA
---|---|---|---|---|---|---|---
FCN-8s [22] | 84.08 | 73.61 | 65.09 | 79.97 | 38.76 | 68.30 | 66.67 |
SegNet [23] | 87.21 | 75.37 | 67.73 | 82.81 | 42.58 | 71.14 | 69.45 |
U-Net [24] | 84.67 | 86.13 | 69.49 | 82.71 | 41.89 | 72.98 | 71.24 |
DeepLab V3+ [60] | 87.99 | 87.80 | 72.72 | 85.55 | 47.37 | 76.29 | 74.47 |
CBAM [46] | 92.38 | 87.95 | 78.81 | 89.46 | 57.60 | 81.24 | 79.30 |
DANet [48] | 91.57 | 90.37 | 80.01 | 88.15 | 58.50 | 81.72 | 79.78 |
ResUNet-a [61] | 92.98 | 95.59 | 85.54 | 89.36 | 91.87 | 91.07 | 88.90 |
SCAttNet [51] | 89.59 | 90.77 | 80.45 | 80.73 | 70.87 | 82.48 | 80.52 |
HCANet [32] | 94.29 | 96.20 | 83.33 | 92.38 | 88.86 | 91.01 | 88.84 |
ICTNet | 94.69 | 96.70 | 86.04 | 92.07 | 92.18 | 92.34 | 90.14 |
Per-class F1 score (%), mean F1 (%), and overall accuracy (OA, %) on the ISPRS Potsdam benchmark:

Methods | Impervious Surfaces | Building | Low Vegetation | Tree | Car | Mean F1 | OA
---|---|---|---|---|---|---|---
FCN-8s [22] | 84.82 | 74.26 | 65.67 | 80.68 | 39.10 | 68.91 | 67.82 |
SegNet [23] | 85.52 | 85.75 | 70.20 | 83.55 | 41.71 | 73.35 | 72.26 |
U-Net [24] | 88.77 | 88.58 | 73.37 | 86.30 | 47.79 | 76.96 | 75.81 |
DeepLab V3+ [60] | 87.44 | 90.20 | 81.03 | 81.03 | 89.51 | 85.84 | 84.48 |
CBAM [46] | 90.67 | 95.69 | 84.54 | 85.44 | 86.55 | 88.58 | 88.25 |
DANet [48] | 91.07 | 96.40 | 83.93 | 83.73 | 93.58 | 89.74 | 88.36 |
ResUNet-a [61] | 93.88 | 97.30 | 88.05 | 88.76 | 96.60 | 92.92 | 91.47 |
SCAttNet [51] | 91.87 | 97.40 | 85.24 | 87.05 | 92.78 | 90.87 | 89.06 |
HCANet [32] | 92.88 | 96.90 | 87.25 | 88.15 | 93.88 | 91.81 | 90.67 |
ICTNet | 93.78 | 97.50 | 88.15 | 88.86 | 96.70 | 93.00 | 91.57 |
Per-class F1 score (%), mean F1 (%), and overall accuracy (OA, %) on the DeepGlobe benchmark:

Methods | Urban Land | Agriculture Land | Forest Land | Water | Barren Land | Rangeland | Mean F1 | OA
---|---|---|---|---|---|---|---|---
FCN-8s [22] | 65.29 | 66.87 | 50.39 | 64.82 | 69.01 | 57.88 | 62.38 | 63.16 |
SegNet [23] | 67.01 | 67.32 | 47.75 | 67.03 | 72.82 | 62.93 | 64.14 | 65.58 |
U-Net [24] | 65.92 | 68.65 | 56.78 | 69.83 | 74.68 | 71.92 | 67.96 | 68.83 |
DeepLab V3+ [60] | 68.97 | 71.82 | 59.41 | 73.06 | 78.14 | 75.24 | 71.11 | 72.34 |
CBAM [46] | 73.36 | 79.77 | 63.43 | 76.74 | 81.15 | 79.87 | 75.72 | 77.65 |
DANet [48] | 75.26 | 79.64 | 65.03 | 77.91 | 81.63 | 81.30 | 76.79 | 78.41 |
ResUNet-a [61] | 80.82 | 83.15 | 72.46 | 76.57 | 83.33 | 82.46 | 79.80 | 80.20 |
SCAttNet [51] | 79.57 | 83.20 | 67.28 | 81.50 | 87.15 | 85.90 | 80.77 | 81.79 |
HCANet [32] | 78.75 | 80.23 | 68.54 | 80.20 | 85.06 | 81.81 | 79.10 | 81.85 |
ICTNet | 87.62 | 90.14 | 78.56 | 85.18 | 92.34 | 91.40 | 87.54 | 86.95 |
Ablation results, reported as mean F1/OA (%):

Models | Vaihingen | Potsdam | DeepGlobe
---|---|---|---
ICTNet | 92.34/90.14 | 93.00/91.57 | 87.54/86.95 |
ICTNet-S | 90.26/87.66 | 90.64/88.79 | 76.51/80.01 |
Ablation results, reported as mean F1/OA (%):

Models | Vaihingen | Potsdam | DeepGlobe
---|---|---|---
ICTNet | 92.34/90.14 | 93.00/91.57 | 87.54/86.95 |
ICTNet-B | 85.50/83.04 | 88.56/86.35 | 74.02/83.06 |
ICTNet-M | 90.13/87.52 | 90.88/89.28 | 75.19/84.37 |
Ablation results, reported as mean F1/OA (%):

Models | Vaihingen | Potsdam | DeepGlobe
---|---|---|---
STB-only | 81.55/79.61 | 89.55/88.17 | 76.63/78.25 |
CB-only | 72.85/71.11 | 77.83/76.63 | 68.70/69.58 |
ICTNet | 92.34/90.14 | 93.00/91.57 | 87.54/86.95 |