Abstract
With the development of deep learning, Remote Sensing Image (RSI) semantic segmentation has advanced significantly. However, due to the sparse distribution of objects and the high similarity between classes, semantic segmentation of RSI remains extremely challenging. In this paper, we propose HST-UNet, a novel hybrid semantic segmentation framework for RSI that employs a Shunted Transformer as the encoder and a Multi-Scale Convolutional Attention Network (MSCAN) as the decoder, overcoming the shortcomings of existing models by extracting and recovering both the global and local features of RSI. In addition, to better fuse information from the encoder and the decoder and to alleviate semantic ambiguity, we design a Learnable Weighted Fusion (LWF) module that effectively connects encoder features to the decoder. Extensive experiments demonstrate that the proposed HST-UNet outperforms state-of-the-art methods, achieving F1 score/MIoU of 71.44%/83.00% on the ISPRS Vaihingen dataset and 77.36%/87.09% on the ISPRS Potsdam dataset. The code will be available at https://github.com/HC-Zhou/HST-UNet.
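The abstract describes the architecture only at a high level. As a rough, non-authoritative illustration of the encoder-decoder wiring with a learnable weighted fusion of skip features, the following PyTorch sketch uses plain convolutional blocks as hypothetical stand-ins for the Shunted Transformer encoder stages and the MSCAN decoder blocks; all module names, channel sizes, and the fusion formula here are assumptions, and the authors' repository linked above contains the actual implementation.

```python
# Minimal, self-contained sketch of an HST-UNet-like layout (hypothetical stand-ins;
# see https://github.com/HC-Zhou/HST-UNet for the authors' real implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableWeightedFusion(nn.Module):
    """One plausible reading of LWF: blend the encoder skip feature with the
    upsampled decoder feature using a learnable per-channel weight."""

    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learnable weight

    def forward(self, enc_feat, dec_feat):
        w = torch.sigmoid(self.alpha)  # keep weights in (0, 1)
        return w * enc_feat + (1.0 - w) * dec_feat


def conv_block(c_in, c_out):
    """Placeholder block standing in for a Shunted Transformer / MSCAN stage."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))


class HSTUNetSketch(nn.Module):
    def __init__(self, num_classes=6, chs=(64, 128, 256, 512)):
        super().__init__()
        # Encoder: four stages producing multi-scale skip features.
        self.enc = nn.ModuleList(conv_block(c_in, c_out)
                                 for c_in, c_out in zip((3,) + chs[:-1], chs))
        # Decoder: three upsampling stages, each joined to a skip via LWF.
        self.dec = nn.ModuleList(conv_block(c, c_skip)
                                 for c, c_skip in zip(chs[:0:-1], chs[-2::-1]))
        self.lwf = nn.ModuleList(LearnableWeightedFusion(c) for c in chs[-2::-1])
        self.head = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for stage in self.enc:
            x = stage(x)
            skips.append(x)
            x = F.max_pool2d(x, 2)  # downsample between encoder stages
        x = skips[-1]
        for stage, fuse, skip in zip(self.dec, self.lwf, reversed(skips[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = fuse(skip, stage(x))  # project channels, then fuse with the skip
        return self.head(x)


if __name__ == "__main__":
    logits = HSTUNetSketch()(torch.randn(1, 3, 256, 256))
    print(logits.shape)  # torch.Size([1, 6, 256, 256]) under this sketch
```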
Data availability
The data that support the findings of this study are available at http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html with corresponding permission.
Acknowledgments
This study was supported by the National Natural Science Foundation of China (Grant Nos. 62006049, 62172113, and 62072123), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515010939), the Project of the Education Department of Guangdong Province (Grant Nos. 2022KTSCX068 and 2020ZDZX3059), the Humanities and Social Science Project of the Ministry of Education (Grant No. 18JDGC012), the Guangdong Science and Technology Project (Grant Nos. KTP20210197 and 2017A040403068), and the Guangdong Science and Technology Innovation Strategy Special Fund Project (Climbing Plan) (No. pdjh2022b0302).
Author information
Contributions
Huacong Zhou performed conceptualization, software, and writing—original draft; Xiangling Xiao performed validation, formal analysis, and writing—review and editing; Huihui Li performed methodology and funding acquisition; Xiaoyong Liu performed supervision and funding acquisition; and Peng Liang performed resources, validation, investigation, and funding acquisition.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, H., Xiao, X., Li, H. et al. Hybrid Shunted Transformer embedding UNet for remote sensing image semantic segmentation. Neural Comput & Applic 36, 15705–15720 (2024). https://doi.org/10.1007/s00521-024-09888-4