Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images
Figure 1. The structure of our proposed multi-scale context extraction module. It contains three parts, namely A, B, and C. Part A extracts global information. Part B is a parallel connection of atrous convolutions with different dilation rates. Part C is the feature map itself. GAP stands for global average pooling. Conv 1 × 1 represents a 1 × 1 convolution layer. UP denotes the upsampling operation. Concat means that the features are concatenated along the channel dimension.
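To make this structure concrete, here is a minimal PyTorch sketch of such a module. It is an illustrative reconstruction from the caption, not the authors' released code; the class name, the dilation rates, the branch width, and the use of bilinear upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleContextModule(nn.Module):
    """Sketch of the multi-scale context extraction module (Figure 1):
    Part A: global information (GAP -> Conv 1x1 -> UP),
    Part B: parallel two-layer atrous convolutions with different dilation rates,
    Part C: the input feature map itself; all parts concatenated by channel."""

    def __init__(self, in_ch, branch_ch=256, dilations=(1, 6, 12, 18)):  # rates assumed
        super().__init__()
        # Part A: 1x1 convolution applied to the globally pooled features.
        self.global_conv = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        # Part B: one two-layer atrous branch per dilation rate.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Part A: GAP -> Conv 1x1 -> upsample back to the input spatial size.
        g = self.global_conv(F.adaptive_avg_pool2d(x, 1))
        g = F.interpolate(g, size=(h, w), mode="bilinear", align_corners=False)
        # Part B: parallel atrous branches at different dilation rates.
        feats = [branch(x) for branch in self.branches]
        # Concat: Parts A, B, and C joined along the channel dimension.
        return torch.cat([g, *feats, x], dim=1)
```

With these defaults, the output has `branch_ch * (len(dilations) + 1) + in_ch` channels, so a 1 × 1 convolution typically follows to reduce them.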
Figure 2. Structure of the adaptive fusion module. A is the low-level feature map. B is the high-level feature map. B′ is the feature map obtained from B. C is the feature map formed by concatenating A and B′ along the channel dimension. D is the feature map obtained from C by changing the number of channels. E is the feature map adjusted by the channel weights. F is the final fused feature map.
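Below is a minimal sketch of such an adaptive fusion module following the A→F flow in the caption. The SE-style channel attention (after Hu et al. [27]), bilinear upsampling to obtain B′, the reduction ratio, and the residual connection used to form F are all assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusionModule(nn.Module):
    """Sketch of the adaptive fusion module (Figure 2): upsample the
    high-level map (B -> B'), concatenate with the low-level map (C),
    change channels with a 1x1 conv (D), then reweight channels (E, F)."""

    def __init__(self, low_ch, high_ch, out_ch, reduction=16):
        super().__init__()
        self.squeeze = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=1)  # C -> D
        # Channel-attention weights (assumed SE-style: GAP -> FC -> ReLU -> FC -> sigmoid).
        self.attn = nn.Sequential(
            nn.Linear(out_ch, out_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_ch // reduction, out_ch),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # B -> B': bring the high-level map to the low-level spatial size.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        c = torch.cat([low, high], dim=1)   # C: channel-wise concat of A and B'
        d = self.squeeze(c)                 # D: channel change by a 1x1 conv
        w = self.attn(F.adaptive_avg_pool2d(d, 1).flatten(1))  # per-channel weights
        e = d * w.view(w.size(0), -1, 1, 1) # E: features scaled by channel weights
        return e + d                        # F: final fused map (residual add assumed)
```

The multiplication by learned channel weights is what lets the module emphasize useful channels and suppress useless ones when fusing high- and low-level features.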
Figure 3. The overall structure of the proposed multi-scale adaptive feature fusion network (MANet). Part A is the backbone network. Part B is the multi-scale context extraction module. Part C is the high-level and low-level feature adaptive fusion module.
Figure 4. Images from the Potsdam and Vaihingen datasets and their corresponding labels.
Figure 5. Precision-recall (PR) curves for each category of the seven models on the Potsdam dataset.
Figure 6. Visual comparison of the seven models on the Potsdam dataset.
Figure 7. PR curves for each category of the seven models on the Vaihingen dataset.
Figure 8. Visual comparison of the seven models on the Vaihingen dataset.
Figure 9. Example from the Zurich dataset.
Abstract
1. Introduction
- We propose a multi-scale context extraction module. It consists of parallel two-layer atrous convolutions with different dilation rates, a global information branch, and the feature map itself. The module extracts features of the image at multiple scales; these features are concatenated to form new features, which tackle the problem of widely varying target sizes in the images.
- We design a high-level and low-level feature adaptive fusion module. It concatenates high- and low-level features to form new features and applies channel attention to them to obtain channel weights. These weights are multiplied with the fused features to emphasize useful features and suppress useless ones, which alleviates the misidentification of similar targets in remote sensing images.
- Based on the above modules, we construct an end-to-end network called MANet for semantic segmentation in remote sensing images (a sketch of the assembly follows this list). The performance of the proposed MANet on the Potsdam and Vaihingen datasets is compared to other state-of-the-art methods.
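To show how the pieces fit together, here is a minimal end-to-end sketch in the spirit of Figure 3, reusing the `MultiScaleContextModule` and `AdaptiveFusionModule` sketches above. The ResNet-101 backbone matches the ablation study (Res101), but the tap points, channel widths, and number of classes are assumptions, not the authors' released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet101

class MANetSketch(nn.Module):
    """Illustrative assembly: backbone (Part A) -> MCM (Part B) -> AFM (Part C)."""

    def __init__(self, num_classes=6):  # 6 ISPRS classes incl. clutter is an assumption
        super().__init__()
        backbone = resnet101(weights=None)
        # Low-level features from an early stage, high-level from the deepest stage.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)           # 256 ch
        self.deep = nn.Sequential(backbone.layer2, backbone.layer3,
                                  backbone.layer4)                             # 2048 ch
        self.mcm = MultiScaleContextModule(2048, branch_ch=256)  # out: 256*5 + 2048 = 3328 ch
        self.afm = AdaptiveFusionModule(low_ch=256, high_ch=3328, out_ch=256)
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        low = self.stem(x)            # Part A: low-level backbone features
        high = self.deep(low)         # Part A: high-level backbone features
        high = self.mcm(high)         # Part B: multi-scale context extraction
        fused = self.afm(low, high)   # Part C: adaptive high/low-level fusion
        logits = self.classifier(fused)
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear",
                             align_corners=False)
```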
2. Multi-Scale Adaptive Feature Fusion Algorithm
2.1. Multi-Scale Context Extraction Module
2.1.1. Global Information Extraction
2.1.2. Parallel Atrous Convolution Multi-Scale Context Extraction
2.2. Adaptive Fusion Module
2.3. Multi-Scale Adaptive Feature Fusion Network (MANet)
3. Experiment and Analysis
3.1. Datasets Description
3.2. Compared State-of-the-Art Methods
3.3. Training Details
3.4. Metrics
3.5. Experimental Results and Analysis
3.5.1. Experiments on the Potsdam Dataset
3.5.2. Experiments on the Vaihingen Dataset
3.6. Ablation Experiment
4. Discussion
4.1. Model Complexity
4.2. Experiments on Small-Scale Dataset
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
FCN | Fully Convolutional Networks |
CNNs | Convolutional Neural Networks |
MANet | Multi-Scale Adaptive Feature Fusion Network |
HOG | Histogram of Oriented Gradient |
SIFT | Scale-Invariant Feature Transform |
GSD | Ground Sampling Distance |
DSM | Digital Surface Model |
NDSM | Normalized Digital Surface Model |
IRRG | Infrared, Red and Green |
OA | Overall Accuracy |
MCM | Multi-Scale Context Extraction Module |
AFM | Adaptive Fusion Module |
Imp Sur | Impervious Surface |
Low Veg | Low Vegetation |
Pre Ave | Precision Average |
Recall Ave | Recall Average |
F1 Ave | F1 Average |
PR | Precision-Recall |
References
- Zhang, L.; Zhang, L.; Du, B. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
- Singh, V.; Misra, A.K. Detection of plant leaf diseases using image segmentation and soft computing techniques. Inf. Process. Agric. 2017, 4, 41–49.
- Wen, D.; Huang, X.; Liu, H.; Liao, W.; Zhang, L. Semantic classification of urban trees using very high resolution satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1413–1424.
- Shi, Y.; Qi, Z.; Liu, X.; Niu, N.; Zhang, H. Urban Land Use and Land Cover Classification Using Multisource Remote Sensing Images and Social Media Data. Remote Sens. 2019, 11, 2719.
- Matikainen, L.; Karila, K. Segment-based land cover mapping of a suburban area—Comparison of high-resolution remotely sensed datasets using classification trees and test field points. Remote Sens. 2011, 3, 1777–1804.
- Xu, S.; Pan, X.; Li, E.; Wu, B.; Bu, S.; Dong, W.; Xiang, S.; Zhang, X. Automatic building rooftop extraction from aerial images via hierarchical RGB-D priors. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7369–7387.
- Liu, W.; Yang, M.; Xie, M.; Guo, Z.; Li, E.; Zhang, L.; Pei, T.; Wang, D. Accurate Building Extraction from Fused DSM and UAV Images Using a Chain Fully Convolutional Neural Network. Remote Sens. 2019, 11, 2912.
- Xu, Y.; Xie, Z.; Feng, Y.; Chen, Z. Road extraction from high-resolution remote sensing imagery using deep learning. Remote Sens. 2018, 10, 1461.
- Shrestha, S.; Vanneschi, L. Improved fully convolutional network with conditional random fields for building extraction. Remote Sens. 2018, 10, 1135.
- Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
- Zhao, C.; Sun, L.; Stolkin, R. A fully end-to-end deep learning approach for real-time simultaneous 3D reconstruction and material recognition. In Proceedings of the 2017 18th International Conference on Advanced Robotics (ICAR), Hong Kong, China, 10–12 July 2017; pp. 75–82.
- Sun, L.; Zhao, C.; Yan, Z.; Liu, P.; Duckett, T.; Stolkin, R. A novel weakly-supervised approach for RGB-D-based nuclear waste object detection. IEEE Sens. J. 2018, 19, 3487–3500.
- Guo, S.; Jin, Q.; Wang, H.; Wang, X.; Wang, Y.; Xiang, S. Learnable gated convolutional neural network for semantic segmentation in remote-sensing images. Remote Sens. 2019, 11, 1922.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Kahaki, S.M.M.; Nordin, M.J.; Ashtari, A.H.; Zahra, S.J. Deformation invariant image matching based on dissimilarity of spatial features. Neurocomputing 2016, 175, 1009–1018.
- Shui, P.L.; Zhang, W.C. Corner detection and classification using anisotropic directional derivative representations. IEEE Trans. Image Process. 2013, 22, 3204–3218.
- Kahaki, S.M.M.; Nordin, M.J.; Ashtari, A.H. Contour-based corner detection and classification by using mean projection transform. Sensors 2014, 14, 4126–4143.
- Inglada, J. Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS J. Photogramm. Remote Sens. 2007, 62, 236–248.
- Wright, R.E. Logistic regression. In Reading and Understanding Multivariate Statistics; American Psychological Association: Washington, DC, USA, 1995; Chapter 7; pp. 217–244.
- Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31.
- Liu, Y.; Piramanayagam, S.; Monteiro, S.T.; Saber, E. Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 76–85.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3431–3440.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Wang, Y.; Liang, B.; Ding, M.; Li, J. Dense Semantic Labeling with Atrous Spatial Pyramid Pooling and Decoder for High-Resolution Remote Sensing Imagery. Remote Sens. 2019, 11, 20.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 818–833.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
- Konecny, G. The International Society for Photogrammetry and Remote Sensing (ISPRS) study on the status of mapping in the world. In International Workshop on "Global Geospatial Information"; Citeseer: Novosibirsk, Russia, 2013; pp. 4–24.
- Cheng, W.; Yang, W.; Wang, M.; Wang, G.; Chen, J. Context Aggregation Network for Semantic Labeling in Aerial Images. Remote Sens. 2019, 11, 1158.
- Volpi, M.; Tuia, D. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 55, 881–893.
- Nekrasov, V.; Dharmasiri, T.; Spek, A.; Drummond, T.; Shen, C.; Reid, I. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7101–7107.
- Kahaki, S.M.M.; Nordin, M.J.; Ashtari, A.H.; Zahra, S.J. Invariant feature matching for image registration application based on new dissimilarity of spatial features. PLoS ONE 2016, 11, e0149710.
- Volpi, M.; Ferrari, V. Semantic segmentation of urban scenes by learning local class interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
Epoch | 1–30 | 31–60 | 61–90 | 91–120 | 121–150 | 151–180 |
---|---|---|---|---|---|---|
Learning Rate | 0.001 | 0.0005 | 0.0001 | 0.00005 | 0.00001 | 0.000005 |
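This stepwise schedule can be reproduced with a standard PyTorch scheduler; below is a minimal sketch, assuming epochs are 1-indexed as in the table. The optimizer choice and the placeholder model are illustrative, not taken from the paper.

```python
import torch

# Piecewise-constant learning rates from the table above (one per 30-epoch block).
RATES = [0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005]

def lr_for_epoch(epoch):
    """Return the learning rate for a 1-indexed epoch in [1, 180]."""
    return RATES[min((epoch - 1) // 30, len(RATES) - 1)]

model = torch.nn.Conv2d(3, 6, 1)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=RATES[0])
# LambdaLR multiplies the base LR by the returned factor; its epoch counter is 0-indexed.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lr_for_epoch(e + 1) / RATES[0])

for epoch in range(1, 181):
    # ... one epoch of training ...
    scheduler.step()
```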
Results on the Potsdam dataset. Per-class columns are F1 scores; F1 Ave, Pre Ave, and Recall Ave are the class-averaged F1, precision, and recall; OA is overall accuracy.

Models | Imp Sur | Building | Low Veg | Tree | Car | F1 Ave | Pre Ave | Recall Ave | OA |
---|---|---|---|---|---|---|---|---|---|
FCN8s [28] | 0.864 | 0.927 | 0.806 | 0.833 | 0.662 | 0.818 | 0.829 | 0.811 | 0.846 |
U-net [29] | 0.881 | 0.922 | 0.813 | 0.838 | 0.869 | 0.865 | 0.862 | 0.868 | 0.853 |
UZ1 [42] | 0.869 | 0.893 | 0.825 | 0.835 | 0.887 | 0.862 | 0.861 | 0.864 | 0.846 |
DeepLabv3+ [31] | 0.905 | 0.953 | 0.845 | 0.857 | 0.896 | 0.891 | 0.889 | 0.893 | 0.883 |
LWRefineNet [43] | 0.909 | 0.953 | 0.846 | 0.849 | 0.896 | 0.890 | 0.890 | 0.891 | 0.884 |
APPD [32] | 0.910 | 0.958 | 0.848 | 0.853 | 0.894 | 0.893 | 0.898 | 0.889 | 0.884 |
MANet | 0.916 | 0.961 | 0.859 | 0.871 | 0.914 | 0.904 | 0.900 | 0.908 | 0.894 |
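Averaging the five per-class values reproduces the F1 Ave column (e.g., for MANet, (0.916 + 0.961 + 0.859 + 0.871 + 0.914)/5 ≈ 0.904). As a reference, here is a minimal NumPy sketch of how these metrics are conventionally computed from a confusion matrix; the standard definitions are assumed, and `segmentation_metrics` is an illustrative helper, not the paper's code.

```python
import numpy as np

def segmentation_metrics(conf):
    """Per-class precision/recall/F1 and overall accuracy (OA) from a
    confusion matrix `conf`, where conf[i, j] counts pixels of true
    class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    precision = tp / np.maximum(conf.sum(axis=0), 1)   # column sums: predicted totals
    recall = tp / np.maximum(conf.sum(axis=1), 1)      # row sums: ground-truth totals
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / conf.sum()                         # fraction of correct pixels
    return precision, recall, f1, oa
```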
Results on the Vaihingen dataset. Columns are as in the Potsdam table above.

Models | Imp Sur | Building | Low Veg | Tree | Car | F1 Ave | Pre Ave | Recall Ave | OA |
---|---|---|---|---|---|---|---|---|---|
FCN8s [28] | 0.838 | 0.890 | 0.753 | 0.822 | 0.326 | 0.726 | 0.781 | 0.710 | 0.823 |
U-net [29] | 0.838 | 0.878 | 0.749 | 0.847 | 0.311 | 0.725 | 0.780 | 0.710 | 0.823 |
UZ1 [42] | 0.872 | 0.902 | 0.788 | 0.863 | 0.728 | 0.830 | 0.838 | 0.825 | 0.855 |
DeepLabv3+ [31] | 0.891 | 0.935 | 0.792 | 0.866 | 0.721 | 0.841 | 0.860 | 0.830 | 0.870 |
LWRefineNet [43] | 0.887 | 0.935 | 0.807 | 0.866 | 0.747 | 0.848 | 0.853 | 0.844 | 0.872 |
APPD [32] | 0.889 | 0.936 | 0.798 | 0.867 | 0.760 | 0.850 | 0.855 | 0.835 | 0.872 |
MANet | 0.902 | 0.941 | 0.809 | 0.870 | 0.812 | 0.867 | 0.870 | 0.867 | 0.882 |
Ablation results on the Potsdam dataset (MCM: multi-scale context extraction module; AFM: adaptive fusion module).

Models | Imp Sur | Building | Low Veg | Tree | Car | Mean F1 | OA |
---|---|---|---|---|---|---|---|
Res101 | 0.882 | 0.932 | 0.805 | 0.817 | 0.789 | 0.845 | 0.850 |
Res101+MCM | 0.893 | 0.937 | 0.820 | 0.853 | 0.796 | 0.860 | 0.862 |
Res101+AFM | 0.904 | 0.951 | 0.844 | 0.862 | 0.909 | 0.894 | 0.883 |
Res101+MCM+AFM | 0.916 | 0.961 | 0.859 | 0.871 | 0.914 | 0.904 | 0.894 |
Model complexity of the seven models.

Models | Input Size | Parameters (M) | FLOPs (GFLOPs) | Test Time (ms) |
---|---|---|---|---|
FCN8s [28] | 512 × 512 | 512 | 189 | 27 |
U-net [29] | 512 × 512 | 27 | 3.5 | 10 |
UZ1 [42] | 512 × 512 | 22 | 221 | 22 |
DeepLabv3+ [31] | 512 × 512 | 226 | 89 | 19 |
LWRefineNet [43] | 512 × 512 | 176 | 51 | 16 |
APPD [32] | 512 × 512 | 229 | 91 | 20 |
MANet | 512 × 512 | 424 | 63 | 20 |
Results for each category on the Zurich dataset.

Models | Road | Building | Tree | Grass | Bare Soil | Water | Rail | Pool |
---|---|---|---|---|---|---|---|---|
FCN8s [28] | 0.517 | 0.586 | 0.609 | 0.669 | 0.587 | 0.868 | 0.023 | 0.622 |
U-net [29] | 0.698 | 0.777 | 0.665 | 0.729 | 0.558 | 0.946 | 0.023 | 0.859 |
UZ1 [42] | 0.693 | 0.784 | 0.695 | 0.756 | 0.510 | 0.946 | 0.316 | 0.813 |
DeepLabv3+ [31] | 0.709 | 0.809 | 0.687 | 0.786 | 0.619 | 0.948 | 0.224 | 0.839 |
LWRefineNet [43] | 0.711 | 0.812 | 0.731 | 0.803 | 0.607 | 0.941 | 0.150 | 0.814 |
APPD [32] | 0.702 | 0.813 | 0.740 | 0.809 | 0.637 | 0.947 | 0.391 | 0.799 |
MANet | 0.711 | 0.819 | 0.732 | 0.791 | 0.669 | 0.951 | 0.262 | 0.867 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images. Remote Sens. 2020, 12, 872. https://doi.org/10.3390/rs12050872