Multi-Scale DenseNets-Based Aircraft Detection from Remote Sensing Images
Figure 1. Structure of the dense block. There are two dense connections in this dense block. "BN + ReLU" represents Batch Normalization followed by the Rectified Linear Unit activation function. "Conv" represents a convolution layer. The short dashed arrow represents the nonlinear transformation H(·); the long dashed arrow represents a dense connection.
Figure 2. Structure of the transition layer. s represents the size of the feature maps and t represents the number of channels. "BN + ReLU" represents Batch Normalization followed by the Rectified Linear Unit activation function. "Conv" represents a convolution layer. "Pool" represents average pooling.
Figure 3. DenseNet-based feature pyramid network. Ci_lateral represents the output of the lateral connection, and Pi′_upsample represents the output of upsampling. C5 represents the output of dense block (4).
Figure 4. Overall framework of aircraft detection from remote sensing images. "Fc" represents the fully connected layer, "Cls" the proposal classification layer, "Reg" the proposal regression layer, "NMS" the non-maximum suppression algorithm, and "ROI" the region of interest.
Figure 5. Distribution of aircraft shapes in the training set: (a) width, (b) height, and (c) aspect ratio.
Figure 6. The P–R curves of different Compact-DenseNets.
Figure 7. Detection examples on the DOTA dataset.
Figure 8. Detection examples of our method on RSOD and UCAS-AOD.
Figure 9. Comparison of feature maps between MS-DenseNet-65 and MS-DenseNet-121: (a) input image; (b) P3 feature of MS-DenseNet-65; (c) P3 feature of MS-DenseNet-121; (d) P4 feature of MS-DenseNet-65; (e) P4 feature of MS-DenseNet-121; (f) P5 feature of MS-DenseNet-65; (g) P5 feature of MS-DenseNet-121.
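The building blocks in Figures 1 and 2 can be sketched in PyTorch as follows. This is a minimal illustration of the dense connection H(·) = BN → ReLU → Conv and the transition layer, not the authors' implementation; the growth rate and channel counts are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One nonlinear transformation H(.) from Figure 1: BN -> ReLU -> Conv."""

    def __init__(self, in_ch: int, growth_rate: int = 32):
        # `growth_rate` (new channels added per layer) is a placeholder value.
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(torch.relu(self.bn(x)))
        # Dense connection: concatenate the input with the new feature maps,
        # so every later layer sees the outputs of all earlier layers.
        return torch.cat([x, out], dim=1)

class TransitionLayer(nn.Module):
    """Figure 2: BN -> ReLU -> 1x1 Conv -> 2x2 average pool.

    Halves the feature-map size s and compresses the channel count t.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(torch.relu(self.bn(x))))
```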
Abstract
1. Introduction
- (1) We introduced DenseNet as the backbone and constructed an MS-DenseNet using the feature pyramid network (FPN) [25], which not only enhances the propagation of features but also comprehensively utilizes both the bottom-level high-resolution features and the top-level semantically strong features. Additionally, we applied a multi-scale region proposal network (MS-RPN) that produces multi-scale proposals, each responsible for targets of the corresponding scale, ensuring effective detection of small aircraft.
- (2) We developed a new compact structure named MS-DenseNet-65, which effectively improves the detection of small aircraft while costing less time in both training and testing. By eliminating some unneeded convolution layers, DenseNet-65 reduces the destruction of the bottom-level high-resolution features and protects the information of small aircraft targets, which is otherwise easily submerged by redundant features.
- (3) We proposed a multi-scale training strategy and designed a suitable testing scale for detection, which allows the network to learn aircraft targets at different scales and resolutions, thus improving the robustness and generalization ability of the proposed model (a minimal sketch of the scale-sampling idea appears after this list).
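The following is a hedged PyTorch sketch of one way such multi-scale training could be wired up: one scale from {768, 896, 1024, 1152, 1280} (the training scales used in Section 3.4) is drawn per iteration, and the image's shorter side is resized to it. The helper names and the per-iteration sampling granularity are our assumptions, not details from the paper.

```python
import random
import torch
import torch.nn.functional as F

# Training scales from the multi-scale experiments in Section 3.4.
TRAIN_SCALES = [768, 896, 1024, 1152, 1280]

def sample_training_scale() -> int:
    # Draw one scale per iteration so the network sees aircraft at
    # varying resolutions across mini-batches (sampling granularity
    # is our assumption).
    return random.choice(TRAIN_SCALES)

def resize_to_scale(image: torch.Tensor, boxes: torch.Tensor, scale: int):
    """Resize so the shorter image side equals `scale`; rescale boxes to match.

    `image` is a CxHxW tensor; `boxes` is an Nx4 tensor of (x1, y1, x2, y2).
    """
    _, h, w = image.shape
    ratio = scale / min(h, w)
    new_size = (round(h * ratio), round(w * ratio))
    image = F.interpolate(image.unsqueeze(0), size=new_size,
                          mode="bilinear", align_corners=False).squeeze(0)
    return image, boxes * ratio
```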
2. Materials and Methods
2.1. Dense Convolutional Network
2.2. Proposed Method
2.2.1. Compact-DenseNets
2.2.2. DenseNet-Based Feature Pyramid Networks
- (1) P5′ is generated from C5 through a 1 × 1 convolution layer, called a lateral connection. This operation reduces the number of feature-map channels to 256 and recombines the features simultaneously: $P5' = \mathrm{Conv}_{1 \times 1}(C5)$.
- (2) Semantically coarser but higher-resolution feature maps are generated from P5′ via nearest-neighbor upsampling with a step size of 2: $P4'_{\mathrm{upsample}} = \mathrm{Upsample}_{\times 2}(P5')$. Meanwhile, $C4_{\mathrm{lateral}}$ is generated from C4 via a lateral connection: $C4_{\mathrm{lateral}} = \mathrm{Conv}_{1 \times 1}(C4)$.
- (3) Since $P4'_{\mathrm{upsample}}$ has the same size as $C4_{\mathrm{lateral}}$, we fuse them by element-wise addition: $P4' = C4_{\mathrm{lateral}} + P4'_{\mathrm{upsample}}$.
- (4) Using the same operations as steps (2)-(3), P3′ and P2′ are successively generated.
- (5) To eliminate the aliasing effect caused by upsampling, a 3 × 3 convolution layer is applied to {P2′, P3′, P4′, P5′}, yielding {P2, P3, P4, P5}, whose scales correspond to the original features {C2, C3, C4, C5}. Then P6, which has the strongest semantic information, is generated from P5′ by downsampling. The feature maps {P2, P3, P4, P5, P6} are ultimately used for prediction. (A PyTorch sketch of this top-down pathway follows the list.)
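Assuming the lateral/upsample naming above, the five steps might be implemented as the following PyTorch sketch. The 256-channel width follows step (1); the C2-C5 input channel counts and the use of max pooling as the P6 downsampling operator are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFPN(nn.Module):
    """Top-down pathway of Section 2.2.2 (a sketch, not the authors' code)."""

    def __init__(self, in_chs=(128, 256, 512, 1024), out_ch=256):
        # `in_chs` are placeholder channel counts for C2..C5.
        super().__init__()
        # Step 1 / lateral connections: 1x1 convs reducing each Ci to 256 channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, kernel_size=1) for c in in_chs)
        # Step 5: 3x3 convs that remove upsampling aliasing from each Pi'.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1) for _ in in_chs)

    def forward(self, c2, c3, c4, c5):
        cs = [c2, c3, c4, c5]
        p = self.lateral[3](c5)               # P5' = Conv_1x1(C5)
        ps = [p]
        for i in (2, 1, 0):                   # Steps 2-4: build P4', P3', P2'
            up = F.interpolate(p, scale_factor=2, mode="nearest")
            p = self.lateral[i](cs[i]) + up   # element-wise addition
            ps.insert(0, p)
        # Step 5: smooth each Pi'; P6 from P5' by stride-2 downsampling
        # (pooling is our assumed downsampling operator).
        outs = [self.smooth[i](ps[i]) for i in range(4)]
        p6 = F.max_pool2d(ps[3], kernel_size=1, stride=2)
        return (*outs, p6)
```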
2.2.3. Multi-Scale Region Proposal Network
2.2.4. Multi-Scale Training
2.2.5. Aircraft Detection Process
3. Experiments and Results
3.1. Implementation Details
3.1.1. Experimental Dataset
3.1.2. Anchor Settings
3.1.3. Parameter Setting
3.2. Evaluation Metrics
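The subsections below report recall, precision, and F1-score. As a quick reference, here is a minimal helper (ours, not code from the paper) computing them from true-positive, false-positive, and false-negative counts; for example, recall 0.918 and precision 0.917 give F1 ≈ 0.917, matching the MS-DenseNet-65 row in the comparison table.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Recall, precision, and F1-score as reported in the tables below.

    Counts are assumed to be taken at a fixed IoU threshold (e.g., 0.5).
    """
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1_score": f1}
```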
3.3. Performance Comparison of Different Compact-DenseNets
3.4. Influence of Different Training and Testing Scales
3.5. Comparison with Other Methods
3.6. Transferability
4. Discussion
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Li, W.; Xiang, S.; Wang, H.; Pan, C. Robust airplane detection in satellite images. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 2821–2824. [Google Scholar] [CrossRef]
- Liu, G.; Sun, X.; Fu, K.; Wang, H. Aircraft recognition in high-resolution satellite images using coarse-to-fine shape prior. IEEE Geosci. Remote Sens. Lett. 2013, 10, 573–577. [Google Scholar] [CrossRef]
- Bo, S.; Jing, Y. Region-based airplane detection in remotely sensed imagery. In Proceedings of the 3rd International Congress on Image and Signal Processing, Yantai, China, 16–18 October 2010; pp. 1923–1926. [Google Scholar] [CrossRef]
- Yildiz, C.; Polat, E. Detection of stationary aircrafts from satellite images. In Proceedings of the 19th IEEE Signal Processing & Communications Applications Conference, Antalya, Turkey, 20–22 April 2011; pp. 518–521. [Google Scholar] [CrossRef]
- Sun, H.; Sun, X.; Wang, H.; Li, Y.; Li, X. Automatic Target Detection in High-Resolution Remote Sensing Images Using Spatial Sparse Coding Bag-of-Words Model. IEEE Geosci. Remote Sens. Lett. 2011, 9, 109–113. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 886–893. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna, Italy, 13–15 May 2010. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
- Zhao, B.; Huang, B.; Zhong, Y. Transfer learning with fully pretrained deep convolution networks for land-use classification. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1436–1440. [Google Scholar] [CrossRef]
- Mboga, N.; Georganos, S.; Grippa, T.; Moritz, L.; Vanhuysse, S.; Wolff, E. Fully convolutional networks and geographic object-based image analysis for the classification of VHR imagery. Remote Sens. 2019, 11, 597. [Google Scholar] [CrossRef]
- Hu, S.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
- Han, X.; Zhong, Y.; Zhang, L. An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sens. 2017, 9, 666. [Google Scholar] [CrossRef]
- Ammour, N.; Alihichri, H.; Bazi, Y.; Benjdira, B.; Alajlan, N.; Zuair, M. Deep learning approach for car detection in UAV imagery. Remote Sens. 2017, 9, 312. [Google Scholar] [CrossRef]
- Cai, B.; Jiang, Z.; Zhang, H.; Zhao, D.; Yao, Y. Airport detection using end-to-end convolutional neural network with hard example mining. Remote Sens. 2017, 9, 1198. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef]
- Uijlings, J.R.R.; Van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Cai, Z.W.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. arXiv 2017, arXiv:1712.00726. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
- Xie, H.; Wang, T.; Qiao, M.; Zhang, M.; Shan, G.; Snoussi, H. Robust object detection for tiny and dense targets in VHR aerial images. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 6397–6401. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
- Ren, Y.; Zhu, C.; Xiao, S. Deformable Faster R-CNN with Aggregating Multi-Layer Features for Partially Occluded Object Detection in Optical Remote Sensing Images. Remote Sens. 2018, 10, 1470. [Google Scholar] [CrossRef]
- Guo, W.; Yang, W.; Zhang, H.; Hua, G. Geospatial Object Detection in High Resolution Satellite Images Based on Multi-Scale Convolutional Neural Network. Remote Sens. 2018, 10, 131. [Google Scholar] [CrossRef]
- Zhang, Z.; Wang, H.; Zhang, J.; Yang, W. Airplane Detection in Remote Sensing Image Based on Faster R-CNN Algorithm. J. Nanjing Norm. Univ. Nat. Sci. Ed. 2018, 41, 85–92. [Google Scholar]
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
- Zhuang, S.; Wang, P.; Jiang, B.; Wang, G.; Wang, C. A Single Shot Framework with Multi-Scale Feature Fusion for Geospatial Object Detection. Remote Sens. 2019, 11, 594. [Google Scholar] [CrossRef]
- Zheng, Z.; Liu, Y.; Pan, C.; Li, G. Application of improved YOLO V3 for remote sensing image aircraft recognition. Electron. Opt. Control 2019, 26, 32–36. [Google Scholar]
- Guo, Z.; Song, P.; Zhang, Y.; Yan, M.; Sun, X.; Sun, H. Aircraft detection method based on deep convolutional neural network for remote sensing images. J. Electron. Inf. Technol. 2018, 40, 149–155. [Google Scholar]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
- Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar] [CrossRef]
- Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
| Layer | DenseNet-41 | DenseNet-65 | DenseNet-77 |
|---|---|---|---|
| Conv_1 | 7 × 7 conv, stride 2 | 7 × 7 conv, stride 2 | 7 × 7 conv, stride 2 |
| Pool_1 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 |
| Dense block (1) | | | |
| Transition layer: Conv_2 | 1 × 1 conv | 1 × 1 conv | 1 × 1 conv |
| Transition layer: Pool_2 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 |
| Dense block (2) | | | |
| Transition layer: Conv_3 | 1 × 1 conv | 1 × 1 conv | 1 × 1 conv |
| Transition layer: Pool_3 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 |
| Dense block (3) | | | |
| Transition layer: Conv_4 | 1 × 1 conv | 1 × 1 conv | 1 × 1 conv |
| Transition layer: Pool_4 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 | 2 × 2 average pool, stride 2 |
| Dense block (4) | | | |
| Classification layer | 7 × 7 global average pool, fully-connected layer, softmax | 7 × 7 global average pool, fully-connected layer, softmax | 7 × 7 global average pool, fully-connected layer, softmax |
Target | Target Amount | Percentage |
---|---|---|
Small targets | 1854 | 48.41% |
Medium targets | 1209 | 31.56% |
Large targets | 767 | 20.03% |
Total | 3830 | 100% |
Model | Recall | Precision | F1-Score | Train Time (s/iter) | Test Time (s/image) |
---|---|---|---|---|---|
MS-DenseNet-41 | 0.897 | 0.890 | 0.893 | 0.120 | 0.089
MS-DenseNet-65 | 0.918 | 0.917 | 0.917 | 0.156 | 0.094 |
MS-DenseNet-77 | 0.905 | 0.926 | 0.915 | 0.170 | 0.096
MS-DenseNet-121 | 0.904 | 0.923 | 0.913 | 0.193 | 0.124 |
| Model | Recall (small targets) | Recall (medium targets) | Recall (large targets) |
|---|---|---|---|
MS-DenseNet-41 | 0.825 | 0.970 | 0.973 |
MS-DenseNet-65 | 0.860 | 0.972 | 0.966 |
MS-DenseNet-77 | 0.839 | 0.969 | 0.964 |
MS-DenseNet-121 | 0.825 | 0.974 | 0.973 |
| Group | Train Scale | Test Scale | Recall | Precision | F1-Score | Test Time (s/image) |
|---|---|---|---|---|---|---|
| 1 | 768 | 768 | 0.907 | 0.909 | 0.908 | 0.077 |
| 2 | 768 | 1024 | 0.932 | 0.897 | 0.912 | 0.094 |
| 3 | 768 | 1280 | 0.898 | 0.877 | 0.887 | 0.128 |
| 4 | 896 | 896 | 0.907 | 0.909 | 0.908 | 0.088 |
| 5 | 896 | 1152 | 0.933 | 0.906 | 0.919 | 0.115 |
| 6 | 896 | 1408 | 0.918 | 0.890 | 0.904 | 0.147 |
| 7 | 1024 | 1024 | 0.918 | 0.917 | 0.917 | 0.095 |
| 8 | 1024 | 1280 | 0.926 | 0.907 | 0.916 | 0.126 |
| 9 | 1024 | 1536 | 0.919 | 0.891 | 0.905 | 0.179 |
| 10 | 1152 | 1152 | 0.916 | 0.920 | 0.918 | 0.110 |
| 11 | 1152 | 1408 | 0.934 | 0.913 | 0.923 | 0.145 |
| 12 | 1152 | 1664 | 0.925 | 0.901 | 0.913 | 0.204 |
| 13 | 1280 | 1280 | 0.919 | 0.923 | 0.921 | 0.129 |
| 14 | 1280 | 1536 | 0.935 | 0.914 | 0.924 | 0.178 |
| 15 | 1280 | 1792 | 0.927 | 0.908 | 0.917 | 0.238 |
| 16 | 768, 896, 1024, 1152, 1280 | 768 | 0.911 | 0.923 | 0.917 | 0.076 |
| 17 | 768, 896, 1024, 1152, 1280 | 1024 | 0.940 | 0.914 | 0.927 | 0.094 |
| 18 | 768, 896, 1024, 1152, 1280 | 1280 | 0.940 | 0.903 | 0.921 | 0.127 |
| 19 | 768, 896, 1024, 1152, 1280 | 1536 | 0.930 | 0.888 | 0.908 | 0.178 |
| Method | Backbone | Recall | Precision | F1-Score | Train Time (s/iter) | Test Time (s/image) |
|---|---|---|---|---|---|---|
| Faster R-CNN | ResNet-50 | 0.890 | 0.946 | 0.917 | 0.545 | 0.111 |
| Faster R-CNN | ResNet-101 | 0.886 | 0.948 | 0.916 | 0.683 | 0.134 |
| RetinaNet | ResNet-50 | 0.860 | 0.920 | 0.889 | 0.142 | 0.086 |
| SSD | VGGNet-16 | 0.650 | 0.943 | 0.774 | 0.102 | 0.025 |
| Ours | MS-DenseNet-65 | 0.940 | 0.914 | 0.927 | 0.168 | 0.094 |
| Method | Backbone | Recall (small targets) | Recall (medium targets) | Recall (large targets) |
|---|---|---|---|---|
| Faster R-CNN | ResNet-50 | 0.799 | 0.978 | 0.974 |
| Faster R-CNN | ResNet-101 | 0.791 | 0.978 | 0.971 |
| RetinaNet | ResNet-50 | 0.738 | 0.986 | 0.957 |
| SSD | VGGNet-16 | 0.405 | 0.891 | 0.943 |
| Ours | MS-DenseNet-65 | 0.898 | 0.982 | 0.973 |
| Target | UCAS-AOD Amount | UCAS-AOD Percentage | RSOD Amount | RSOD Percentage |
|---|---|---|---|---|
| Small targets | 2531 | 33.83% | 4148 | 77.19% |
| Medium targets | 4402 | 58.83% | 1226 | 22.81% |
| Large targets | 549 | 7.34% | 0 | 0% |
| Total | 7482 | 100% | 5374 | 100% |
| Dataset | Recall (small targets) | Recall (medium targets) | Recall (large targets) | Precision | F1-Score |
|---|---|---|---|---|---|
| UCAS-AOD | 0.934 | 0.987 | 0.990 | 0.957 | 0.963 |
| RSOD | 0.887 | 0.974 | − | 0.948 | 0.928 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).