Transformer with Transfer CNN for Remote-Sensing-Image Object Detection
"> Figure 1
<p>The overview architecture of the proposed Transformer-based RSI object-detection framework.</p> "> Figure 2
<p>The framework of the proposed TRD.</p> "> Figure 3
<p>The diagram of the deformable attention module.</p> "> Figure 4
<p>The framework of the proposed attention-based transferring backbone.</p> "> Figure 5
<p>Several examples of composite samples for RSI object detection. The samples are from the DIOR data set, whose size is <math display="inline"><semantics> <mrow> <mn>800</mn> <mo>×</mo> <mn>800</mn> </mrow> </semantics></math>, and their spatial resolutions are between 30 m and 0.5 m. The size of composite images is set at <math display="inline"><semantics> <mrow> <mn>1600</mn> <mo>×</mo> <mn>1600</mn> </mrow> </semantics></math>.</p> "> Figure 5 Cont.
<p>Several examples of composite samples for RSI object detection. The samples are from the DIOR data set, whose size is <math display="inline"><semantics> <mrow> <mn>800</mn> <mo>×</mo> <mn>800</mn> </mrow> </semantics></math>, and their spatial resolutions are between 30 m and 0.5 m. The size of composite images is set at <math display="inline"><semantics> <mrow> <mn>1600</mn> <mo>×</mo> <mn>1600</mn> </mrow> </semantics></math>.</p> "> Figure 6
Figure 6. Qualitative inference results of T-TRD-DA on the NWPU VHR-10 data set.
Figure 7. Comparison between T-TRD-DA (left) and YOLO v3 (right) on the NWPU VHR-10 data set.
Figure 8. Qualitative inference results on the DIOR data set.
Figure 9. The precision–recall (p–r) curve of the detectors on each category of the DIOR data set.
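To make the mechanism in Figure 3 concrete, below is a minimal, single-head, single-scale sketch in the style of Deformable DETR, not the authors' implementation: each query predicts a small number of sampling offsets around its reference point plus matching attention weights, bilinearly samples the feature map at those locations, and aggregates the samples. The class name, argument layout, and the assumption that offsets are predicted in normalized [0, 1] units are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Illustrative single-head, single-scale deformable attention: each
    query attends only to n_points locations predicted around its own
    reference point (a sketch, not the paper's code)."""

    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, 2 * n_points)  # (dx, dy) per point
        self.weight_proj = nn.Linear(dim, n_points)      # one weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Q, C); ref_points: (B, Q, 2) in [0, 1]; feat: (B, C, H, W)
        B, Q, C = queries.shape
        # project values while keeping the spatial layout for sampling
        value = self.value_proj(feat.flatten(2).transpose(1, 2))
        value = value.transpose(1, 2).reshape(B, C, feat.shape[2], feat.shape[3])
        # offsets predicted in normalized [0, 1] units (an assumption)
        offsets = self.offset_proj(queries).reshape(B, Q, self.n_points, 2)
        weights = self.weight_proj(queries).softmax(-1)            # (B, Q, K)
        # map sampling locations into grid_sample's [-1, 1] coordinates
        locs = 2.0 * (ref_points.unsqueeze(2) + offsets) - 1.0     # (B, Q, K, 2)
        sampled = F.grid_sample(value, locs, align_corners=False)  # (B, C, Q, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1)             # (B, C, Q)
        return self.out_proj(out.transpose(1, 2))                  # (B, Q, C)
```

Because each query touches only K sampled points instead of all H × W positions, the cost per query stays constant in the feature-map size, which is why deformable attention scales well to the large feature maps typical of RSI.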
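Figure 5 shows the composite samples used for data augmentation. Below is a minimal sketch of one plausible way to build such a sample, stitching four 800 × 800 DIOR images onto a 1600 × 1600 canvas and shifting their annotations accordingly; the function name and box format are illustrative assumptions, not the paper's code.

```python
import numpy as np

def make_composite(samples, tile=800):
    """Sketch: place four (image, boxes) pairs on a 2x2 grid, producing one
    (2*tile) x (2*tile) composite with correspondingly shifted boxes.
    Boxes are assumed to be (x_min, y_min, x_max, y_max) in pixels."""
    assert len(samples) == 4
    canvas = np.zeros((2 * tile, 2 * tile, 3), dtype=np.uint8)
    composite_boxes = []
    for idx, (img, boxes) in enumerate(samples):
        row, col = divmod(idx, 2)        # position in the 2x2 grid
        dy, dx = row * tile, col * tile  # pixel offset of this tile
        canvas[dy:dy + tile, dx:dx + tile] = img[:tile, :tile]
        for x0, y0, x1, y1 in boxes:     # shift annotations with the tile
            composite_boxes.append((x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return canvas, composite_boxes
```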
Abstract
1. Introduction
2. The Proposed Transformer-Based RSI Object-Detection Framework
2.1. The Framework of the Proposed TRD
2.2. The Deformable Attention Module
2.3. The Attention-Based Transferring Backbone
2.4. Data Augmentation for RSI Object Detection
3. Data Sets and Experimental Settings
3.1. Data Description
3.2. Evaluation Metrics
3.3. Baseline Methods
3.4. Implementation Details
4. Experimental Results and Discussion
4.1. Comparison Results on the NWPU VHR-10 Data Set
4.2. Comparison Results on the DIOR Data Set
4.3. Ablation Experiments
4.4. Comparison of the Computational Complexity and Inference Speed
4.5. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28.
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2019, 159, 296–307.
- Lou, X.; Huang, D.; Fan, L.; Xu, A. An image classification algorithm based on bag of visual words and multi-kernel learning. J. Multimed. 2014, 9, 269–277.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Sun, H.; Sun, X.; Wang, H.; Li, Y.; Li, X. Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model. IEEE Geosci. Remote Sens. Lett. 2012, 9, 109–113.
- Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015.
- Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415.
- Cheng, G.; Han, J.; Zhou, P.; Xu, D. Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Trans. Image Process. 2019, 28, 265–278.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2337–2348.
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Zhang, X.; Zhu, K.; Chen, G.; Tan, X.; Zhang, L.; Dai, F.; Liao, P.; Gong, Y. Geospatial object detection on high resolution remote sensing imagery based on double multi-scale feature pyramid network. Remote Sens. 2019, 11, 755.
- Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22.
- Guo, W.; Yang, W.; Zhang, H.; Hua, G. Geospatial object detection in high resolution satellite images based on multi-scale convolutional neural network. Remote Sens. 2018, 10, 131.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Pham, M.-T.; Courtrai, L.; Friguet, C.; Lefèvre, S.; Baussard, A. YOLO-Fine: One-Stage Detector of Small Objects Under Various Backgrounds in Remote Sensing Images. Remote Sens. 2020, 12, 2501.
- Alganci, U.; Soydas, M.; Sertel, E. Comparative Research on Deep Learning Approaches for Airplane Detection from Very High-Resolution Satellite Images. Remote Sens. 2020, 12, 458.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
- Zhuang, S.; Wang, P.; Jiang, B.; Wang, G.; Wang, C. A Single Shot Framework with Multi-Scale Feature Fusion for Geospatial Object Detection. Remote Sens. 2019, 11, 594.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
- He, X.; Chen, Y.; Lin, Z. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498.
- Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. arXiv 2021, arXiv:2107.02988.
- Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens. 2021, 13, 4143.
- Zheng, Y.; Sun, P.; Zhou, Z.; Xu, W.; Ren, Q. ADT-Det: Adaptive Dynamic Refined Single-Stage Transformer Detector for Arbitrary-Oriented Object Detection in Satellite Optical Imagery. Remote Sens. 2021, 13, 2623.
- Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens. 2021, 13, 4779.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
- Lin, Z.; Feng, M.; Santos, C.N.D.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A structured self-attentive sentence embedding. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Aurelio, Y.; Almeida, G.; Castro, C.; Braga, A. Learning from imbalanced data sets with weighted cross-entropy function. Neural Process. Lett. 2019, 50, 1937–1949.
- Cramer, M. The DGPF-test on digital airborne camera evaluation: Overview and test design. PFG Photogramm.-Fernerkund. Geoinf. 2010, 2, 73–82.
- Han, X.; Zhong, Y.; Zhang, L. An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sens. 2017, 9, 666.
- Xu, Z.; Xu, X.; Wang, L.; Yang, R.; Pu, F. Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery. Remote Sens. 2017, 9, 1312.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155.
Comparison results on the NWPU VHR-10 data set: AP (×100) per category and mAP (×100). ST = storage tank, BD = baseball diamond, TC = tennis court, BC = basketball court, GTF = ground track field.

| Method | Plane | Ship | ST | BD | TC | BC | GTF | Harbor | Bridge | Vehicle | mAP (×100) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SSCBOW [5] | 50.6 | 50.8 | 33.4 | 43.5 | 0.3 | 15.0 | 10.1 | 58.3 | 12.5 | 33.6 | 30.8 |
| COPD [6] | 62.3 | 68.9 | 63.7 | 83.3 | 32.1 | 36.3 | 85.3 | 55.3 | 14.8 | 44.0 | 54.6 |
| RICNN [10] | 88.4 | 77.3 | 85.3 | 88.1 | 40.8 | 58.5 | 85.7 | 68.6 | 61.5 | 71.1 | 72.6 |
| R-P-Faster R-CNN [39] | 90.4 | 75.0 | 44.4 | 89.9 | 79.0 | 77.6 | 87.7 | 79.1 | 68.2 | 73.2 | 76.5 |
| YOLO v3 [20] | 90.6 | 63.1 | 70.9 | 94.8 | 83.8 | 68.6 | 92.1 | 76.2 | 58.1 | 65.7 | 76.4 |
| Deformable R-FCN [40] | 87.3 | 81.4 | 63.6 | 90.4 | 81.6 | 74.1 | 90.3 | 75.3 | 71.4 | 75.5 | 79.1 |
| Faster R-CNN [12] | 92.0 | 76.0 | 54.1 | 95.4 | 75.6 | 71.3 | 90.1 | 76.0 | 69.0 | 63.8 | 76.3 |
| Faster R-CNN with FPN [17] | 93.9 | 72.3 | 68.2 | 95.7 | 91.9 | 75.6 | 88.5 | 86.4 | 66.8 | 80.9 | 82.0 |
| TRD | 99.4 | 78.2 | 84.4 | 94.2 | 82.0 | 83.9 | 98.9 | 78.4 | 56.9 | 72.2 | 82.9 |
| T-TRD-DA | 99.0 | 81.0 | 79.6 | 98.1 | 89.2 | 88.3 | 86.5 | 92.6 | 74.7 | 89.6 | 87.9 |
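As a quick consistency check on how the last column is obtained, mAP is the unweighted mean of the ten per-category APs; for example, for the T-TRD-DA row above:

```python
# Per-category APs (x100) from the T-TRD-DA row of the table above
aps = [99.0, 81.0, 79.6, 98.1, 89.2, 88.3, 86.5, 92.6, 74.7, 89.6]
print(round(sum(aps) / len(aps), 1))  # 87.9, the reported mAP (x100)
```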
Overall mAP (×100) on the NWPU VHR-10 data set, together with mAP (×100) for objects of different scales.

| Method | mAP (×100) | Large | Middle | Small |
|---|---|---|---|---|
| YOLO v3 | 76.4 | 74.2 | 69.9 | 52.3 |
| Faster R-CNN | 76.3 | 76.5 | 73.0 | 35.2 |
| Faster R-CNN with FPN | 82.0 | 77.4 | 79.5 | 47.9 |
| TRD | 82.9 | 79.8 | 75.6 | 43.7 |
| T-TRD-DA | 87.9 | 80.8 | 83.6 | 65.7 |
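Per-scale evaluation requires assigning each ground-truth object to a size bucket. The exact cut-offs used in the paper are not stated in this excerpt; the sketch below assumes the common COCO-style thresholds of 32² and 96² pixels.

```python
# Hedged sketch: COCO-style size buckets (the thresholds are an assumption,
# not the paper's stated protocol).
def scale_bucket(box, small_max=32 ** 2, large_min=96 ** 2):
    x0, y0, x1, y1 = box                 # (x_min, y_min, x_max, y_max) pixels
    area = (x1 - x0) * (y1 - y0)
    if area < small_max:
        return "small"
    if area < large_min:
        return "middle"
    return "large"
```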
Comparison results on the DIOR data set: AP (×100) per class and overall mAP (×100).

| Class | RICNN [10] | YOLO v3 [20] | Faster R-CNN [12] | Faster R-CNN with FPN [17] | Mask R-CNN with FPN [41] | TRD | T-TRD-DA |
|---|---|---|---|---|---|---|---|
| Airplane | 39.1 | 72.2 | 57.6 | 63.2 | 53.8 | 72.9 | 77.9 |
| Airport | 61.0 | 29.2 | 68.6 | 61.3 | 72.3 | 79.3 | 80.5 |
| Baseball field | 60.1 | 74.0 | 62.4 | 66.3 | 63.2 | 70.0 | 70.1 |
| Basketball court | 66.3 | 78.6 | 83.7 | 85.5 | 81.0 | 83.8 | 86.3 |
| Bridge | 25.3 | 31.2 | 31.2 | 36.0 | 38.7 | 38.8 | 39.7 |
| Chimney | 63.3 | 69.7 | 73.9 | 73.9 | 72.6 | 77.8 | 77.9 |
| Dam | 41.1 | 26.9 | 42.2 | 45.0 | 55.9 | 58.5 | 59.3 |
| Expressway service area | 51.7 | 48.6 | 55.0 | 56.9 | 71.6 | 57.6 | 59.0 |
| Expressway toll station | 36.6 | 54.4 | 46.4 | 49.0 | 67.0 | 57.0 | 54.4 |
| Golf course | 55.9 | 31.1 | 65.6 | 73.2 | 73.0 | 75.2 | 74.6 |
| Ground track field | 58.9 | 61.1 | 61.4 | 67.5 | 75.8 | 70.5 | 73.9 |
| Harbor | 43.5 | 44.9 | 52.2 | 48.9 | 44.2 | 44.2 | 49.2 |
| Overpass | 39.0 | 49.7 | 51.0 | 54.7 | 56.5 | 55.0 | 57.8 |
| Ship | 9.1 | 87.4 | 48.0 | 73.2 | 71.9 | 73.5 | 74.2 |
| Stadium | 61.1 | 70.6 | 51.0 | 62.8 | 58.6 | 52.1 | 61.1 |
| Storage tank | 19.1 | 68.7 | 35.3 | 68.3 | 53.6 | 67.6 | 69.8 |
| Tennis court | 63.5 | 87.3 | 73.5 | 78.7 | 81.1 | 82.5 | 84.0 |
| Train station | 46.1 | 29.4 | 50.3 | 51.4 | 54.0 | 56.0 | 58.8 |
| Vehicle | 11.4 | 48.3 | 29.8 | 48.1 | 43.1 | 47.0 | 50.5 |
| Windmill | 31.5 | 78.7 | 69.6 | 70.1 | 81.1 | 73.2 | 77.2 |
| mAP (×100) | 44.2 | 57.1 | 55.4 | 61.7 | 63.5 | 64.6 | 66.8 |
Overall mAP (×100) on the DIOR data set, together with mAP (×100) for objects of different scales.

| Method | mAP (×100) | Large | Middle | Small |
|---|---|---|---|---|
| Faster R-CNN | 55.4 | 80.7 | 43.7 | 7.9 |
| Faster R-CNN with FPN | 61.7 | 81.9 | 45.7 | 18.4 |
| TRD | 64.6 | 83.2 | 50.1 | 20.4 |
| T-TRD-DA | 66.8 | 93.1 | 67.2 | 33.3 |
Ablation results (mAP), where the "T-" prefix denotes the attention-based transferring backbone (Section 2.3) and the "-DA" suffix denotes the proposed data augmentation (Section 2.4).

| Method | mAP on NWPU VHR-10 | mAP on DIOR |
|---|---|---|
| TRD | 0.829 | 0.646 |
| T-TRD | 0.835 | 0.650 |
| TRD-DA | 0.866 | 0.664 |
| T-TRD-DA | 0.879 | 0.668 |
Computational complexity (FLOPs) and inference speed (FPS) on the NWPU VHR-10 and DIOR data sets.

| Method | NWPU VHR-10 FLOPs (G) | NWPU VHR-10 Inference FPS | DIOR FLOPs (G) | DIOR Inference FPS |
|---|---|---|---|---|
| YOLO v3 | 121.27 | 42.8 | 121.41 | 33.2 |
| Faster R-CNN | 127.91 | 27.5 | 127.93 | 26.3 |
| Faster R-CNN with FPN | 135.25 | 22.4 | 135.30 | 19.6 |
| TRD | 125.63 | 14.2 | 125.67 | 13.2 |
| T-TRD-DA | 125.70 | 13.2 | 125.74 | 12.5 |
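For context on how the FPS column is typically produced (the paper's exact benchmarking protocol is not stated in this excerpt), the sketch below times repeated forward passes after a warm-up phase; `model` and `images` are placeholders, PyTorch is assumed, and FLOPs figures are usually obtained with a profiling tool rather than by hand.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=10, runs=100):
    """Time repeated forward passes and return passes-per-second
    (a measurement sketch, not the authors' benchmarking code)."""
    model.eval()
    for _ in range(warmup):          # untimed warm-up iterations
        model(images)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(images)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)
```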
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens. 2022, 14, 984. https://doi.org/10.3390/rs14040984