FFEDet: Fine-Grained Feature Enhancement for Small Object Detection
Figure 1. The overall pipeline of the proposed FFEDet. DarkNet53 serves as the primary network architecture, extracting features at four distinct levels. The extracted features are further enhanced by ECFA-PAN, and object detection is finally performed on three feature maps containing rich semantic information at different levels.
Figure 2. Composition of the CBAM structure, which consists of a channel attention module (CAM) and a spatial attention module (SAM). The CAM applies global average pooling (GAP) and global max pooling (GMP) along the spatial dimensions, while the SAM applies channel average pooling (CAP) and channel max pooling (CMP) along the channel dimension.
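CBAM is a standard published module (Woo et al.), so a minimal PyTorch sketch of the structure described above may be useful; the reduction ratio and the 7 × 7 spatial kernel are common defaults and are assumptions here, not values taken from this paper.

```python
# Minimal CBAM sketch: channel attention from GAP/GMP over the spatial dims,
# spatial attention from CAP/CMP over the channel dim. Hyper-parameters
# (reduction=16, 7x7 kernel) are common defaults, not this paper's settings.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # GAP branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # GMP branch
        return torch.sigmoid(avg + mx)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        cap = torch.mean(x, dim=1, keepdim=True)   # CAP
        cmp_ = torch.amax(x, dim=1, keepdim=True)  # CMP
        return torch.sigmoid(self.conv(torch.cat([cap, cmp_], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x):
        x = x * self.cam(x)      # refine channels first
        return x * self.sam(x)   # then spatial locations


# Example: cbam = CBAM(256); y = cbam(torch.randn(1, 256, 40, 40))
```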
Figure 3. Composition of the SPPCSPC structure. The SPPCSPC module processes the input feature map through multi-scale pooling and convolution operations to generate higher-dimensional features, followed by several convolutions and concatenations, ultimately outputting the enhanced feature map.
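Below is a hedged sketch of an SPPCSPC-style block following the layout widely used in YOLOv7 implementations (a CSP split combined with SPP max-pooling); the pool sizes (5, 9, 13) and channel widths are assumptions rather than values confirmed by this paper.

```python
# SPPCSPC-style block sketch: a main branch with convs and multi-scale max
# pooling, a CSP shortcut branch, and a final fusion conv. Pool sizes and
# channel ratios are assumptions.
import torch
import torch.nn as nn


def conv_bn_act(c_in, c_out, k=1, s=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )


class SPPCSPC(nn.Module):
    def __init__(self, c_in, c_out, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_hidden = c_out // 2
        # Main branch: convs, multi-scale pooling, then fuse.
        self.cv1 = nn.Sequential(
            conv_bn_act(c_in, c_hidden, 1),
            conv_bn_act(c_hidden, c_hidden, 3),
            conv_bn_act(c_hidden, c_hidden, 1),
        )
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        self.cv2 = nn.Sequential(
            conv_bn_act(c_hidden * (len(pool_sizes) + 1), c_hidden, 1),
            conv_bn_act(c_hidden, c_hidden, 3),
        )
        # Shortcut (CSP) branch.
        self.cv3 = conv_bn_act(c_in, c_hidden, 1)
        # Final fusion of both branches.
        self.cv4 = conv_bn_act(2 * c_hidden, c_out, 1)

    def forward(self, x):
        y = self.cv1(x)
        y = torch.cat([y] + [p(y) for p in self.pools], dim=1)  # multi-scale pooling
        y = self.cv2(y)
        return self.cv4(torch.cat([y, self.cv3(x)], dim=1))     # fuse with shortcut
```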
Figure 4. Composition of the ECFA structure, which receives feature maps from three hierarchical scales, namely $P_{i-1}$, $P_i$, and $P_{i+1}$. Downsampling and upsampling operations are applied to $P_{i-1}$ and $P_{i+1}$, respectively. The resulting outputs are then added to $P_i$ and concatenated, with spatial and channel attention mechanisms applied.
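ECFA is introduced by this paper, so the following is only a minimal sketch of the data flow stated in the caption: $P_{i-1}$ is downsampled, $P_{i+1}$ is upsampled, both are fused with $P_i$, and spatial/channel attention is applied to the concatenated result. The sampling operators, channel widths (equal channels per level are assumed), and the small attention block are placeholders, not the paper's design.

```python
# Sketch of the ECFA data flow described in the caption; all operator choices
# (strided 3x3 conv for downsampling, nearest-neighbour upsampling, the small
# attention block, equal channels per level) are assumptions.
import torch
import torch.nn as nn


class SimpleAttention(nn.Module):
    """Channel attention (GAP + MLP) followed by spatial attention (7x7 conv)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x.mean(dim=1, keepdim=True))


class ECFASketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # for P_{i-1}
        self.up = nn.Upsample(scale_factor=2, mode="nearest")              # for P_{i+1}
        self.attn = SimpleAttention(2 * channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, p_prev, p_i, p_next):
        a = self.down(p_prev) + p_i       # higher-resolution level, downsampled then added
        b = self.up(p_next) + p_i         # lower-resolution level, upsampled then added
        fused = torch.cat([a, b], dim=1)  # concatenate the two aligned maps
        return self.fuse(self.attn(fused))


# Example with three levels at 80/40/20 resolution and 256 channels each:
# ecfa = ECFASketch(256)
# out = ecfa(torch.randn(1, 256, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 256, 20, 20))
```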
Figure 5. The structure of CSP, ELAN, and E-ELAN. (a) CSP: the input is passed through two branches; one branch employs a recurrent residual structure for multiple iterations, and the outputs of both branches are then concatenated along the channel dimension. (b) ELAN and E-ELAN: in the ELAN module, feature fusion is accomplished by integrating the output of each stacked module layer, while in the E-ELAN module, feature fusion is achieved by incorporating the output of each convolutional layer within the stacked module.
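For reference, a compact sketch of the CSP pattern in (a): the input is split into two branches, one of which runs through stacked residual blocks, and the branch outputs are concatenated along the channel dimension. Widths and depth here are illustrative, not taken from the paper.

```python
# CSP-style block sketch: two branches, one with stacked residual bottlenecks,
# concatenated along the channel dimension and fused by a 1x1 conv.
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.conv2(self.act(self.conv1(x))))  # residual connection


class CSPBlock(nn.Module):
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c_half = c_out // 2
        self.branch_a = nn.Conv2d(c_in, c_half, 1)           # shortcut-style branch
        self.branch_b = nn.Sequential(
            nn.Conv2d(c_in, c_half, 1),
            *[Bottleneck(c_half) for _ in range(n)],          # stacked residual blocks
        )
        self.fuse = nn.Conv2d(2 * c_half, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))
```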
Figure 6. The structure of SEConv and S-ELAN. (a) SEConv: convolutions with varied receptive fields extract multi-scale information, and pointwise convolutions capture inter-channel dependencies. (b) S-ELAN: the input is split into two branches, employing stacked SEConv modules and residual fusion.
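SEConv and S-ELAN are this paper's own designs, so the sketch below only illustrates the ideas named in the caption (parallel convolutions with different receptive fields, a pointwise convolution for inter-channel dependencies, and a two-branch block with stacked SEConv units and residual fusion); the kernel sizes, the use of depthwise convolutions, branch widths, and depth are all assumptions.

```python
# Hedged sketch of the SEConv / S-ELAN ideas named in the caption; the exact
# module designs in FFEDet may differ.
import torch
import torch.nn as nn


class SEConvSketch(nn.Module):
    """Multi-receptive-field convolution with pointwise channel mixing."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # 3x3 receptive field
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)  # 5x5 receptive field
        self.pointwise = nn.Conv2d(channels, channels, 1)  # inter-channel dependencies
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.branch3(x) + self.branch5(x)       # fuse multi-scale spatial information
        return self.act(self.pointwise(y))


class SELANSketch(nn.Module):
    """Two-branch block with stacked SEConv units and residual fusion."""

    def __init__(self, channels: int, depth: int = 2):
        super().__init__()
        self.shortcut = nn.Conv2d(channels, channels, 1)
        self.stacked = nn.Sequential(*[SEConvSketch(channels) for _ in range(depth)])
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        out = self.fuse(torch.cat([self.shortcut(x), self.stacked(x)], dim=1))
        return out + x                               # residual fusion
```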
Figure 7. Partial examples from the datasets; (a–d) correspond to PASCAL VOC, VisDrone-DET2021, TGRS-HRRSD, and DOTAv1.0, respectively.
Figure 8. Qualitative examples of small object scene detection on PASCAL VOC. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row displays, from left to right, the ground truth bounding boxes, the detection results of the YOLOv7 algorithm, and those of our algorithm.
Figure 9. Qualitative examples of small object scene detection on VisDrone-DET2021. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row displays, from left to right, the ground truth bounding boxes and the detection results of the YOLOv5, YOLOv7, and YOLOv8 algorithms and of our algorithm.
Figure 10. Qualitative examples of small object scene detection on TGRS-HRRSD. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row displays, from left to right, the ground truth bounding boxes and the detection results of the YOLOv5, YOLOv7, and YOLOv8 algorithms and of our algorithm.
Figure 11. Qualitative examples of small object scene detection on DOTAv1.0. (a) Qualitative example of image 1. (b) Qualitative example of image 2. (c) Qualitative example of image 3. Each row displays, from left to right, the ground truth bounding boxes and the detection results of the YOLOv5, YOLOv7, and YOLOv8 algorithms and of our algorithm.
Abstract
1. Introduction
- We present the efficient cross-scale information fusion attention (ECFA) module, which fuses information across different scales through attention mechanisms, effectively reducing feature redundancy while improving the representation of small objects;
- We develop SEConv, a simple and highly efficient convolutional module that effectively reduces computational redundancy and provides multi-scale receptive fields, resulting in enhanced feature learning capabilities;
- We design DFSLoss, a dynamic focal sample weighting function, to address the imbalance between hard and easy samples and improve network optimization. Moreover, we introduce Wise-IoU to alleviate the negative effect of low-quality examples on model convergence (an illustrative sketch of these two ideas follows this list).
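For intuition only, the sketch below shows the two ingredients in generic form: a focal-style weight that down-weights easy samples in the classification term, and a per-sample weight on an IoU-based regression term that can damp the influence of low-quality boxes. The formulas and hyper-parameters are illustrative assumptions, not the paper's DFSLoss or Wise-IoU.

```python
# Conceptual sketch (not the paper's DFSLoss/Wise-IoU): focal-style sample
# weighting for classification plus a per-sample weighted IoU regression term.
import torch
import torch.nn.functional as F


def focal_weighted_bce(logits, targets, gamma: float = 2.0):
    """Binary cross-entropy where easy samples are down-weighted by (1 - p_t)^gamma."""
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)   # probability assigned to the true class
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return ((1 - p_t) ** gamma * bce).mean()


def weighted_iou_loss(iou, quality_weight):
    """IoU loss (1 - IoU) scaled per sample; a low weight suppresses the gradient
    contributed by low-quality (outlier) boxes."""
    return (quality_weight * (1.0 - iou)).mean()


# Example usage with dummy predictions.
logits = torch.randn(8)                      # raw classification scores
targets = torch.randint(0, 2, (8,)).float()  # binary labels
iou = torch.rand(8)                          # IoU between predicted and ground-truth boxes
weight = torch.ones(8)                       # e.g. lowered for boxes judged low quality
loss = focal_weighted_bce(logits, targets) + weighted_iou_loss(iou, weight)
```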
2. Related Work
2.1. Scale-Aware Methods
2.2. Feature Fusion Methods
2.3. Context Modeling Methods
2.4. Loss Functions
3. Methods
3.1. Revisiting YOLOv7
3.2. Efficient Cross-Scale Information Fusion Attention
3.3. Simple and Efficient Convolution Module
3.4. Dynamic Focal Sample Weighting Function
3.5. Similarity Measurement
4. Experiments
4.1. Implementation Details
4.2. Datasets
4.3. Evaluation Metrics
4.4. Results and Analysis
4.5. Ablation Study
Method | Params(M) | FLOPs(G) | AP50 | AP
---|---|---|---|---
Conv(Baseline) | 37.62 | 106.5 | 49.1 | 28.0 |
DWConv | 34.12 | 98.3 | 48.9 | 27.4 |
SCConv | 35.36 | 99.9 | 48.4 | 26.8 |
PConv | 30.52 | 80.3 | 48.5 | 27.3 |
GSConv | 29.54 | 77.1 | 48.1 | 26.9 |
SEConv(Ours) | 35.84 | 102.8 | 49.7 | 28.2 |
4.6. Comparative Experiments
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
- Zhang, G.; Luo, Z.; Chen, Y.; Zheng, Y.; Lin, W. Illumination unification for person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6766–6777. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Zhang, G.; Fang, W.; Zheng, Y.; Wang, R. A Spatial Dual-Branch Attention Dehazing Network based on Meta-Former Paradigm. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 60–70. [Google Scholar] [CrossRef]
- Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef]
- Zhang, G.; Zhang, H.; Lin, W.; Chandran, A.K.; Jing, X. Camera contrast learning for unsupervised person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2023, 4096–4107. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar]
- Zhang, G.; Liu, J.; Chen, Y.; Zheng, Y.; Zhang, H. Multi-biometric unified network for cloth-changing person re-identification. IEEE Trans. Image Process. 2023, 32, 4555–4566. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Ultralytics. ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. 2022. Available online: https://github.com/ultralytics/yolov5 (accessed on 7 May 2023).
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
- Yang, C.; Huang, Z.; Wang, N. Querydet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
- Aibibu, T.; Lan, J.; Zeng, Y.; Lu, W.; Gu, N. An efficient rep-style gaussian–wasserstein network: Improved uav infrared small object detection for urban road surveillance and safety. Remote Sens. 2023, 16, 25. [Google Scholar] [CrossRef]
- Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
- Xu, C.; Wang, J.; Yang, W.; Yu, L. Dot distance for tiny object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1192–1201. [Google Scholar]
- Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
- Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-enhanced CenterNet for small object detection in remote sensing images. Remote Sens. 2022, 14, 5488. [Google Scholar] [CrossRef]
- Kong, T.; Yao, A.; Chen, Y.; Sun, F. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 845–853. [Google Scholar]
- Yang, F.; Choi, W.; Lin, Y. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2129–2137. [Google Scholar]
- Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–370. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
- Zhang, H.; Wang, K.; Tian, Y.; Gou, C.; Wang, F.Y. MFR-CNN: Incorporating multi-scale features and global information for traffic object detection. IEEE Trans. Veh. Technol. 2018, 67, 8019–8030. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100. [Google Scholar]
- Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
- Wang, M.; Li, Q.; Gu, Y.; Pan, J. Highly Efficient Anchor-Free Oriented Small Object Detection for Remote Sensing Images via Periodic Pseudo-Domain. Remote Sens. 2023, 15, 3854. [Google Scholar] [CrossRef]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
- Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
- Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
- Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
- Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. VisDrone-DET2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854. [Google Scholar]
- Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
- Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
- Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
- Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
- Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
- Zhu, Y.; Zhou, Q.; Liu, N.; Xu, Z.; Ou, Z.; Mou, X.; Tang, J. Scalekd: Distilling scale-aware knowledge in small object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19723–19733. [Google Scholar]
- Ultralytics. YOLO by Ultralytics (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 April 2023).
- Ozpoyraz, B.; Dogukan, A.T.; Gevez, Y.; Altun, U.; Basar, E. Deep learning-aided 6G wireless networks: A comprehensive survey of revolutionary PHY architectures. IEEE Open J. Commun. Soc. 2022, 3, 1749–1809. [Google Scholar] [CrossRef]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11632–11641. [Google Scholar]
- Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 526–543. [Google Scholar]
Method | Params(M) | FLOPs | Input | Output |
---|---|---|---|---|
Conv(3 × 3) | 0.59 | 24.2 | (1,256,640,640) | (1,256,640,640) |
SEConv | 0.21 | 8.6 | (1,256,640,640) | (1,256,640,640) |
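As a quick sanity check on the Params(M) column, the count for the plain 3 × 3 convolution can be reproduced directly (3 × 3 × 256 × 256 = 589,824, roughly 0.59 M without bias); the FLOPs entry depends on the counting convention and input resolution and is not reproduced here, and SEConv's internals are not spelled out by this table.

```python
# Reproduce the parameter count of the bias-free 3x3 convolution in the table:
# 3 * 3 * 256 * 256 = 589,824 ≈ 0.59 M parameters.
import torch.nn as nn

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
n_params = sum(p.numel() for p in conv.parameters())
print(f"{n_params:,} parameters ≈ {n_params / 1e6:.2f} M")   # 589,824 ≈ 0.59 M
```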
Configuration | Parameter |
---|---|
Operating System | Ubuntu 18.04 |
GPU | NVIDIA RTX A6000 |
CUDA | 12.2 |
Framework | PyTorch 2.0.1 |
Programming Language | Python 3.8 |
| Datasets | Method | Params(M) | AP50 | AP | AP75 | APS | APM | APL | FPS(f/s) |
|---|---|---|---|---|---|---|---|---|---|
| PASCAL VOC | Baseline | 37.62 | 84.4 ± 0.2 | 62.5 ± 0.1 | 67.8 ± 0.3 | 25.9 ± 0.2 | 50.9 ± 0.1 | 71.4 ± 0.2 | 68.6 |
| | YOLOv5l | 46.20 | 80.5 ± 0.3 | 58.8 ± 0.2 | 62.9 ± 0.4 | 21.6 ± 0.3 | 44.0 ± 0.2 | 66.7 ± 0.3 | 73.2 |
| | YOLOv8l | 43.60 | 82.7 ± 0.4 | 62.3 ± 0.3 | 64.0 ± 0.2 | 24.9 ± 0.3 | 48.7 ± 0.2 | 68.2 ± 0.4 | 70.5 |
| | Ours | 39.52 | 85.5 ± 0.2 | 63.8 ± 0.1 | 69.3 ± 0.3 | 27.0 ± 0.2 | 51.7 ± 0.1 | 73.0 ± 0.2 | 69.2 |
| VisDrone-DET2021 | Baseline | 37.62 | 49.1 ± 0.3 | 28.0 ± 0.2 | 27.6 ± 0.2 | 18.7 ± 0.1 | 39.1 ± 0.3 | 49.3 ± 0.2 | 71.6 |
| | YOLOv5l | 46.20 | 41.4 ± 0.2 | 24.4 ± 0.1 | 24.8 ± 0.2 | 16.1 ± 0.1 | 34.0 ± 0.2 | 41.7 ± 0.3 | 77.4 |
| | YOLOv8l | 43.60 | 42.9 ± 0.3 | 25.9 ± 0.2 | 26.4 ± 0.2 | 17.4 ± 0.1 | 33.9 ± 0.2 | 43.1 ± 0.3 | 60.8 |
| | Ours | 39.52 | 53.9 ± 0.2 | 32.1 ± 0.1 | 32.7 ± 0.2 | 23.2 ± 0.1 | 42.9 ± 0.2 | 49.7 ± 0.2 | 72.7 |
| TGRS-HRRSD | Baseline | 37.62 | 89.8 ± 0.2 | 68.9 ± 0.1 | 81.5 ± 0.3 | 28.7 ± 0.2 | 58.1 ± 0.1 | 60.3 ± 0.2 | 74.5 |
| | YOLOv5l | 46.20 | 89.1 ± 0.3 | 63.9 ± 0.2 | 75.0 ± 0.4 | 30.2 ± 0.3 | 54.0 ± 0.2 | 58.7 ± 0.3 | 74.1 |
| | YOLOv8l | 43.60 | 89.9 ± 0.2 | 69.4 ± 0.1 | 80.9 ± 0.3 | 28.6 ± 0.2 | 60.4 ± 0.1 | 62.3 ± 0.2 | 69.8 |
| | Ours | 39.52 | 91.4 ± 0.2 | 70.1 ± 0.1 | 83.1 ± 0.3 | 31.0 ± 0.2 | 61.0 ± 0.1 | 63.4 ± 0.2 | 69.9 |
| DOTAv1.0 | Baseline | 37.62 | 76.7 ± 0.2 | 51.6 ± 0.1 | 54.1 ± 0.2 | 25.4 ± 0.1 | 51.3 ± 0.2 | 60.5 ± 0.1 | 73.4 |
| | YOLOv5l | 46.20 | 73.0 ± 0.3 | 49.0 ± 0.2 | 50.9 ± 0.3 | 20.5 ± 0.2 | 45.3 ± 0.1 | 58.6 ± 0.2 | 68.3 |
| | YOLOv8l | 43.60 | 74.5 ± 0.2 | 52.9 ± 0.1 | 56.1 ± 0.2 | 24.9 ± 0.1 | 50.4 ± 0.2 | 57.3 ± 0.1 | 74.4 |
| | Ours | 39.52 | 78.2 ± 0.2 | 53.1 ± 0.1 | 55.0 ± 0.2 | 26.8 ± 0.1 | 53.7 ± 0.2 | 60.9 ± 0.1 | 70.6 |
| ECFA | SEConv | DFSLoss | WIoUv3 | AP50 | AP |
|---|---|---|---|---|---|
| | | | | 49.1 | 28.0 |
| √ | | | | 52.0 | 30.6 |
| | √ | | | 49.7 | 28.2 |
| | | √ | | 50.3 | 28.6 |
| | | | √ | 50.5 | 29.2 |
| √ | √ | | | 52.4 | 30.8 |
| | | √ | √ | 51.1 | 29.7 |
| √ | √ | √ | | 53.1 | 29.8 |
| √ | √ | | √ | 53.5 | 31.9 |
| √ | √ | √ | √ | 53.9 | 32.1 |
Method | AP50 | AP | AP75
---|---|---|---
CIoU [39] + DFSLoss | 50.3 | 28.6 | 28.1 |
GIoU [37] + DFSLoss | 50.2 | 28.7 | 28.3 |
DIoU [38] + DFSLoss | 50.1 | 28.3 | 27.9 |
SIoU [50] + DFSLoss | 49.7 | 27.9 | 27.4 |
EIoU [51] + DFSLoss | 49.3 | 27.4 | 27.1 |
MPDIoU [52] + DFSLoss | 50.0 | 28.1 | 27.8 |
WIoUv1 + DFSLoss | 50.4 | 28.9 | 28.3 |
WIoUv2 + DFSLoss | 50.9 | 29.3 | 28.8 |
WIoUv3 + DFSLoss | 51.1 | 29.7 | 29.3 |
Method | Backbone | Params(M) | AP50 | AP | AP75 | FPS(f/s)
---|---|---|---|---|---|---
YOLOv3 [5] | Darknet53 | 61.53 | 40.0 | 22.2 | 22.4 | 54.6 |
YOLOv4 [16] | Darknet53 | 52.50 | 39.2 | 23.5 | 23.4 | 55.0 |
YOLOv5l [17] | Darknet53 | 46.20 | 41.4 | 24.4 | 24.8 | 77.4 |
YOLOX [15] | Darknet53 | 54.20 | 39.1 | 22.4 | 22.7 | 68.9 |
YOLOv6l [18] | EfficientRep | 58.50 | 41.8 | 25.4 | 25.8 | 116 |
YOLOv8l [55] | Darknet | 43.60 | 42.9 | 25.9 | 26.4 | 60.8 |
CascadeNet [56] | ResNet101 | 184.00 | 47.1 | 28.8 | 29.3 | - |
RetinaNet [57] | ResNet50 | 59.20 | 44.9 | 26.2 | 27.1 | 54.1 |
HRDNet [53] | ResNet18 + 101 | 63.60 | 49.3 | 28.3 | 28.2 | - |
GFLV2 [58] (CVPR 2021) | ResNet50 | 72.50 | 50.7 | 28.7 | 28.4 | 19.4 |
RFLA [59] (ECCV 2022) | ResNet50 | 57.30 | 45.3 | 27.4 | - | - |
QueryDet [19] (CVPR 2022) | ResNet50 | - | 48.1 | 28.3 | 28.8 | 14.9 |
ScaleKD [54] (CVPR 2023) | ResNet50 | 43.57 | 49.3 | 29.5 | 30.0 | 20.1 |
Ours | DarkNet53 | 39.52 | 53.9 | 32.1 | 32.7 | 72.7
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhao, F.; Zhang, J.; Zhang, G. FFEDet: Fine-Grained Feature Enhancement for Small Object Detection. Remote Sens. 2024, 16, 2003. https://doi.org/10.3390/rs16112003