IV-YOLO: A Lightweight Dual-Branch Object Detection Network
Figure 1. The overall architecture of the dual-branch IV-YOLO network, designed for object detection in complex environments. The backbone extracts features from visible-light and infrared images in parallel and fuses the multi-scale features obtained from the P2–P5 layers. The fused features are fed into the infrared branch for deeper feature extraction. In the neck, a Shuffle-SPP module integrates the extracted features with those from the backbone through a three-layer upsampling process, enabling precise detection of objects at different scales. (An illustrative code sketch of this dual-branch fusion idea follows the figure list.)
Figure 2. Diagram of the C2F module.
Figure 3. Bidirectional Pyramid Feature Fusion (Bi-Fusion) structure.
Figure 4. Shuffle Attention Spatial Pyramid Pooling (Shuffle-SPP) structure.
Figure 5. Visualization of IV-YOLO detection results on the DroneVehicle dataset.
Figure 6. Visualization of IV-YOLO detection results on the FLIR dataset.
Figure 7. Visualization of IV-YOLO detection results on the KAIST pedestrian dataset.
Figure 8. Bar chart of mAP@0.5 in the ablation experiments.
Figure 9. Bar chart of mAP@0.5:0.95 in the ablation experiments.
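As a rough, non-authoritative illustration of the dual-branch design summarized in the Figure 1 caption, the PyTorch sketch below runs two small backbones in parallel over the visible and infrared inputs and fuses the resulting multi-scale feature maps scale by scale. The `TinyBackbone` and `DualBranchFusion` classes, the layer widths, and the concatenation-plus-1×1-convolution fusion are illustrative stand-ins chosen for this sketch; they are not the paper's actual C2F or Bi-Fusion modules.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Four stride-2 stages producing progressively downsampled feature maps
    (stand-ins for the P2-P5 outputs mentioned in the Figure 1 caption)."""

    def __init__(self, in_ch=3, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, ch = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(ch, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.SiLU(),
            ))
            ch = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [P2-like, P3-like, P4-like, P5-like]


class DualBranchFusion(nn.Module):
    """Run visible and infrared backbones in parallel and fuse each scale."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.vis = TinyBackbone(3, widths)
        self.ir = TinyBackbone(3, widths)
        # 1x1 convolutions stand in for the paper's Bi-Fusion blocks.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * w, w, kernel_size=1) for w in widths]
        )

    def forward(self, visible, infrared):
        vis_feats = self.vis(visible)
        ir_feats = self.ir(infrared)
        # Fuse the two modalities at every scale before the neck/head.
        return [
            f(torch.cat([v, i], dim=1))
            for f, v, i in zip(self.fuse, vis_feats, ir_feats)
        ]


if __name__ == "__main__":
    model = DualBranchFusion()
    rgb = torch.randn(1, 3, 640, 640)
    ir = torch.randn(1, 3, 640, 640)
    for feat in model(rgb, ir):
        print(feat.shape)  # fused maps at 320, 160, 80, and 40 pixels
```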
Abstract
1. Introduction
2. Related Work
2.1. Real-Time Object Detector
2.2. Deep Learning-Based Multimodal Image Fusion
2.3. Attention Mechanism
3. Methods
3.1. Overall Network Architecture
3.2. Feature Fusion Structure
3.2.1. Bidirectional Pyramid Feature Fusion Structure
3.2.2. Shuffle Attention Spatial Pyramid Pooling Structure
3.3. Loss Function
4. Results
4.1. Dataset Introduction
4.1.1. Drone Vehicle Dataset
4.1.2. FLIR Dataset
4.1.3. KAIST Dataset
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Analysis of Results
4.4.1. Experiment on the DroneVehicle Dataset
4.4.2. Experiments on the FLIR Dataset
4.4.3. Experiments Based on the KAIST Dataset
4.5. Parameter Analysis
4.6. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, H.; Xu, H.; Xiao, Y.; Guo, X.; Ma, J. Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12797–12804. [Google Scholar]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189. [Google Scholar]
- Qu, L.; Liu, S.; Wang, M.; Song, Z. TransMEF: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2126–2134. [Google Scholar]
- Munsif, M.; Khan, N.; Hussain, A.; Kim, M.J.; Baik, S.W. Darkness-Adaptive Action Recognition: Leveraging Efficient Tubelet Slow-Fast Network for Industrial Applications. IEEE Trans. Ind. Inform. 2024; early access. [Google Scholar]
- Munsif, M.; Khan, S.U.; Khan, N.; Baik, S.W. Attention-based deep learning framework for action recognition in a dark environment. Hum. Centric Comput. Inf. Sci. 2024, 14, 1–22. [Google Scholar]
- Wen, X.; Wang, F.; Feng, Z.; Lin, J.; Shi, C. MDFN: Multi-scale Dense Fusion Network for RGB-D Salient Object Detection. In Proceedings of the 2023 IEEE 3rd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 26–28 May 2023; Volume 3, pp. 730–734. [Google Scholar]
- Han, D.; Li, L.; Guo, X.; Ma, J. Multi-exposure image fusion via deep perceptual enhancement. Inf. Fusion 2022, 79, 248–262. [Google Scholar] [CrossRef]
- Hou, J.; Zhang, D.; Wu, W.; Ma, J.; Zhou, H. A generative adversarial network for infrared and visible image fusion based on semantic segmentation. Entropy 2021, 23, 376. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2849–2858. [Google Scholar]
- French, G.; Finlayson, G.; Mackiewicz, M. Multi-spectral pedestrian detection via image fusion and deep neural networks. J. Imaging Sci. Technol. 2018, 176–181. [Google Scholar]
- Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef]
- Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
- Kim, J.U.; Park, S.; Ro, Y.M. Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1510–1523. [Google Scholar] [CrossRef]
- Zhu, J.; Zhang, X.; Dong, F.; Yan, S.; Meng, X.; Li, Y.; Tan, P. Transformer-based Adaptive Interactive Promotion Network for RGB-T Salient Object Detection. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 15–17 August 2022; pp. 1989–1994. [Google Scholar]
- Ye, Z.; Peng, Y.; Han, B.; Hao, H.; Liu, W. Unmanned Aerial Vehicle Target Detection Algorithm Based on Infrared Visible Light Feature Level Fusion. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 1–5. [Google Scholar]
- Gao, Y.; Cheng, Z.; Su, H.; Ji, Z.; Hu, J.; Peng, Z. Infrared and Visible Image Fusion Method based on Residual Network. In Proceedings of the 2023 4th International Conference on Computer Engineering and Intelligent Control (ICCEIC), Guangzhou, China, 20–22 October 2023; pp. 366–370. [Google Scholar]
- Ataman, F.C.; Akar, G.B. Visible and infrared image fusion using encoder-decoder network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 1779–1783. [Google Scholar]
- Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1186–1196. [Google Scholar] [CrossRef]
- Wang, S.; Li, X.; Huo, W.; You, J. Fusion of infrared and visible images based on improved generative adversarial networks. In Proceedings of the 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Guangzhou, China, 22–24 July 2022; pp. 247–251. [Google Scholar]
- Liu, H.; Liu, H.; Wang, Y.; Sun, F.; Huang, W. Fine-grained multilevel fusion for anti-occlusion monocular 3d object detection. IEEE Trans. Image Process. 2022, 31, 4050–4061. [Google Scholar] [CrossRef]
- Bao, W.; Hu, J.; Huang, M.; Xu, Y.; Ji, N.; Xiang, X. Detecting Fine-Grained Airplanes in SAR Images with Sparse Attention-Guided Pyramid and Class-Balanced Data Augmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8586–8599. [Google Scholar] [CrossRef]
- Nousias, S.; Pikoulis, E.V.; Mavrokefalidis, C.; Lalos, A.S. Accelerating deep neural networks for efficient scene understanding in multi-modal automotive applications. IEEE Access 2023, 11, 28208–28221. [Google Scholar] [CrossRef]
- Poeppel, A.; Eymüller, C.; Reif, W. SensorClouds: A Framework for Real-Time Processing of Multi-modal Sensor Data for Human-Robot-Collaboration. In Proceedings of the 2023 9th International Conference on Automation, Robotics and Applications (ICARA), Abu Dhabi, United Arab Emirates, 10–12 February 2023; pp. 294–298. [Google Scholar]
- Ultralytics. YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 November 2023).
- Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- FLIR ADAS Dataset. 2022. Available online: https://www.flir.com/oem/adas/adas-dataset-form (accessed on 19 January 2022).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 September 2024).
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
- Farahnakian, F.; Heikkonen, J. Deep learning based multi-modal fusion architectures for maritime vessel detection. Remote Sens. 2020, 12, 2509. [Google Scholar] [CrossRef]
- Li, R.; Peng, Y.; Yang, Q. Fusion enhancement: UAV target detection based on multi-modal GAN. In Proceedings of the 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 15–17 September 2023; Volume 7, pp. 1953–1957. [Google Scholar]
- Pahde, F.; Puscas, M.; Klein, T.; Nabi, M. Multimodal prototypical networks for few-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Event, 5–9 January 2021; pp. 2644–2653. [Google Scholar]
- Liang, P.; Jiang, J.; Liu, X.; Ma, J. Fusion from decomposition: A self-supervised decomposition approach for image fusion. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 719–735. [Google Scholar]
- Dai, X.; Yuan, X.; Wei, X. TIRNet: Object detection in thermal infrared images for autonomous driving. Appl. Intell. 2021, 51, 1244–1261. [Google Scholar] [CrossRef]
- Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
- Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal infrared image colorization for nighttime driving scenes with top-down guided attention. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15808–15823. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Guo, Z.; Ma, H.; Liu, J. NLNet: A narrow-channel lightweight network for finger multimodal recognition. Digit. Signal Process. 2024, 150, 104517. [Google Scholar] [CrossRef]
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3024–3033. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Zhang, Q.L.; Yang, Y.B. SA-Net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
- Salman, H.; Parks, C.; Swan, M.; Gauch, J. OrthoNets: Orthogonal channel attention networks. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 829–837. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
- Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
- Wang, H.; Wang, C.; Fu, Q.; Si, B.; Zhang, D.; Kou, R.; Yu, Y.; Feng, C. YOLOFIV: Object detection algorithm for around-the-clock aerial remote sensing images by fusing infrared and visible features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15269–15287. [Google Scholar] [CrossRef]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
- Li, S.; Li, Y.; Li, Y.; Li, M.; Xu, X. Yolo-firi: Improved yolov5 for infrared image object detection. IEEE Access 2021, 9, 141861–141875. [Google Scholar] [CrossRef]
- Jiang, X.; Cai, W.; Yang, Z.; Xu, P.; Jiang, B. IARet: A lightweight multiscale infrared aerocraft recognition algorithm. Arab. J. Sci. Eng. 2022, 47, 2289–2303. [Google Scholar] [CrossRef]
- Li, Q.; Zhang, C.; Hu, Q.; Fu, H.; Zhu, P. Confidence-aware fusion using dempster-shafer theory for multispectral pedestrian detection. IEEE Trans. Multimed. 2022, 25, 3420–3431. [Google Scholar] [CrossRef]
- Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048. [Google Scholar]
- Zheng, Z.; Wu, Y.; Han, X.; Shi, J. ForkGAN: Seeing into the rainy night. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 155–170. [Google Scholar]
- Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Van Gool, L. Night-to-day image translation for retrieval-based localization. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964. [Google Scholar]
- Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Hyper Parameter | Drone Vehicle Dataset | FLIR Dataset | KAIST Dataset |
---|---|---|---|
Scenario | Drone | ADAS | Pedestrian |
Modality | Infrared + Visible | Infrared + Visible | Infrared + Visible |
Images | 56,878 | 14,000 | 95,328 |
Categories | 5 | 4 | 3 |
Labels | 190.6 K | 14.5 K | 103.1 K |
Resolution | 840 × 712 | 640 × 512 | 640 × 512 |
Category | Parameter |
---|---|
CPU | Intel i7-12700H (Intel Corporation, Santa Clara, CA, USA) |
GPU | NVIDIA RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA) |
System | Windows 11 |
Python | 3.8.19 |
PyTorch | 1.12.1 |
Training Epochs | 300 |
Learning Rate | 0.01 |
Weight Decay | 0.0005 |
Momentum | 0.937 |
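For readers who want to reproduce the setup, the optimizer-related entries in the table above are standard YOLO-style SGD settings. The snippet below is a minimal sketch, not the authors' code, of how these values map onto a PyTorch optimizer; the `build_optimizer` helper and the `model` argument are hypothetical names introduced only for illustration.

```python
import torch


def build_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    """Wire the training hyperparameters from the table above into SGD."""
    return torch.optim.SGD(
        model.parameters(),
        lr=0.01,              # initial learning rate
        momentum=0.937,       # momentum
        weight_decay=0.0005,  # weight decay
    )


EPOCHS = 300  # training epochs from the table above

# Usage example with a placeholder module standing in for the detector.
optimizer = build_optimizer(torch.nn.Linear(10, 10))
```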
Hyper Parameter | DroneVehicle Dataset | FLIR Dataset | KAIST Dataset |
---|---|---|---|
Visible Image Size | 640 × 640 | 640 × 640 | 640 × 640 |
Infrared Image Size | 640 × 640 | 640 × 640 | 640 × 640 |
Visible Images | 28,439 | 8437 | 8995 |
Infrared Images | 28,439 | 8437 | 8995 |
Training set | 17,990 | 7381 | 7595 |
Validation set | 1469 | 1056 | 1400 |
Testing set | 8980 | 1056 (test-val) | 1400 (test-val) |
Method | Modality | Car | Freight Car | Truck | Bus | Van | mAP |
---|---|---|---|---|---|---|---|
RetinaNet (OBB) [58] | Visible | 67.5 | 13.7 | 28.2 | 62.1 | 19.3 | 38.2 |
Faster R-CNN (OBB) [30] | Visible | 67.9 | 26.3 | 38.6 | 67.0 | 23.2 | 44.6 |
Faster R-CNN (Dpool) [59] | Visible | 68.2 | 26.4 | 38.7 | 69.1 | 26.4 | 45.8 |
Mask R-CNN [60] | Visible | 68.5 | 26.8 | 39.8 | 66.8 | 25.4 | 45.5 |
Cascade Mask R-CNN [61] | Visible | 68.0 | 27.3 | 44.7 | 69.3 | 29.8 | 47.8 |
RoITransformer [10] | Visible | 68.1 | 29.1 | 44.2 | 70.6 | 27.6 | 47.9 |
YOLOv8n (OBB) [25] | Visible | 97.5 | 42.8 | 62.0 | 94.3 | 46.6 | 68.6 |
Oriented RepPoints [62] | Visible | 73.7 | 30.3 | 45.4 | 73.9 | 36.1 | 51.9 |
RetinaNet (OBB) [58] | Infrared | 79.9 | 28.1 | 32.8 | 67.3 | 16.4 | 44.9 |
Faster R-CNN (OBB) [30] | Infrared | 88.6 | 35.2 | 42.5 | 77.9 | 28.5 | 54.6 |
Faster R-CNN (Dpool) [59] | Infrared | 88.9 | 36.8 | 47.9 | 78.3 | 32.8 | 56.9 |
Mask R-CNN [60] | Infrared | 88.8 | 36.6 | 48.9 | 78.4 | 32.2 | 57.0 |
Cascade Mask R-CNN [61] | Infrared | 81.0 | 39.0 | 47.2 | 79.3 | 33.0 | 55.9 |
RoITransformer [10] | Infrared | 88.9 | 41.5 | 51.5 | 79.5 | 34.4 | 59.2 |
YOLOv8n (OBB) [25] | Infrared | 97.1 | 38.5 | 65.2 | 94.5 | 45.2 | 68.1 |
Oriented RepPoints [62] | Infrared | 87.1 | 39.7 | 50.1 | 77.6 | 36.9 | 58.3 |
UA-CMDet [27] | Visible + Infrared | 87.5 | 46.8 | 60.7 | 87.1 | 38.0 | 64.0 |
Dual-YOLO [12] | Visible + Infrared | 98.1 | 52.9 | 65.7 | 95.8 | 46.6 | 71.8 |
YOLOFIV [63] | Visible + Infrared | 95.9 | 34.6 | 64.2 | 91.6 | 37.2 | 64.7 |
IV-YOLO (Ours) | Visible + Infrared | 97.2 | 63.1 | 65.4 | 94.3 | 53.0 | 74.6 |
Method | Person | Bicycle | Car | mAP |
---|---|---|---|---|
Faster R-CNN [30] | 39.6 | 54.7 | 67.6 | 53.9 |
SSD [29] | 40.9 | 43.6 | 61.6 | 48.7 |
RetinaNet [58] | 52.3 | 61.3 | 71.5 | 61.7 |
FCOS [64] | 69.7 | 67.4 | 79.7 | 72.3 |
MMTOD-UNIT [27] | 49.4 | 64.4 | 70.7 | 61.5 |
MMTOD-CG [27] | 50.3 | 63.3 | 70.6 | 61.4 |
RefineDet [65] | 77.2 | 57.2 | 84.5 | 72.9 |
ThermalDet [27] | 78.2 | 60.0 | 85.5 | 74.6 |
YOLO-FIR [66] | 85.2 | 70.7 | 84.3 | 80.1 |
YOLOv3-tiny [33] | 67.1 | 50.3 | 81.2 | 66.2 |
IARet [67] | 77.2 | 48.7 | 85.8 | 70.7 |
CMPD [68] | 69.6 | 59.8 | 78.1 | 69.3 |
PearlGAN [46] | 54.0 | 23.0 | 75.5 | 50.8 |
Cascade R-CNN [61] | 77.3 | 84.3 | 79.8 | 80.5 |
YOLOv5s [35] | 68.3 | 67.1 | 80.0 | 71.8 |
YOLOF [69] | 67.8 | 68.1 | 79.4 | 71.8 |
YOLOv10n [39] | 62.9 | 69.9 | 86.2 | 73.0 |
Dual-YOLO [12] | 88.6 | 66.7 | 93.0 | 84.5 |
IV-YOLO (Ours) | 86.6 | 77.8 | 92.4 | 85.6 |
Method | Precision | Recall | mAP |
---|---|---|---|
ForkGAN [70] | 33.9 | 4.6 | 4.9 |
ToDayGAN [71] | 11.4 | 14.9 | 5.0 |
UNIT [72] | 40.9 | 43.6 | 11.0 |
PearlGAN [46] | 21.0 | 39.8 | 25.8 |
YOLOv9m [38] | 76.5 | 40.5 | 60.8 |
YOLOv10n [39] | 71.5 | 43.2 | 52.5 |
Dual-YOLO [12] | 75.1 | 66.7 | 73.2 |
IV-YOLO (Ours) | 77.2 | 84.5 | 75.4 |
Method | Dataset | Params | Runtime (fps) |
---|---|---|---|
Faster R-CNN (OBB) [30] | Drone Vehicle | 58.3 M | 5.3 |
Faster R-CNN (Dpool) [59] | Drone Vehicle | 59.9 M | 4.3 |
Mask R-CNN [60] | Drone Vehicle | 242.0 M | 13.5 |
RetinaNet [59] | Drone Vehicle | 145.0 M | 15.0 |
Cascade Mask R-CNN [61] | Drone Vehicle | 368.0 M | 9.8 |
RoITransformer [10] | Drone Vehicle | 273.0 M | 7.1 |
YOLOv7 [37] | Drone Vehicle | 72.1 M | 161.0 |
YOLOv8n [25] | Drone Vehicle | 5.92 M | 188.6 |
IV-YOLO (Ours) | Drone Vehicle | 4.31 M | 203.2 |
SSD [29] | FLIR | 131.0 M | 43.7 |
FCOS [64] | FLIR | 123.0 M | 22.9 |
RefineDet [65] | FLIR | 128.0 M | 24.1 |
YOLO-FIR [66] | FLIR | 7.1 M | 83.3 |
YOLOv3-tiny [33] | FLIR | 17.0 M | 66.2 |
Cascade R-CNN [61] | FLIR | 165.0 M | 16.1 |
YOLOv5s [35] | FLIR | 14.0 M | 41.0 |
YOLOF [69] | FLIR | 44.0 M | 32.0 |
Dual-YOLO [12] | FLIR | 175.1 M | 62.0 |
IV-YOLO (Ours) | FLIR | 4.91 M | 194.6 |
Shuffle-SPP | Bi-Concat | Person | Bicycle | Car | mAP@0.5 | mAP@0.5:0.95 |
---|---|---|---|---|---|---|
× | ✓ | 83.7 | 76.4 | 92.1 | 84.1 | 50.1 |
✓ | × | 83.2 | 73.8 | 91.6 | 82.9 | 49.1 |
✓ | ✓ | 86.6 | 77.8 | 92.4 | 85.6 | 51.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tian, D.; Yan, X.; Zhou, D.; Wang, C.; Zhang, W. IV-YOLO: A Lightweight Dual-Branch Object Detection Network. Sensors 2024, 24, 6181. https://doi.org/10.3390/s24196181