Underwater Holothurian Target-Detection Algorithm Based on Improved CenterNet and Scene Feature Fusion
Figure 1. FA-CenterNet network structure. FA-CenterNet uses EfficientNet-B3 as its backbone network and adds FPT and AFF modules. Compared with the original CenterNet, FA-CenterNet improves the accuracy of underwater holothurian detection while reducing FLOPs and Params. The outputs of blocks 7, 5, 3, and 2 are related by stride-2 down-sampling; for ease of describing the FPT implementation, they are named X0, X1, X2, and X3. Note that the FPT module takes in two distinct combinations of features.
Figure 2. MBConv module.
Figure 3. Holothurian scenes in the CURPC dataset. (a) Reefs and holothurians in the same scene. (b) Waterweeds and holothurians in the same scene. (c) Holothurians whose body features are blurred but which can be identified by their spines.
Figure 4. Improved structure of the FPT modules. Different texture patterns represent different feature transformers, and different colors represent feature maps of different scales. To describe the FPT module more succinctly, the outputs of blocks 7, 5, 3, and 2 are named X0, X1, X2, and X3. "Conv1" and "Conv2" on the right-hand side of the structure are 3 × 3 convolution modules with 192 and 96 output channels, respectively. (a) The FPT input: a feature pyramid consisting of two combinations. (b) The designs of the three transformers in FPT. (c) The FPT output stage, which controls the number of feature channels.
Figure 5. FPT feature interaction diagram.
Figure 6. Structure of the MS-CAM module.
Figure 7. Structure of the AFF module.
Figure 8. Impact of score thresholds. (a) Precision versus score threshold. (b) Recall versus score threshold. (c) F1-score versus score threshold.
Figure 9. Waterweed falsely detected as a holothurian.
Figure 10. Heatmap visualizations of four models. (a) Original input pictures. (b) CenterNet. (c) CenterNet(B3). (d) F-CenterNet. (e) FA-CenterNet.
Figure 11. Performance of different detection methods on the CURPC 2020 dataset. (a) Original input pictures. (b) SSD. (c) YOLOv3. (d) YOLOv4-tiny. (e) YOLOv5-s. (f) YOLOv5-l. (g) FA-CenterNet.
Abstract
1. Introduction
- (1) We propose an improved CenterNet model for holothurian detection that replaces the original backbone network, ResNet-50, with the more efficient EfficientNet-B3. EfficientNet-B3 reduces the Params and FLOPs of the model while increasing its depth and width through neural architecture search (NAS) and a depthwise separable convolution strategy (a sketch of the corresponding MBConv block follows the EfficientNet-B3 structure table below). The high-performance EfficientNet-B3 considerably improves the feasibility of deploying the model on resource-limited embedded devices.
- (2) To improve detection accuracy by making full use of holothurian features and co-occurring scene information (e.g., waterweeds, reefs, and holothurian spines), we add an FPT module between the backbone and the neck network. FPT uses three sub-transformers, ST, GT, and RT, to integrate features across different scales and spatial positions, exploiting both the special scene context and the fine details of holothurians. We also adapt the FPT implementation to the target-detection network: two FPT modules are used, each fed a different feature combination, so that the model can incorporate more ecological scene information for holothurian detection (a simplified sketch of the core self-attention interaction appears after this list).
- (3) We use the AFF module to fuse multi-scale features more effectively. Unlike conventional linear feature fusion (such as "Concat"), the AFF module combines global and local feature attention to fuse low-level detail features with high-level semantic features, further improving the accuracy of holothurian detection (see the fusion sketch after this list).
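As referenced in contribution (2), the sketch below illustrates the kind of feature interaction that FPT's Self-Transformer (ST) performs: non-local self-attention within a single feature map, in the style of Wang et al.'s non-local block. This is a minimal PyTorch sketch assuming plain softmax attention; the actual FPT ST additionally uses a mixture-of-softmaxes formulation, and the layer names and channel choices here are illustrative, not the authors' implementation.

```python
# Minimal sketch of the Self-Transformer (ST) idea in FPT: non-local
# self-attention within one feature map (assumed simplification; the
# paper's ST uses a mixture of softmaxes on top of this pattern).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfTransformerSketch(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)   # q(X)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)     # k(X)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # v(X)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        # every position attends to every other position of the same map
        attn = F.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out  # residual connection, as in non-local blocks
```

For example, applying this to an X1-like feature map, `SelfTransformerSketch(136)(torch.randn(1, 136, 32, 32))`, returns a tensor of the same shape with long-range context mixed into every position.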
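For contribution (3), the following hedged sketch shows the fusion pattern of MS-CAM and AFF as described by Dai et al.: a global (pooled) branch and a local (per-pixel) branch each produce channel attention, and their sigmoid-gated sum softly selects between the two inputs being fused. The reduction ratio and layer layout are assumptions based on the AFF paper, not the exact configuration used in FA-CenterNet.

```python
# Hedged sketch of MS-CAM and AFF (after Dai et al.); sizes are
# illustrative assumptions, not FA-CenterNet's exact configuration.
import torch
import torch.nn as nn

class MSCAMSketch(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.local_att = branch()        # attention at every spatial position
        self.global_att = branch()       # attention on globally pooled context
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # (B, C, H, W) local weights + (B, C, 1, 1) global weights broadcast
        w = self.local_att(x) + self.global_att(self.pool(x))
        return torch.sigmoid(w)

class AFFSketch(nn.Module):
    """Fuse low-level detail x with high-level semantics y."""
    def __init__(self, channels):
        super().__init__()
        self.ms_cam = MSCAMSketch(channels)

    def forward(self, x, y):
        m = self.ms_cam(x + y)           # attention computed from the initial sum
        return m * x + (1.0 - m) * y     # soft selection between the two inputs
```

The gating `m * x + (1 - m) * y` is what distinguishes AFF from linear fusion such as "Concat" or element-wise addition: the network learns, per channel and per position, how much low-level detail versus high-level semantics to keep.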
2. Proposed Method
2.1. Overall Network Structure
2.2. EfficientNet-B3
2.3. FPT Module
2.3.1. Self-Transformer
2.3.2. Grounding Transformer
2.3.3. Rendering Transformer
2.4. AFF Module
3. Experiments
3.1. Experimental Setting
3.1.1. Dataset
3.1.2. Implementation Details
3.2. Ablation Experiments and Analysis
3.2.1. Quantitative Evaluations
3.2.2. Effectiveness of Components in FPT
3.2.3. Impact of Score Thresholds
3.2.4. Visualization of the Heatmap
3.3. Comparison Experiments and Analysis
3.3.1. Comparison with State-of-the-Art Methods
3.3.2. The Prediction Visualizations of Different Methods
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Loss Functions
Appendix B. Calculation Equation of FPT Component
Appendix B.1. Self-Transformer
Appendix B.2. Grounding Transformer
Appendix B.3. Rendering Transformer
References
- Schoening, T.; Bergmann, M.; Ontrup, J.; Taylor, J.; Dannheim, J.; Gutt, J.; Purser, A.; Nattkemper, T.W. Semi-automated image analysis for the assessment of megafaunal densities at the arctic deep-sea observatory HAUSGARTEN. PLoS ONE 2012, 7, e38179. [Google Scholar] [CrossRef]
- Fabic, J.N.; Turla, I.E.; Capacillo, J.A.; David, L.T.; Naval, P.C. Fish population estimation and species classification from underwater video sequences using blob counting and shape analysis. In Proceedings of the 2013 IEEE International Underwater Technology Symposium (UT), Tokyo, Japan, 5–8 March 2013; pp. 1–6. [Google Scholar] [CrossRef]
- Hsiao, Y.; Chen, C.; Lin, S.; Lin, F. Real-world underwater fish recognition and identification, using sparse representation. Ecol. Inform. 2014, 23, 13–21. [Google Scholar] [CrossRef]
- Qiao, X.; Bao, J.; Zeng, L.; Zou, J.; Li, D. An automatic active contour method for sea cucumber segmentation in natural underwater environments. Comput. Electron. Agric. 2017, 135, 134–142. [Google Scholar] [CrossRef]
- Qiao, X.; Bao, J.; Zhang, H.; Wan, F.; Li, D. Underwater sea cucumber identification based on principal component analysis and support vector machine. Meas. J. Int. Meas. Confed. 2019, 133, 444–455. [Google Scholar] [CrossRef]
- Li, X.; Shang, M.; Qin, H.; Chen, L. Fast accurate fish detection and recognition of underwater images with fast R-CNN. In Proceedings of the OCEANS 2015-MTS/IEEE Washington, Washington, DC, USA, 19–22 October 2015; pp. 1–5. [Google Scholar] [CrossRef]
- Zurowietz, M.; Langenkämper, D.; Hosking, B.; Ruhl, H.A.; Nattkemper, T.W. MAIA-A machine learning assisted image annotation method for environmental monitoring and exploration. PLoS ONE 2018, 13, e0207498. [Google Scholar] [CrossRef] [PubMed]
- Shi, T.; Liu, M.; Niu, Y.; Yang, Y.; Huang, Y. Underwater targets detection and classification in complex scenes based on an improved YOLOv3 algorithm. J. Electron. Imaging 2020, 29, 043013. [Google Scholar] [CrossRef]
- Liu, H.; Song, P.; Ding, R. WQT and DG-YOLO: Towards domain generalization in underwater object detection. arXiv 2020, arXiv:2004.06333. [Google Scholar] [CrossRef]
- Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight underwater object detection based on YOLO v4 and multi-scale attentional feature fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
- Piechaud, N.; Howell, K.L. Fast and accurate mapping of fine scale abundance of a VME in the deep sea with computer vision. Ecol. Inform. 2022, 71, 101786. [Google Scholar] [CrossRef]
- Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. Int. J. Comput. Vis. 2019, 128, 642–656. [Google Scholar] [CrossRef]
- Zhou, X.; Zhuo, J.; Krähenbühl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 9626–9635. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946. [Google Scholar] [CrossRef]
- Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature pyramid transformer. In Computer Vision—ECCV 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 323–339. [Google Scholar] [CrossRef]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. arXiv 2017, arXiv:1711.07971. [Google Scholar] [CrossRef]
- Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2021; pp. 3559–3568. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot MultiBox detector. In Computer Vision—ECCV 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar] [CrossRef]
Module | Component | Component Layers | Kernel/Stride | Outputs |
---|---|---|---|---|
Stem | Conv | 1 | (3 × 3)/2 | (256, 256, 40) |
Block 1 | MBConv1 | 2 | (3 × 3)/1 | (256, 256, 24) |
Block 2 | MBConv6 | 3 | (3 × 3)/2 | (128, 128, 32) |
Block 3 | MBConv6 | 3 | (5 × 5)/2 | (64, 64, 48) |
Block 4 | MBConv6 | 5 | (3 × 3)/1 | (32, 32, 96) |
Block 5 | MBConv6 | 5 | (5 × 5)/2 | (32, 32, 136) |
Block 6 | MBConv6 | 6 | (5 × 5)/2 | (16, 16, 232) |
Block 7 | MBConv6 | 2 | (3 × 3)/1 | (16, 16, 384) |
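As a companion to Figure 2 and the structure table above, here is a simplified PyTorch sketch of a standard MBConv block (inverted residual with a depthwise separable convolution and squeeze-and-excitation), which is the design EfficientNet-B3 stacks in blocks 1–7. Drop-connect and the official implementation's exact BN and padding settings are omitted; treat this as an illustration of the depthwise separable convolution strategy mentioned in contribution (1), not the authors' code.

```python
# Simplified MBConv sketch (assumed standard EfficientNet design):
# 1x1 expansion -> depthwise conv -> squeeze-and-excitation -> 1x1 projection.
import torch
import torch.nn as nn

def swish(x):
    return x * torch.sigmoid(x)

class MBConvSketch(nn.Module):
    def __init__(self, c_in, c_out, kernel=3, stride=1, expand=6, se_ratio=0.25):
        super().__init__()
        mid = c_in * expand
        self.expand = nn.Conv2d(c_in, mid, 1, bias=False) if expand != 1 else None
        self.bn0 = nn.BatchNorm2d(mid)
        # depthwise convolution: groups == channels, so each channel is
        # filtered independently (the cheap half of a separable conv)
        self.dw = nn.Conv2d(mid, mid, kernel, stride, padding=kernel // 2,
                            groups=mid, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        se_mid = max(1, int(c_in * se_ratio))
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(mid, se_mid, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(se_mid, mid, 1), nn.Sigmoid())
        # pointwise projection: the 1x1 half of the separable conv
        self.project = nn.Conv2d(mid, c_out, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.use_skip = stride == 1 and c_in == c_out

    def forward(self, x):
        h = x
        if self.expand is not None:
            h = swish(self.bn0(self.expand(h)))
        h = swish(self.bn1(self.dw(h)))
        h = h * self.se(h)                  # squeeze-and-excitation gating
        h = self.bn2(self.project(h))
        return x + h if self.use_skip else h
```

For instance, the first layer of Block 2 corresponds roughly to `MBConvSketch(24, 32, kernel=3, stride=2)`, halving the 256 × 256 output of Block 1 to 128 × 128 while widening it to 32 channels.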
Environment | Version |
---|---|
CPU | Intel i9-10920X, 3.50 GHz |
GPU | NVIDIA RTX 2080Ti |
OS | Windows 10 |
CUDA/cuDNN | V 10.1/V 7.6.5 |
Python | V 3.8 |
PyTorch | V 1.2.0 |
Models | Backbone | FPT | AFF | AP50 | Params | FLOPs |
---|---|---|---|---|---|---|
CenterNet | ResNet-50 | | | 79.03% | 32.66 M | 70.12 G
CenterNet(B3) | EfficientNet-B3 | | | 80.29% | 12.45 M | 15.79 G
F-CenterNet | EfficientNet-B3 | √ | | 82.87% | 17.16 M | 41.49 G
FA-CenterNet | EfficientNet-B3 | √ | √ | 83.43% | 15.90 M | 25.31 G |
Models | FPT-1 Combination | FPT-2 Combination | AP50 | Params | FLOPs
---|---|---|---|---|---
CenterNet(B3) | - | - | 80.29% | 12.45 M | 15.79 G
+ST | X2, ST(X2) | X1, ST(X1) | 81.15% (↑0.86%) | 14.78 M | 35.23 G |
+GT | X2, GT(X2,X1) | X1, GT(X1,X0) | 80.79% (↑0.50%) | 16.67 M | 34.94 G |
+RT | X2, RT(X2,X3) | X1, RT(X1,X2) | 81.65% (↑1.36%) | 16.67 M | 38.80 G |
+FPT | X2, ST(X2), GT(X2,X1), RT(X2,X3) | X1, ST(X1), GT(X1,X0), RT(X1,X2) | 82.87% (↑2.58%) | 17.16 M | 41.49 G |
Models | Backbone | AP50 | Params | FLOPs |
---|---|---|---|---|
SSD | VGG19 | 76.30% | 26.285 M | 180.44 G |
YOLOv3 | DarkNet-53 | 75.03% | 61.54 M | 99.39 G |
YOLOv4-tiny | CSPDarknet53-tiny | 60.58% | 5.88 M | 10.34 G |
YOLOv5-s | CSPDarknet | 80.31% | 7.07 M | 10.56 G |
YOLOv5-l | CSPDarknet | 84.14% | 47.01 M | 115.92 G |
CenterNet | ResNet-50 | 79.03% | 32.66 M | 70.01 G
FA-CenterNet | EfficientNet-B3 | 83.43% | 15.90 M | 25.12 G |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).