Abstract
Keypoint detection is an important research topic in target recognition and classification. This paper studies keypoint detection in images of Amur tigers and proposes a target keypoint detection method based on heterogeneous convolution neural networks. Because monitoring devices have limited storage capacity while the task demands high accuracy, we propose a heterogeneous convolution called SHetConv, which is composed of group convolution and standard convolution. We use two kinds of SHetConv: one to reduce the computational cost, measured as the number of floating-point operations (FLOPs), and one to increase the receptive field. To further improve the effectiveness of the model, we propose a feature fusion module that makes full use of the semantic and spatial information in images. We evaluate the algorithm on the Tiger Pose Keypoint, CIFAR-10 and MPII datasets. The experimental results show that our method achieves better accuracy, recall and \({F_{{1}}}\)-score than other state-of-the-art keypoint detection methods, while substantially reducing the number of parameters and FLOPs. Specifically, the number of parameters and the FLOPs of our model (scaled network + fusion module + shet2) are 0.14 and 0.143 times those of the large HRNet-W48 model, respectively, and its \({F_{{1}}}\)-score is 0.3% higher.
Notes
The size of a convolution kernel is \(C \times {K_{1}} \times {K_{1}}\). In this paper, C is the number of channels of the convolution kernel, which equals the number of channels of the feature maps being convolved, and \({K_{1}}\) is the height and width of the kernel.
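To make the kernel-size notation concrete, the sketch below counts the weights and multiply-accumulate operations of a bias-free convolution layer, first as a standard convolution and then as a group convolution. It is only an illustration of why grouping lowers cost; it does not reproduce how SHetConv arranges its group and standard kernels. The function name `conv_cost`, the filter count, the group count and the output size are assumptions chosen for the example.

```python
def conv_cost(c_in, c_out, k, out_h, out_w, groups=1):
    """Return (parameters, multiply-accumulates) for a bias-free 2D convolution.

    c_in  : number of input channels (C in the note above)
    c_out : number of filters (assumed value, not from the paper)
    k     : kernel height and width (K_1 in the note above)
    groups: 1 for standard convolution, >1 for group convolution
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # With grouping, each filter spans only c_in / groups input channels.
    params = c_out * (c_in // groups) * k * k
    # Every output position applies every filter weight once.
    macs = params * out_h * out_w
    return params, macs


# Hypothetical layer: 64 -> 64 channels, 3x3 kernels, 32x32 output map.
standard = conv_cost(64, 64, 3, 32, 32)
grouped = conv_cost(64, 64, 3, 32, 32, groups=4)
print("standard:", standard)
print("grouped :", grouped)
```

With the same channel counts and kernel size, splitting the layer into G groups divides both the parameter count and the operation count by G, which is the saving that a mix of group and standard kernels trades against accuracy.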
Acknowledgements
This work was supported by the Major Project of Technological Innovation 2030 “New Generation Artificial Intelligence” (2018AAA0100800), the National Natural Science Foundation of China (61872042, 61572077, 61972375), the Key Project of the Education Commission of Beijing Municipal (KZ201911417048), the Premium Funding Project for Academic Human Resources Development in Beijing Union University (BPHR2020AZ01, BPH2020EZ01), and the Project of High-Level Teachers in Beijing Municipal Universities in the Period of the 13th Five-Year Plan (CIT & TCD 201704069).
Additional information
Communicated by B.-K. Bao.
Cite this article
Yin, X., He, N., Liu, X. et al. SHetConv: target keypoint detection based on heterogeneous convolution neural networks. Multimedia Systems 27, 519–529 (2021). https://doi.org/10.1007/s00530-020-00729-7
DOI: https://doi.org/10.1007/s00530-020-00729-7