Abstract
As a dense prediction task aimed at instance-level human analysis, dense pose estimation seeks to accurately map 2D pixels onto the 3D surface of the human body. Although significant progress has been made, two major challenges continue to confront the research community: first, training is unstable because a large number of surface points must be regressed; second, manually tuning multi-task loss weights consumes a significant amount of time and computational resources. To overcome these challenges, we present a novel dense pose estimator, named UV R-CNN, which is based on a detailed analysis of the loss formulation used in existing algorithms. UV R-CNN first introduces a novel surface point regression loss, named Dense Points Loss (DP-Loss), which constrains the immense loss and stabilizes training. Additionally, we incorporate a Balanced Weighting Strategy (BWS) that adapts the loss weights automatically. Remarkably, without auxiliary supervision or external knowledge from other tasks, UV R-CNN can be trained with a larger learning rate, achieving 65.0% \(AP_{gps}\) and 66.1% \(AP_{gps^{m}}\) on the DensePose-COCO validation subset with ResNet-50-FPN as the backbone, competitive with state-of-the-art methods.
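The two problems named in the abstract can be illustrated with a small numerical sketch. This is not the paper's DP-Loss or BWS formulation, which the abstract does not spell out. It assumes a smooth L1 penalty per point and a hypothetical balancing rule that rescales every task loss to the mean magnitude. The function names `point_loss` and `balanced_weights` are illustrative only.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 penalty for a single regressed coordinate."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def point_loss(preds, targets, normalize):
    """Sum the penalty over all surface points; optionally divide by the count."""
    total = sum(smooth_l1(p, t) for p, t in zip(preds, targets))
    return total / max(len(preds), 1) if normalize else total


def balanced_weights(losses):
    """Hypothetical rule (an assumption, not BWS): weight each task so that
    its weighted loss equals the mean of the raw task losses."""
    mean = sum(losses.values()) / len(losses)
    return {name: mean / value for name, value in losses.items()}


# The same per-point error of 2.0 on 10 points versus 1000 points.
few = point_loss([2.0] * 10, [0.0] * 10, normalize=False)
many = point_loss([2.0] * 1000, [0.0] * 1000, normalize=False)
many_normalized = point_loss([2.0] * 1000, [0.0] * 1000, normalize=True)

# An unnormalized point loss dwarfs a second task loss until it is rebalanced.
raw = {"uv": many, "mask": 0.5}
weights = balanced_weights(raw)
weighted = {name: weights[name] * raw[name] for name in raw}
```

In this sketch the summed loss grows linearly with the number of regressed points while the normalized loss does not, and the rescaled terms end up with equal magnitude without any hand-tuned weights.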
Data Availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Conflict of Interest
No potential conflict of interest was reported by the authors.
Additional information
Yilin Zhou, Mengjie Hu and Chun Liu contributed equally to this work.
About this article
Cite this article
Jia, W., Zhu, X., Zhou, Y. et al. UV R-CNN: Stable and efficient dense human pose estimation. Multimed Tools Appl 83, 24699–24714 (2024). https://doi.org/10.1007/s11042-023-15379-w
DOI: https://doi.org/10.1007/s11042-023-15379-w