UV R-CNN: Stable and efficient dense human pose estimation

Wenhe Jia¹, Xuhan Zhu², Yilin Zhou¹, Mengjie Hu¹, Chun Liu¹ & Qing Song¹ (ORCID: orcid.org/0000-0003-1936-224X)

Abstract

As a dense prediction task aimed at instance-level human analysis, dense pose estimation seeks to accurately map 2D pixels onto the 3D surface of the human body. Although significant progress has been made, two major challenges continue to confront the research community: the first is training instability caused by the large number of surface points to be regressed; the second is the significant amount of time and computational resources required to manually adjust multi-task loss weights. To overcome these challenges, we present a novel dense pose estimator, named UV R-CNN, which is based on a detailed analysis of the loss formulation used in existing algorithms. UV R-CNN first introduces a novel surface point regression loss, named Dense Points Loss (DP-Loss), which constrains the otherwise very large loss and stabilizes training. Additionally, we incorporate a Balanced Weighting Strategy (BWS) that adapts the loss weights automatically. Remarkably, without auxiliary supervision or external knowledge from other tasks, UV R-CNN can be trained with a larger learning rate, achieving 65.0% $AP_{gps}$ and 66.1% $AP_{gps^{m}}$ on the DensePose-COCO validation subset with ResNet-50-FPN as the backbone, which is competitive with state-of-the-art methods.
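
To make the idea of automatically adapted multi-task loss weights concrete, the sketch below shows one well-known generic scheme, a simplified form of homoscedastic-uncertainty weighting (Kendall, Gal and Cipolla, 2018). It is not the Balanced Weighting Strategy of UV R-CNN, whose formulation is not given in the abstract; the function name, the example task losses, and the log-variance values are all hypothetical.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(losses: Sequence[float],
                              log_vars: Sequence[float]) -> float:
    """Combine per-task losses using weights derived from log-variances.

    Illustrative assumption, not the paper's BWS: task i contributes
    exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2). In a real training
    loop each s_i would be a learnable parameter updated by the optimizer.
    """
    if len(losses) != len(log_vars):
        raise ValueError("losses and log_vars must have the same length")
    # exp(-s) down-weights tasks with high uncertainty; the + s term
    # penalizes the trivial solution of inflating every variance.
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical per-task losses (e.g. a segmentation and a regression head).
task_losses = [2.0, 4.0]
equal_weights = uncertainty_weighted_loss(task_losses, [0.0, 0.0])
adapted = uncertainty_weighted_loss(task_losses, [math.log(2.0), math.log(4.0)])
```

With all log-variances at zero the result is the plain sum of the task losses; raising a task's log-variance shrinks that task's effective weight, which is the kind of adjustment that would otherwise be tuned by hand.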

Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Notes

https://github.com/facebookresearch/DensePose

Author information

Authors and Affiliations

Artificial Intelligence Academy, Beijing University of Posts and Telecommunications, 10th Xitucheng road, Haidian District, Beijing, 100086, Beijing, China
Wenhe Jia, Yilin Zhou, Mengjie Hu, Chun Liu & Qing Song
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100086, Beijing, China
Xuhan Zhu

Corresponding author

Correspondence to Qing Song.

Ethics declarations

Conflict of Interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yilin Zhou, Mengjie Hu and Chun Liu contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jia, W., Zhu, X., Zhou, Y. et al. UV R-CNN: Stable and efficient dense human pose estimation. Multimed Tools Appl 83, 24699–24714 (2024). https://doi.org/10.1007/s11042-023-15379-w

Received: 13 December 2022
Revised: 21 March 2023
Accepted: 15 April 2023
Published: 09 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-15379-w
