UV R-CNN: Stable and efficient dense human pose estimation

Multimedia Tools and Applications

Abstract

As a dense prediction task aimed at instance-level human analysis, dense pose estimation seeks to accurately map 2D pixels onto the 3D surface of the human body. Although significant progress has been made, two major challenges still confront the research community: the first is training instability caused by the large number of surface points to be regressed; the second is the significant time and computational resources needed to manually tune multi-task loss weights. To overcome these challenges, we present a novel dense pose estimator, named UV R-CNN, which is based on a detailed analysis of the loss formulation used in existing algorithms. UV R-CNN first introduces a novel surface point regression loss, named Dense Points Loss (DP-Loss), which constrains the otherwise very large loss and stabilizes training. Additionally, we incorporate a Balanced Weighting Strategy (BWS) that adapts the loss weights automatically. Remarkably, without auxiliary supervision or external knowledge from other tasks, UV R-CNN can be trained with a larger learning rate, achieving 65.0% \(AP_{gps}\) and 66.1% \(AP_{gps^{m}}\) on the DensePose-COCO validation subset with ResNet-50-FPN as the backbone, which is competitive with state-of-the-art methods.
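The abstract does not give the BWS formulation, so the sketch below is not the paper's method. It only illustrates the general idea of adapting multi-task loss weights automatically, using a simplified form of the well-known uncertainty-based weighting of Kendall et al. (2018): each task loss \(L_i\) gets a learnable log-variance \(s_i\), and the total is \(\sum_i e^{-s_i} L_i + s_i\). The function names and the loss magnitudes are hypothetical, and the task losses are held fixed so that only the weights move.

```python
import math


def weighted_total(losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i."""
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))


def adapt_log_vars(losses, log_vars, lr=0.1, steps=500):
    """Gradient descent on the log-variances s_i with the task losses held fixed."""
    s = list(log_vars)
    for _ in range(steps):
        # d/ds [exp(-s) * L + s] = 1 - exp(-s) * L
        s = [si - lr * (1.0 - math.exp(-si) * l) for si, l in zip(s, losses)]
    return s


# Hypothetical loss magnitudes for three heads (illustrative only, not from the paper).
task_losses = [0.5, 2.0, 40.0]
log_vars = adapt_log_vars(task_losses, [0.0, 0.0, 0.0])
weights = [math.exp(-s) for s in log_vars]

for loss, weight in zip(task_losses, weights):
    print(f"loss={loss:6.2f}  weight={weight:.4f}  weighted={loss * weight:.4f}")
```

Under these assumptions each weight settles near \(1/L_i\), so a head with a very large raw loss no longer dominates the total. In a real training loop the \(s_i\) would be optimized jointly with the network parameters rather than against fixed losses.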

Data Availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. https://github.com/facebookresearch/DensePose

Author information

Corresponding author

Correspondence to Qing Song.

Ethics declarations

Conflict of Interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yilin Zhou, Mengjie Hu and Chun Liu contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jia, W., Zhu, X., Zhou, Y. et al. UV R-CNN: Stable and efficient dense human pose estimation. Multimed Tools Appl 83, 24699–24714 (2024). https://doi.org/10.1007/s11042-023-15379-w

  • DOI: https://doi.org/10.1007/s11042-023-15379-w
