Monocular human depth estimation with 3D motion flow and surface normals

  • Original article
  • Published in: The Visual Computer

Abstract

We propose a novel monocular human depth estimation method that uses video sequences as training data. We jointly train the depth and 3D motion flow networks with photometric and 3D geometric consistency constraints. Instead of depth ground truth, we take the surface normal as a pseudo-label to supervise the learning of the depth network. The estimated depth may exhibit texture copy artifacts when the clothes on the human body carry patterns or text marks (non-dominant colors). We therefore also propose an approach to alleviate the texture copy problem by estimating and adjusting the color of non-dominant color areas. Extensive experiments have been conducted on public datasets and Internet videos. The comparison results show that our method produces competitive human depth estimates and has better generalization ability than state-of-the-art methods.
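To make the supervision described above concrete, the following is a minimal sketch (in PyTorch) of the three constraints named in the abstract: photometric consistency, 3D geometric consistency, and surface-normal pseudo-label supervision. All function and tensor names, the gradient-based normal-from-depth approximation, and the loss weights are assumptions made for illustration; they are not the paper's actual implementation.

```python
# Illustrative sketch only: names, the normal approximation, and the loss
# weights are assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def normals_from_depth(depth):
    """Approximate surface normals from a depth map (B, 1, H, W) via image-plane
    gradients, n ~ normalize([-dD/du, -dD/dv, 1])."""
    du = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # horizontal depth gradient
    dv = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # vertical depth gradient
    du = F.pad(du, (0, 1, 0, 0))                    # restore width to W
    dv = F.pad(dv, (0, 0, 0, 1))                    # restore height to H
    n = torch.cat([-du, -dv, torch.ones_like(depth)], dim=1)  # (B, 3, H, W)
    return F.normalize(n, dim=1)


def training_loss(img_t, warped_img_s, depth_t, pts_t_warped, pts_s,
                  normal_pseudo, mask, w_photo=1.0, w_geo=0.5, w_normal=0.5):
    """Combine the three constraints named in the abstract: photometric
    consistency, 3D geometric consistency, and surface-normal pseudo-labels."""
    # Photometric consistency: the source frame, warped into the target view
    # using the predicted depth and 3D motion flow, should match the target frame.
    loss_photo = (mask * (img_t - warped_img_s).abs()).mean()

    # 3D geometric consistency: target points displaced by the predicted 3D
    # motion flow should agree with the back-projected source points.
    loss_geo = (mask * (pts_t_warped - pts_s).norm(dim=1, keepdim=True)).mean()

    # Surface-normal pseudo-label: normals derived from the predicted depth
    # should align with normals from an off-the-shelf normal estimator.
    n_pred = normals_from_depth(depth_t)
    cos = (n_pred * normal_pseudo).sum(dim=1, keepdim=True)
    loss_normal = (mask * (1.0 - cos)).mean()

    return w_photo * loss_photo + w_geo * loss_geo + w_normal * loss_normal
```

Here mask would be a human segmentation mask restricting the losses to the foreground person, and normal_pseudo is assumed to come from a pretrained surface-normal predictor, so no depth ground truth is needed.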


Data availability

The data that support the findings of this study are openly available in public data repositories: TikTok [19]: https://www.yasamin.page/hdnet_tiktok; Tan [45]: https://github.com/sfu-gruvi-3dv/deep_human; THuman2.0 [55]: https://github.com/ytrock/THuman2.0-Dataset.

References

  1. https://www.remove.bg/upload

  2. http://nghiaho.com/?page_id=671

  3. Aleotti, F., Poggi, M., Mattoccia, S.: Learning optical flow from still images. In: CVPR, pp. 15196–15206 (2021)

  4. Alldieck, T., Pons-Moll, G., Theobalt, C., Magnor, M.: Tex2shape: detailed full human body geometry from a single image. In: ICCV, pp. 2293–2303 (2019)

  5. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. ACM Trans. Gr. 24(3), 408–416 (2005)

  6. Arun, K.S., Huang, T.S., Blostein, S.D.: Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell. 9(5), 698–700 (1987)

  7. Bian, X., Wang, C., Quan, W., Ye, J., Zhang, X., Yan, D.M.: Scene text removal via cascaded text stroke detection and erasing. Comput. Vis. Media 8, 273–287 (2022)

  8. Chen, Z., Lu, X., Zhang, L., Xiao, C.: Semi-supervised video shadow detection via image-assisted pseudo-label generation. In: ACM MM, pp. 2700–2708 (2022)

  9. Feng, Q., Liu, Y., Lai, Y.K., Yang, J., Li, K.: FOF: learning Fourier occupancy field for monocular real-time human reconstruction. In: NeurIPS (2022)

  10. Gastal, E.S.L., Oliveira, M.M.: Domain transform for edge-aware image and video processing. ACM Trans. Gr. 30(4), 1–12 (2011)

  11. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, pp. 270–279 (2017)

  12. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: ICCV, pp. 3828–3838 (2019)

  13. Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: CVPR, pp. 7297–7306 (2018)

  14. Habermann, M., Xu, W., Zollhofer, M., Pons-Moll, G., Theobalt, C.: Deepcap: Monocular human performance capture using weak supervision. In: CVPR, pp. 5052–5063 (2020)

  15. He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2012)

  16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  17. Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo sequences. In: ICCV, pp. 1–7 (2007)

  18. Hur, J., Roth, S.: Self-supervised monocular scene flow estimation. In: CVPR, pp. 7396–7405 (2020)

  19. Jafarian, Y., Park, H.S.: Learning high fidelity depths of dressed humans by watching social media dance videos. In: CVPR, pp. 12753–12762 (2021)

  20. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR, pp. 7122–7131 (2018)

  21. Kingma, D.P., Ba, J.L.: Adam: A method for stochastic optimization. In: ICLR (2015)

  22. Krishna, K., Murty, M.N.: Genetic k-means algorithm. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 29(3), 433–439 (1999)

  23. Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: CVPR, pp. 6647–6655 (2017)

  24. Lahner, Z., Cremers, D., Tung, T.: Deepwrinkles: accurate and realistic clothing modeling. In: ECCV, pp. 667–684 (2018)

  25. Lazova, V., Insafutdinov, E., Pons-Moll, G.: 360-degree textures of people in clothing from a single image. In: 3DV, pp. 643–653 (2019)

  26. Li, Y., Luo, F., Li, W., Zheng, S., Wu, H.H., Xiao, C.: Self-supervised monocular depth estimation based on image texture detail enhancement. Vis. Comput. 37(9), 2567–2580 (2021)

  27. Li, Y., Luo, F., Xiao, C.: Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Comput. Vis. Media 8(4), 631–647 (2022)

  28. Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: CVPR, pp. 4521–4530 (2019)

  29. Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: CVPR, pp. 6498–6508 (2021)

  30. Liang, X., Gong, K., Shen, X., Lin, L.: Look into person: joint body parsing and pose estimation network and a new benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 41(4), 871–885 (2018)

  31. Liu, X., Qi, C.R., Guibas, L.J.: Flownet 3D: Learning scene flow in 3D point clouds. In: CVPR, pp. 529–537 (2019)

  32. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: a skinned multi-person linear model. ACM Trans. Gr. 34(6), 1–16 (2015)

  33. Luo, F., Wei, L., Xiao, C.: Stable depth estimation within consecutive video frames. In: CGI, pp. 54–66 (2021)

  34. Luo, F., Zhu, Y., Fu, Y., Zhou, H., Chen, Z., Xiao, C.: Sparse rgb-d images create a real thing: a flexible voxel based 3d reconstruction pipeline for single object. Vis. Inf. 7(1), 66–76 (2023)

  35. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 405–421 (2020)

  36. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV, pp. 483–499 (2016)

  37. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR, pp. 10975–10985 (2019)

  38. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR, pp. 10975–10985 (2019)

  39. Petrovai, A., Nedevschi, S.: Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In: CVPR, pp. 1578–1588 (2022)

  40. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241 (2015)

  41. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In: ICCV, pp. 2304–2314 (2019)

  42. Saito, S., Simon, T., Saragih, J., Joo, H.: Pifuhd: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In: CVPR, pp. 84–93 (2020)

  43. Schuster, R., Wasenmuller, O., Kuschk, G., Bailer, C., Stricker, D.: Sceneflowfields: dense interpolation of sparse scene flow correspondences. In: WACV, pp. 1056–1065 (2018)

  44. She, D., Xu, K.: An image-to-video model for real-time video enhancement. In: ACM MM, pp. 1837–1846 (2022)

  45. Tang, S., Tan, F., Cheng, K., Li, Z., Zhu, S., Tan, P.: A neural network for detailed human depth estimation from a single image. In: ICCV, pp. 7750–7759 (2019)

  46. Teed, Z., Deng, J.: Raft-3D: scene flow using rigid-motion embeddings. In: CVPR, pp. 8375–8384 (2021)

  47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  48. Vedula, S., Baker, S., Rander, P., Collins, R., Kanade, T.: Three-dimensional scene flow. In: ICCV, pp. 722–729 (1999)

  49. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  50. Wang, Z., Li, S., Howard-Jenkins, H., Prisacariu, V., Chen, M.: Flownet3d++: geometric losses for deep scene flow estimation. In: WACV, pp. 91–98 (2020)

  51. Wei, Y., Wang, Z., Rao, Y., Lu, J., Zhou, J.: Pv-raft: point-voxel correlation fields for scene flow estimation of point clouds. In: CVPR, pp. 6954–6963 (2021)

  52. Xiu, Y., Yang, J., Tzionas, D., Black, M.J.: Icon: implicit clothed humans obtained from normals. In: CVPR, pp. 13286–13296 (2022)

  53. Yang, G., Ramanan, D.: Learning to segment rigid motions from two frames. In: CVPR, pp. 1266–1275 (2021)

  54. Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., Liu, Y.: Function4D: real-time human volumetric capture from very sparse consumer rgbd sensors. In: CVPR, pp. 5746–5756 (2021)

  55. Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., Liu, Y.: Function4D: real-time human volumetric capture from very sparse consumer rgbd sensors. In: CVPR (2021)

  56. Zhang, F., Li, Y., You, S., Fu, Y.: Learning temporal consistency for low light video enhancement from single images. In: CVPR, pp. 4967–4976 (2021)

  57. Zhang, W., Yan, Q., Xiao, C.: Detail preserved point cloud completion via separated feature aggregation. In: ECCV, pp. 512–528 (2020)

  58. Zhang, X., Ge, Y., Qiao, Y., Li, H.: Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In: CVPR, pp. 3436–3445 (2021)

  59. Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. ACM Trans. Gr. 40(4), 1–12 (2021)

  60. Zheng, Z., Yu, T., Liu, Y., Dai, Q.: Pamir: parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 3170–3184 (2022)

  61. Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: Deephuman: 3D human reconstruction from a single image. In: CVPR, pp. 7739–7749 (2019)

  62. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR, pp. 1851–1858 (2017)


Acknowledgements

This work is partially supported by NSFC (No. 61972298), Bingtuan Science and Technology Program (No. 2019BC008), and CAAI-Huawei MindSpore Open Fund.

Author information

Corresponding authors

Correspondence to Fei Luo or Chunxia Xiao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was co-supervised by Chunxia Xiao and Fei Luo.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Y., Luo, F. & Xiao, C. Monocular human depth estimation with 3D motion flow and surface normals. Vis Comput 39, 3701–3713 (2023). https://doi.org/10.1007/s00371-023-02995-8

