Abstract
We propose a novel monocular human depth estimation method that uses video sequences as training data. We jointly train the depth and 3D motion flow networks under photometric and 3D geometric consistency constraints. Instead of depth ground truth, we take the surface normal as a pseudo-label to supervise the learning of the depth network. The estimated depth may exhibit texture-copy artifacts when the clothes on the human body carry patterns or text marks (non-dominant colors). We therefore also propose an approach that alleviates the texture-copy problem by estimating and adjusting the color of non-dominant-color areas. Extensive experiments have been conducted on public datasets and Internet videos. The comparison results show that our method produces competitive human depth estimates and generalizes better than state-of-the-art methods.
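To give intuition for the surface-normal pseudo-label idea, the sketch below derives per-pixel normals from a depth map by finite differences and scores a predicted depth against pseudo-label normals with a cosine-similarity loss. This is a minimal, hypothetical reconstruction for illustration only; the function names, the unit-focal-length camera model, and the exact loss form are our assumptions, not the paper's implementation.

```python
import numpy as np

def normals_from_depth(depth, fx=1.0, fy=1.0):
    """Estimate per-pixel surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)
    # Normal direction (-fx * dz/dx, -fy * dz/dy, 1), then normalize to unit length.
    n = np.stack([-dz_dx * fx, -dz_dy * fy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(pred_depth, pseudo_normals):
    """Mean (1 - cosine similarity) between the normals implied by the
    predicted depth and the pseudo-label normals."""
    pred_n = normals_from_depth(pred_depth)
    cos = np.sum(pred_n * pseudo_normals, axis=-1)
    return float(np.mean(1.0 - cos))
```

Under this formulation, a depth prediction whose implied normals agree everywhere with the pseudo-label (e.g. a fronto-parallel plane against upright normals) incurs zero loss, and the loss grows as the local surface orientation deviates.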
Data availability
The data that support the findings of this study are openly available in the following public repositories: TikTok [19]: https://www.yasamin.page/hdnet_tiktok; Tan [45]: https://github.com/sfu-gruvi-3dv/deep_human; and THuman2.0 [55]: https://github.com/ytrock/THuman2.0-Dataset.
References
Aleotti, F., Poggi, M., Mattoccia, S.: Learning optical flow from still images. In: CVPR, pp. 15196–15206 (2021)
Alldieck, T., Pons-Moll, G., Theobalt, C., Magnor, M.: Tex2shape: detailed full human body geometry from a single image. In: ICCV, pp. 2293–2303 (2019)
Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. ACM Trans. Gr. 24(3), 408–416 (2005)
Arun, K.S., Huang, T.S., Blostein, S.D.: Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell. 9(5), 698–700 (1987)
Bian, X., Wang, C., Quan, W., Ye, J., Zhang, X., Yan, D.M.: Scene text removal via cascaded text stroke detection and erasing. Comput. Vis. Media 8, 273–287 (2022)
Chen, Z., Lu, X., Zhang, L., Xiao, C.: Semi-supervised video shadow detection via image-assisted pseudo-label generation. In: ACM MM, pp. 2700–2708 (2022)
Feng, Q., Liu, Y., Lai, Y.K., Yang, J., Li, K.: FOF: learning Fourier occupancy field for monocular real-time human reconstruction. In: NeurIPS (2022)
Gastal, E.S.L., Oliveira, M.M.: Domain transform for edge-aware image and video processing. ACM Trans. Gr. 30(4), 1–12 (2011)
Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, pp. 270–279 (2017)
Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: ICCV, pp. 3828–3838 (2019)
Güler, R.A., Neverova, N., Kokkinos, I.: DensePose: dense human pose estimation in the wild. In: CVPR, pp. 7297–7306 (2018)
Habermann, M., Xu, W., Zollhöfer, M., Pons-Moll, G., Theobalt, C.: DeepCap: monocular human performance capture using weak supervision. In: CVPR, pp. 5052–5063 (2020)
He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2012)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo sequences. In: ICCV, pp. 1–7 (2007)
Hur, J., Roth, S.: Self-supervised monocular scene flow estimation. In: CVPR, pp. 7396–7405 (2020)
Jafarian, Y., Park, H.S.: Learning high fidelity depths of dressed humans by watching social media dance videos. In: CVPR, pp. 12753–12762 (2021)
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR, pp. 7122–7131 (2018)
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR (2015)
Krishna, K., Murty, M.N.: Genetic k-means algorithm. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 29(3), 433–439 (1999)
Kuznietsov, Y., Stuckler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: CVPR, pp. 6647–6655 (2017)
Lähner, Z., Cremers, D., Tung, T.: DeepWrinkles: accurate and realistic clothing modeling. In: ECCV, pp. 667–684 (2018)
Lazova, V., Insafutdinov, E., Pons-Moll, G.: 360-degree textures of people in clothing from a single image. In: 3DV, pp. 643–653 (2019)
Li, Y., Luo, F., Li, W., Zheng, S., Wu, H.H., Xiao, C.: Self-supervised monocular depth estimation based on image texture detail enhancement. Vis. Comput. 37(9), 2567–2580 (2021)
Li, Y., Luo, F., Xiao, C.: Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Comput. Vis. Media 8(4), 631–647 (2022)
Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., Freeman, W.T.: Learning the depths of moving people by watching frozen people. In: CVPR, pp. 4521–4530 (2019)
Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: CVPR, pp. 6498–6508 (2021)
Liang, X., Gong, K., Shen, X., Lin, L.: Look into person: joint body parsing and pose estimation network and a new benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 41(4), 871–885 (2018)
Liu, X., Qi, C.R., Guibas, L.J.: FlowNet3D: learning scene flow in 3D point clouds. In: CVPR, pp. 529–537 (2019)
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Gr. 34(6), 1–16 (2015)
Luo, F., Wei, L., Xiao, C.: Stable depth estimation within consecutive video frames. In: CGI, pp. 54–66 (2021)
Luo, F., Zhu, Y., Fu, Y., Zhou, H., Chen, Z., Xiao, C.: Sparse RGB-D images create a real thing: a flexible voxel-based 3D reconstruction pipeline for single object. Vis. Inf. 7(1), 66–76 (2023)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV, pp. 405–421 (2020)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV, pp. 483–499 (2016)
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR, pp. 10975–10985 (2019)
Petrovai, A., Nedevschi, S.: Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In: CVPR, pp. 1578–1588 (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241 (2015)
Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In: ICCV, pp. 2304–2314 (2019)
Saito, S., Simon, T., Saragih, J., Joo, H.: PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In: CVPR, pp. 84–93 (2020)
Schuster, R., Wasenmüller, O., Kuschk, G., Bailer, C., Stricker, D.: SceneFlowFields: dense interpolation of sparse scene flow correspondences. In: WACV, pp. 1056–1065 (2018)
She, D., Xu, K.: An image-to-video model for real-time video enhancement. In: ACM MM, pp. 1837–1846 (2022)
Tang, S., Tan, F., Cheng, K., Li, Z., Zhu, S., Tan, P.: A neural network for detailed human depth estimation from a single image. In: ICCV, pp. 7750–7759 (2019)
Teed, Z., Deng, J.: RAFT-3D: scene flow using rigid-motion embeddings. In: CVPR, pp. 8375–8384 (2021)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Vedula, S., Baker, S., Rander, P., Collins, R., Kanade, T.: Three-dimensional scene flow. In: ICCV, pp. 722–729 (1999)
Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Wang, Z., Li, S., Howard-Jenkins, H., Prisacariu, V., Chen, M.: Flownet3d++: geometric losses for deep scene flow estimation. In: WACV, pp. 91–98 (2020)
Wei, Y., Wang, Z., Rao, Y., Lu, J., Zhou, J.: PV-RAFT: point-voxel correlation fields for scene flow estimation of point clouds. In: CVPR, pp. 6954–6963 (2021)
Xiu, Y., Yang, J., Tzionas, D., Black, M.J.: ICON: implicit clothed humans obtained from normals. In: CVPR, pp. 13286–13296 (2022)
Yang, G., Ramanan, D.: Learning to segment rigid motions from two frames. In: CVPR, pp. 1266–1275 (2021)
Yu, T., Zheng, Z., Guo, K., Liu, P., Dai, Q., Liu, Y.: Function4D: real-time human volumetric capture from very sparse consumer RGBD sensors. In: CVPR, pp. 5746–5756 (2021)
Zhang, F., Li, Y., You, S., Fu, Y.: Learning temporal consistency for low light video enhancement from single images. In: CVPR, pp. 4967–4976 (2021)
Zhang, W., Yan, Q., Xiao, C.: Detail preserved point cloud completion via separated feature aggregation. In: ECCV, pp. 512–528 (2020)
Zhang, X., Ge, Y., Qiao, Y., Li, H.: Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In: CVPR, pp. 3436–3445 (2021)
Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. ACM Trans. Gr. 40(4), 1–12 (2021)
Zheng, Z., Yu, T., Liu, Y., Dai, Q.: PaMIR: parametric model-conditioned implicit representation for image-based human reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 3170–3184 (2022)
Zheng, Z., Yu, T., Wei, Y., Dai, Q., Liu, Y.: DeepHuman: 3D human reconstruction from a single image. In: CVPR, pp. 7739–7749 (2019)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR, pp. 1851–1858 (2017)
Acknowledgements
This work is partially supported by NSFC (No. 61972298), Bingtuan Science and Technology Program (No. 2019BC008), and CAAI-Huawei MindSpore Open Fund.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was co-supervised by Chun-Xia Xiao and Fei Luo.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Luo, F. & Xiao, C. Monocular human depth estimation with 3D motion flow and surface normals. Vis Comput 39, 3701–3713 (2023). https://doi.org/10.1007/s00371-023-02995-8