The One Where They Reconstructed 3D Humans and Environments in TV Shows

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)

Abstract

TV shows depict a wide variety of human behaviors and have been studied extensively as a potentially rich source of data for many applications. However, most existing work focuses on 2D recognition tasks. In this paper, we observe that TV shows exhibit a certain persistence, i.e., repetition of the environments and the humans, which makes 3D reconstruction of this content possible. Building on this insight, we propose an automatic approach that operates on an entire season of a TV show and aggregates information in 3D: we build a 3D model of the environment and compute camera parameters, static 3D scene structure, and body scale information. We then demonstrate how this information acts as rich 3D context that can guide and improve the recovery of 3D human pose and position in these environments. Moreover, we show that reasoning about humans and their environment in 3D enables a broad range of downstream applications: re-identification, gaze estimation, cinematography, and image editing. We apply our approach to environments from seven iconic TV shows and perform an extensive evaluation of the proposed system.

G. Pavlakos and E. Weber—Equal contribution.
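
As a rough illustration of how recovered cameras and static scene structure can serve as 3D context for placing people, the sketch below back-projects a detected 2D ankle keypoint through a calibrated camera onto a floor plane fit to a reconstructed environment. This is a minimal sketch of the general idea, not the authors' actual pipeline or optimization; the intrinsics, camera pose, plane coefficients, and keypoint are all made-up placeholder values.

```python
# Minimal sketch (assumption: not the authors' actual method) of using an
# SfM-registered camera plus a floor plane recovered from the static scene to
# place a person in the shared 3D coordinate frame of an environment.
# All numeric values below are made-up placeholders.
import numpy as np


def backproject_to_plane(uv, K, R, t, plane):
    """Intersect the viewing ray through pixel `uv` with a 3D plane.

    K     : (3, 3) camera intrinsics
    R, t  : world-to-camera rotation (3, 3) and translation (3,)
    plane : (4,) coefficients [a, b, c, d] of a*x + b*y + c*z + d = 0
    Returns the intersection point in world coordinates.
    """
    cam_center = -R.T @ t                      # camera center in world coords
    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    ray_world = R.T @ ray_cam
    ray_world /= np.linalg.norm(ray_world)

    n, d = plane[:3], plane[3]
    denom = n @ ray_world
    if abs(denom) < 1e-8:
        raise ValueError("viewing ray is parallel to the plane")
    s = -(n @ cam_center + d) / denom          # distance along the ray
    if s <= 0:
        raise ValueError("plane is behind the camera")
    return cam_center + s * ray_world


# Placeholder values: world frame aligned with the camera (x right, y down,
# z forward); the floor is the plane y = 1.6, i.e. 1.6 m below the camera.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
floor_plane = np.array([0.0, 1.0, 0.0, -1.6])

ankle_uv = (980.0, 900.0)                      # hypothetical 2D ankle detection
foot_3d = backproject_to_plane(ankle_uv, K, R, t, floor_plane)
print("estimated 3D foot position:", foot_3d)  # -> roughly [0.09, 1.6, 4.44]
```

Once a contact point such as the foot is anchored in the shared frame, the person's depth and global translation are no longer ambiguous, which is the kind of constraint that 3D scene context can contribute to human pose and position recovery.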

Acknowledgements

This research was supported by the DARPA Machine Common Sense program as well as BAIR/BDD sponsors. Matthew Tancik is supported by the NSF GRFP.

Author information

Corresponding author

Correspondence to Georgios Pavlakos.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16168 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pavlakos, G., Weber, E., Tancik, M., Kanazawa, A. (2022). The One Where They Reconstructed 3D Humans and Environments in TV Shows. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_41

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_41

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
