Abstract
TV shows depict a wide variety of human behaviors and have been studied extensively as a rich source of data for many applications. However, the majority of existing work focuses on 2D recognition tasks. In this paper, we make the observation that there is a certain persistence in TV shows, i.e., the same environments and humans recur across episodes, which makes 3D reconstruction of this content possible. Building on this insight, we propose an automatic approach that operates on an entire season of a TV show and aggregates information in 3D: we build a 3D model of the environment and compute camera information, static 3D scene structure, and body scale information. We then demonstrate how this information acts as rich 3D context that can guide and improve the recovery of 3D human pose and position in these environments. Moreover, we show that reasoning about humans and their environment in 3D enables a broad range of downstream applications: re-identification, gaze estimation, cinematography, and image editing. We apply our approach to environments from seven iconic TV shows and perform an extensive evaluation of the proposed system.
G. Pavlakos and E. Weber—Equal contribution.
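The season-scale aggregation described in the abstract hinges on registering frames from many shots into one static reconstruction. The sketch below illustrates that step with an off-the-shelf structure-from-motion tool; it is a minimal sketch under stated assumptions, not the paper's pipeline: it assumes the COLMAP command-line tools are installed and that frames have already been sampled from every episode of a season into a single directory (the directory names and the `reconstruct_environment` helper are hypothetical).

```python
"""Minimal sketch of the environment-reconstruction step: frames sampled
across an entire season are registered jointly with structure-from-motion,
yielding per-frame camera parameters and a sparse static 3D scene."""
import subprocess
from pathlib import Path

FRAMES_DIR = Path("season_frames")  # hypothetical: frames from all episodes
WORK_DIR = Path("sfm_output")


def run(cmd):
    """Run a COLMAP CLI command, raising on failure."""
    subprocess.run(cmd, check=True)


def reconstruct_environment(frames_dir: Path, work_dir: Path) -> None:
    work_dir.mkdir(exist_ok=True)
    database = work_dir / "database.db"
    sparse = work_dir / "sparse"
    sparse.mkdir(exist_ok=True)

    # 1. Detect and describe local features in every sampled frame.
    run(["colmap", "feature_extractor",
         "--database_path", str(database),
         "--image_path", str(frames_dir)])

    # 2. Match features across frames; the repetition of sets across a
    #    season supplies the overlapping views that SfM needs.
    run(["colmap", "exhaustive_matcher",
         "--database_path", str(database)])

    # 3. Incremental SfM: recovers camera intrinsics/extrinsics per frame
    #    plus a sparse point cloud of the static scene structure.
    run(["colmap", "mapper",
         "--database_path", str(database),
         "--image_path", str(frames_dir),
         "--output_path", str(sparse)])


if __name__ == "__main__":
    reconstruct_environment(FRAMES_DIR, WORK_DIR)
```

Once cameras and sparse structure are recovered, simple pinhole geometry hints at how such context can constrain people in the scene: with focal length f (in pixels), a person of known height H detected with pixel height h lies at depth roughly z ≈ fH/h, so body scale together with calibrated cameras helps localize humans in the shared 3D frame.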
Acknowledgements
This research was supported by the DARPA Machine Common Sense program as well as BAIR/BDD sponsors. Matthew Tancik is supported by the NSF GRFP.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pavlakos, G., Weber, E., Tancik, M., Kanazawa, A. (2022). The One Where They Reconstructed 3D Humans and Environments in TV Shows. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_41
DOI: https://doi.org/10.1007/978-3-031-19836-6_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)