Abstract
Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment – imagine how hard it is to navigate a new room with lights off. Existing works on predicting human motion do not pay attention to the scene context and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a synthetic dataset with clean annotations. In both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods. Project page: https://people.eecs.berkeley.edu/~zhecao/hmp/index.html (Please refer to our arXiv for a longer version of the paper with more visualizations.)
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
We choose to represent the scene by RGB images rather than RGBD scans because they are more readily available in many practical applications.
- 2.
Dataset available in https://github.com/ZheC/GTA-IM-Dataset
References
CMU Motion Capture Database. http://mocap.cs.cmu.edu
Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Nonrigid structure from motion in trajectory space. In: NIPS (2009)
Akhter, I., Simon, T., Khan, S., Matthews, I., Sheikh, Y.: Bilinear spatiotemporal basis models. SIGGRAPH (2012)
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: human trajectory prediction in crowded spaces. In: CVPR (2016)
Alahi, A., Ramanathan, V., Fei-Fei, L.: Socially-aware large-scale crowd forecasting. In: CVPR (2014)
Alexopoulos, C., Griffin, P.M.: Path planning for a mobile robot. IEEE Trans. Syst. Man Cybern. (1992)
Brand, M., Hertzmann, A.: Style machines. SIGGRAPH (2000)
Chai, Y., Sapp, B., Bansal, M., Anguelov, D.: Multipath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In: CoRL (2019)
Chao, Y.W., Yang, J., Price, B., Cohen, S., Deng, J.: Forecasting human dynamics from static images. In: CVPR (2017)
Chen, Y., Huang, S., Yuan, T., Qi, S., Zhu, Y., Zhu, S.C.: Holistic++ scene understanding: single-view 3D holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In: ICCV (2019)
Chiu, H.K., Adeli, E., Wang, B., Huang, D.A., Niebles, J.C.: Action-agnostic human pose forecasting. In: WACV (2019)
Elhayek, A., Stoll, C., Hasler, N., Kim, K.I., Seidel, H.P., Theobalt, C.: Spatio-temporal motion tracking with unsynchronized cameras. In: CVPR (2012)
Fabbri, M., Lanzi, F., Calderara, S., Palazzi, A., Vezzani, R., Cucchiara, R.: Learning to detect and track visible and occluded body joints in a virtual world. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 450–466. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_27
Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: ICCV (2015)
Ghosh, P., Song, J., Aksan, E., Hilliges, O.: Learning human motion models for long-term predictions. In: 3DV (2017)
Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social GAN: socially acceptable trajectories with generative adversarial networks. In: CVPR (2018)
Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3D human pose ambiguities with 3D scene constraints. In: ICCV (2019)
Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Phys. Rev. E (1995)
Hernandez, A., Gall, J., Moreno-Noguer, F.: Human motion prediction via spatio-temporal inpainting. In: CVPR (2019)
Holden, D., Saito, J., Komura, T., Joyce, T.: Learning motion manifolds with convolutional autoencoders. In: SIGGRAPH Asian Technical Briefs (2015)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI (2013)
Jain, A., Zamir, A.R., Savarese, S., Saxena, A.: Structural-RNN: deep learning on spatio-temporal graphs. In: CVPR (2016)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. ICLR (2014)
Kitani, K.M., Ziebart, B.D., Bagnell, J.A., Hebert, M.: Activity forecasting. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 201–214. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33765-9_15
Krähenbühl, P.: Free supervision from video games. In: CVPR (2018)
LaValle, S.M.: Planning Algorithms. Cambridge University Press (2006)
Law, H., Teng, Y., Russakovsky, O., Deng, J.: CornerNet-Lite: efficient keypoint based object detection. arXiv preprint arXiv:1904.08900 (2019)
Lee, D., Liu, S., Gu, J., Liu, M.Y., Yang, M.H., Kautz, J.: Context-aware synthesis and placement of object instances. In: NIPS (2018)
Lerner, A., Chrysanthou, Y., Lischinski, D.: Crowds by example. In: CGF (2007)
Li, C., Zhang, Z., Sun Lee, W., Hee Lee, G.: Convolutional sequence to sequence model for human dynamics. In: CVPR (2018)
Li, X., Liu, S., Kim, K., Wang, X., Yang, M.H., Kautz, J.: Putting humans in a scene: learning affordance in 3D indoor environments. In: CVPR (2019)
Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., Li, H.: Auto-conditioned recurrent networks for extended complex human motion synthesis. In: ICLR (2018)
Ma, W.C., Huang, D.A., Lee, N., Kitani, K.M.: Forecasting interactive dynamics of pedestrians with fictitious play. In: CVPR (2017)
Makansi, O., Ilg, E., Cicek, O., Brox, T.: Overcoming limitations of mixture density networks: a sampling and fitting framework for multimodal future prediction. In: CVPR (2019)
von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 614–631. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_37
Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: CVPR (2017)
Monszpart, A., Guerrero, P., Ceylan, D., Yumer, E., Mitra, N.J.: iMapper: interaction-guided joint scene and human motion mapping from monocular videos. SIGGRAPH (2019)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Pavlakos, G., et al.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR (2019)
Pavllo, D., Grangier, D., Auli, M.: QuaterNet: a quaternion-based recurrent model for human motion. In: BMVC (2018)
Pavlovic, V., Rehg, J.M., MacCormick, J.: Learning switching linear models of human motion. In: NIPS (2001)
Pellegrini, S., Ess, A., Schindler, K., Van Gool, L.: You’ll never walk alone: modeling social behavior for multi-target tracking. In: CVPR (2009)
Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, H., Savarese, S.: SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. In: CVPR (2019)
Savva, M., Chang, A.X., Hanrahan, P., Fisher, M., Nießner, M.: PiGraphs: Learning Interaction Snapshots from Observations. TOG (2016)
Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 536–553. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_33
Tai, L., Zhang, J., Liu, M., Burgard, W.: Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In: ICRA (2018)
Tay, M.K.C., Laugier, C.: Modelling smooth paths using gaussian processes. In: Laugier, C., Siegwart, R. (eds.) Field and Service Robotics, pp. 381–390. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-75404-6_36
Treuille, A., Cooper, S., Popović, Z.: Continuum crowds. TOG (2006)
Urtasun, R., Fleet, D.J., Geiger, A., Popović, J., Darrell, T.J., Lawrence, N.D.: Topologically-constrained latent variable models. In: ICML (2008)
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H.: Learning to generate long-term future via hierarchical prediction. In: ICML (2017)
Vo, M., Narasimhan, S.G., Sheikh, Y.: Spatiotemporal bundle adjustment for dynamic 3D reconstruction. In: CVPR (2016)
Walker, J., Marino, K., Gupta, A., Hebert, M.: The pose knows: video forecasting by generating pose futures. In: CVPR (2017)
Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models for human motion. TPAMI (2007)
Wang, J.M., Fleet, D.J., Hertzmann, A.: Multifactor gaussian process models for style-content separation. In: ICML (2007)
Wang, X., Girdhar, R., Gupta, A.: Binge watching: scaling affordance learning from sitcoms. In: CVPR (2017)
Wang, Z., Chen, L., Rathore, S., Shin, D., Fowlkes, C.: Geometric pose affordance: 3D human pose with scene constraints. arXiv preprint arXiv:1905.07718 (2019)
Wang, Z., Shin, D., Fowlkes, C.C.: Predicting camera viewpoint improves cross-dataset generalization for 3d human pose estimation. arXiv preprint arXiv:2004.03143 (2020)
Wei, M., Miaomiao, L., Mathieu, S., Hongdong, L.: Learning trajectory dependencies for human motion prediction. In: ICCV (2019)
Weng, C.Y., Curless, B., Kemelmacher-Shlizerman, I.: Photo wake-up: 3D character animation from a single photo. In: CVPR (2019)
Yu, T., et al.: One-shot imitation from observing humans via domain-adaptive meta-learning. IROS (2018)
Zhang, J.Y., Felsen, P., Kanazawa, A., Malik, J.: Predicting 3D human dynamics from video. In: ICCV (2019)
Zhao, L., Peng, X., Tian, Yu., Kapadia, M., Metaxas, D.: Learning to forecast and refine residual motion for image-to-video generation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 403–419. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_24
Ackownledgement
We thank Carsten Stoll and Christoph Lassner for the helpful feedback. We are also very grateful for the discussion within the BAIR community.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary material 2 (mp4 80700 KB)
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Cao, Z., Gao, H., Mangalam, K., Cai, QZ., Vo, M., Malik, J. (2020). Long-Term Human Motion Prediction with Scene Context. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_23
Download citation
DOI: https://doi.org/10.1007/978-3-030-58452-8_23
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58451-1
Online ISBN: 978-3-030-58452-8
eBook Packages: Computer ScienceComputer Science (R0)