Long-Term Human Motion Prediction with Scene Context

Zhe Cao¹²,
Hang Gao¹²,
Karttikeya Mangalam¹²,
Qi-Zhi Cai¹³,
Minh Vo¹²,¹³ &
Jitendra Malik¹²

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12346)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment: imagine how hard it is to navigate a new room with the lights off. Existing work on human motion prediction does not take scene context into account and thus struggles with long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a synthetic dataset with clean annotations. On both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods. Project page: https://people.eecs.berkeley.edu/~zhecao/hmp/index.html (Please refer to our arXiv version for a longer paper with more visualizations.)
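The control flow of the three stages (sample goals, plan a path to each goal, predict poses along each path) can be illustrated with a minimal sketch. This is not the authors' implementation: each stage here is a hand-written toy stand-in for what the paper learns from data, the scene image is ignored entirely, and every function name, parameter, and array shape (`sample_goals`, `plan_path`, `predict_poses`, `forecast`, joint 0 as the root) is a hypothetical choice made for illustration.

```python
import numpy as np


def sample_goals(pose_history, num_goals, rng, spread=1.0):
    """Stage 1 (toy stand-in): sample goal positions around the last root position."""
    last_root = pose_history[-1, 0]  # assumption: joint 0 is the root joint
    noise = rng.standard_normal((num_goals, last_root.shape[0]))
    return last_root + spread * noise


def plan_path(start, goal, horizon):
    """Stage 2 (toy stand-in): straight-line path with `horizon` future steps."""
    t = np.linspace(0.0, 1.0, horizon + 1)[1:, None]  # drop t=0 (the start itself)
    return (1.0 - t) * start + t * goal


def predict_poses(last_pose, path):
    """Stage 3 (toy stand-in): rigidly translate the last pose along the path."""
    offsets = last_pose - last_pose[0]  # joint positions relative to the root
    return path[:, None, :] + offsets[None, :, :]


def forecast(pose_history, num_goals=3, horizon=10, seed=0):
    """Run the three stages once per sampled goal.

    pose_history has shape (frames, joints, dims). Returns the goals with shape
    (num_goals, dims) and futures with shape (num_goals, horizon, joints, dims).
    """
    rng = np.random.default_rng(seed)
    goals = sample_goals(pose_history, num_goals, rng)
    last_pose = pose_history[-1]
    futures = []
    for goal in goals:
        path = plan_path(last_pose[0], goal, horizon)
        futures.append(predict_poses(last_pose, path))
    return goals, np.stack(futures)


if __name__ == "__main__":
    history = np.zeros((5, 4, 3))       # 5 observed frames, 4 joints, 3D
    history[-1, 1] = [0.0, 1.0, 0.0]    # one joint offset from the root
    goals, futures = forecast(history, num_goals=2, horizon=6)
    print(goals.shape, futures.shape)
```

Producing one future per sampled goal mirrors the abstract's description of multiple motion goals, each with its own path and pose sequence. In this toy version the root joint reaches its goal exactly at the last predicted step, which a learned model would not guarantee.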

Notes

1. We choose to represent the scene by RGB images rather than RGBD scans because they are more readily available in many practical applications.
2. Dataset available at https://github.com/ZheC/GTA-IM-Dataset

Acknowledgement

We thank Carsten Stoll and Christoph Lassner for their helpful feedback. We are also very grateful for the discussions within the BAIR community.

Author information

Authors and Affiliations

UC Berkeley, Berkeley, USA
Zhe Cao, Hang Gao, Karttikeya Mangalam, Minh Vo & Jitendra Malik
Nanjing University, Nanjing, China
Qi-Zhi Cai & Minh Vo

Corresponding author

Correspondence to Zhe Cao.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2787 KB)

Supplementary material 2 (mp4 80700 KB)

About this paper

Cite this paper

Cao, Z., Gao, H., Mangalam, K., Cai, QZ., Vo, M., Malik, J. (2020). Long-Term Human Motion Prediction with Scene Context. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_23

DOI: https://doi.org/10.1007/978-3-030-58452-8_23
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58451-1
Online ISBN: 978-3-030-58452-8
eBook Packages: Computer Science, Computer Science (R0)
