
M²AST: MLP-Mixer-based adaptive spatial-temporal graph learning for human motion prediction

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Human motion prediction, forecasting future poses from a historical sequence, is a challenging task in human-centric computer vision. Despite recent progress in modeling the spatial-temporal relationships of motion sequences with complex structured graphs, few approaches provide an adaptive and lightweight representation for the varying graph structures of human motion. Inspired by the advantages of MLP-Mixer, a lightweight architecture designed for learning complex interactions in multi-dimensional data, we explore its potential as a backbone for motion prediction. To this end, we propose M²AST, a novel MLP-Mixer-based adaptive spatial-temporal pattern learning framework. It comprises an adaptive spatial mixer that models the spatial relationships between joints, an adaptive temporal mixer that learns temporal smoothness, and a local dynamic mixer that captures fine-grained cross-dependencies between the joints of adjacent poses. The resulting method achieves a compact representation of human motion dynamics by adaptively considering spatial-temporal dependencies from coarse to fine. Unlike a trivial spatial-temporal MLP-Mixer, our approach captures local and global spatial-temporal relationships simultaneously and more effectively. We extensively evaluated the framework on three commonly used benchmarks (Human3.6M, AMASS, and 3DPW), where it matches or outperforms existing state-of-the-art methods in both short- and long-term prediction despite having significantly fewer parameters. Overall, the proposed framework offers a novel and efficient solution for human motion prediction with adaptive graph learning.
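As a rough, dependency-free illustration of the token-mixing idea the framework builds on (this is not the authors' implementation: the shapes, the single weight matrix per mixer, and the residual placement are assumptions made for the sketch), a motion sequence can be treated as a T × J matrix and mixed first across joints, then across frames:

```python
# Minimal sketch of MLP-Mixer-style spatial/temporal mixing for a motion
# sequence, represented as a T x J matrix (T frames, J joint features).
# Pure-Python stand-in: one weight matrix + residual per mixer, where the
# real model would use a learned two-layer MLP with nonlinearity.

def transpose(m):
    """Swap rows and columns of a matrix given as a list of lists."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Naive matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mix(tokens, weight):
    """One mixing layer across the token (row) axis, with a residual
    connection as in MLP-Mixer."""
    mixed = matmul(weight, tokens)
    return [[mixed[i][j] + tokens[i][j] for j in range(len(tokens[0]))]
            for i in range(len(tokens))]

def spatial_temporal_block(seq, w_spatial, w_temporal):
    """seq: T x J. w_spatial: J x J, mixes joints within each frame.
    w_temporal: T x T, mixes frames for each joint."""
    # Spatial mixing: transpose to J x T, mix across joints, transpose back.
    s = transpose(mix(transpose(seq), w_spatial))
    # Temporal mixing: mix across the T frames.
    return mix(s, w_temporal)
```

In the paper's framework the spatial and temporal weights are additionally made adaptive (learned per structure rather than fixed), and a third, local dynamic mixer restricts mixing to joints of adjacent poses; the sketch above only shows the shared transpose-mix-transpose pattern.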



Data availability

All data supporting the findings of this study are publicly available.

Notes

  1. https://pytorch.org/.

  2. https://colab.research.google.com/.


Author information


Contributions

  • Junyi Tang (J.T.): Led the programming of the experiments: implemented the algorithms, tested the code, and ensured its efficiency in producing the desired motion prediction outcomes. Also contributed to the visualization of results, creating figures and tables that summarize the findings.
  • Simin An (S.A.): Provided strategic guidance and contributed to idea development and research structuring. Together with Y.S., played a pivotal role in writing, revising, and finalizing the manuscript, ensuring that the paper adhered to academic standards and clearly communicated the findings.
  • Yuanwei Liu (Y.L.): Worked with J.T. and J.C. on coding, debugging, and refining the motion prediction algorithms, and on optimization to improve model efficiency and accuracy. Also helped generate visualizations and organize experimental results into tables.
  • Yong Su (Y.S.): Led the conceptualization and theoretical framing of the study, contributing original ideas and shaping the research direction. Collaborated with S.A. and the other authors in writing, revising, and finalizing the manuscript.
  • Jin Chen (J.C.): Collaborated closely with J.T. and Y.L. on programming and implementation: algorithm development, code testing, and optimization. Also contributed to visualizing the results and preparing figures and tables.

Corresponding author

Correspondence to Simin An.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by R. Huang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


See Figs. 9 and 10.

Fig. 9: Visualization of the structure of the Local Dynamic Mixer (LDM), Adaptive Spatial Mixer (ASM), and Adaptive Temporal Mixer (ATM)

Fig. 10: t-SNE visualization of prediction performance on three actions from Human3.6M

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Tang, J., An, S., Liu, Y. et al. M2AST: MLP-Mixer-based adaptive spatial-temporal graph learning for human motion prediction. Multimedia Systems 30, 206 (2024). https://doi.org/10.1007/s00530-024-01351-7
