Abstract
Multi-modality based human action recognition has attracted increasing attention, since multiple modalities can provide richer and more complementary information than a single modality. However, it remains difficult for multi-modality learning to effectively capture spatial-temporal information from entire RGB and depth sequences. In this paper, to obtain a better representation of spatial-temporal information, we propose a bidirectional rank pooling method to construct RGB Visual Dynamic Images (VDIs) and Depth Dynamic Images (DDIs). Furthermore, we design an effective segmentation convolutional network (ConvNet) architecture based on a multi-modality hierarchical fusion strategy for human action recognition. The proposed method achieves state-of-the-art results on the widely used NTU RGB+D, SYSU 3D HOI and UWA3D II datasets.
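The dynamic images mentioned above summarize an entire video as a single image via rank pooling; a minimal NumPy sketch of one common variant, approximate rank pooling (the closed-form coefficients from Bilen et al., CVPR 2016), is shown below. The bidirectional version here simply applies the same pooling to the frame sequence in forward and reversed order; the function and variable names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def rank_pooling_coefficients(T):
    # Approximate rank pooling coefficients:
    #   alpha_t = 2(T - t + 1) - (T + 1) * (H_T - H_{t-1}),
    # where H_t is the t-th harmonic number (H_0 = 0).
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    return 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])

def dynamic_image(frames):
    # frames: array of shape (T, H, W, C); returns one (H, W, C) image
    # formed as a coefficient-weighted sum over the time axis.
    frames = np.asarray(frames, dtype=np.float64)
    alpha = rank_pooling_coefficients(frames.shape[0])
    return np.tensordot(alpha, frames, axes=(0, 0))

def bidirectional_dynamic_images(frames):
    # Forward pooling captures appearance evolution in time order;
    # the backward image re-applies the pooling to the reversed sequence.
    frames = np.asarray(frames)
    return dynamic_image(frames), dynamic_image(frames[::-1])
```

Note that the coefficients sum to zero, so a static (constant) video maps to an all-zero dynamic image; only temporal change survives the pooling, which is what makes the representation useful as ConvNet input.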
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFB1308000), National Natural Science Funds of China (U1913202, U1813205, U1713213, 61772508, 61801428), Shenzhen Technology Project (JCYJ20180507182610734, JCYJ20170413152535587), CAS Key Technology Talent Program, and Zhejiang Provincial Natural Science Foundation of China (LY18F020034).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ziliang Ren and Qieshi Zhang contributed equally to this work.
About this article
Cite this article
Ren, Z., Zhang, Q., Gao, X. et al. Multi-modality learning for human action recognition. Multimed Tools Appl 80, 16185–16203 (2021). https://doi.org/10.1007/s11042-019-08576-z