Abstract
Multi-modality based human action recognition has attracted increasing attention, since multiple modalities can provide richer and more complementary information than a single modality. However, it remains difficult for multi-modality learning to effectively capture spatial-temporal information from entire RGB and depth sequences. In this paper, to obtain a better representation of spatial-temporal information, we propose a bidirectional rank pooling method to construct RGB Visual Dynamic Images (VDIs) and Depth Dynamic Images (DDIs). Furthermore, we design an effective segmentation convolutional network (ConvNet) architecture based on a multi-modality hierarchical fusion strategy for human action recognition. The proposed method achieves state-of-the-art results on the widely used NTU RGB+D, SYSU 3D HOI and UWA3D II datasets.
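The dynamic images mentioned above summarize an entire video as a single image via rank pooling; a minimal NumPy sketch of one common variant, approximate rank pooling (the closed-form coefficients from Bilen et al., CVPR 2016), is shown below. The bidirectional version here simply applies the same pooling to the frame sequence in forward and reversed order; the function and variable names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def rank_pooling_coefficients(T):
    # Approximate rank pooling coefficients:
    #   alpha_t = 2(T - t + 1) - (T + 1) * (H_T - H_{t-1}),
    # where H_t is the t-th harmonic number (H_0 = 0).
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    return 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])

def dynamic_image(frames):
    # frames: array of shape (T, H, W, C); returns one (H, W, C) image
    # formed as a coefficient-weighted sum over the time axis.
    frames = np.asarray(frames, dtype=np.float64)
    alpha = rank_pooling_coefficients(frames.shape[0])
    return np.tensordot(alpha, frames, axes=(0, 0))

def bidirectional_dynamic_images(frames):
    # Forward pooling captures appearance evolution in time order;
    # the backward image re-applies the pooling to the reversed sequence.
    frames = np.asarray(frames)
    return dynamic_image(frames), dynamic_image(frames[::-1])
```

Note that the coefficients sum to zero, so a static (constant) video maps to an all-zero dynamic image; only temporal change survives the pooling, which is what makes the representation useful as ConvNet input.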
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFB1308000), National Natural Science Funds of China (U1913202, U1813205, U1713213, 61772508, 61801428), Shenzhen Technology Project (JCYJ20180507182610734, JCYJ20170413152535587), CAS Key Technology Talent Program, and Zhejiang Provincial Natural Science Foundation of China (LY18F020034).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ziliang Ren and Qieshi Zhang contributed equally to this work.
About this article
Cite this article
Ren, Z., Zhang, Q., Gao, X. et al. Multi-modality learning for human action recognition. Multimed Tools Appl 80, 16185–16203 (2021). https://doi.org/10.1007/s11042-019-08576-z