Abstract
In this paper, we propose a new approach based on distribution descriptors for action recognition in depth videos. Our local features are computed from binary patterns that incorporate shape and motion cues for effective action recognition. Given pixel-level features, our approach estimates video-local statistics in a hierarchical manner: the distribution of pixel-level features within a frame and the distribution of frame-level descriptors within a video are each modeled with a single Gaussian. In this way, our approach constructs video descriptors directly from low-level features, without the codebook learning required by bag-of-features (BoF) approaches. To capture the spatial geometry and temporal order of a video, we use a spatio-temporal pyramid representation. Our approach is validated on six benchmark datasets: MSRAction3D, MSRGesture3D, DHA, SKIG, UTD-MHAD and CAD-120. The experimental results show that our approach performs well on all six datasets; in particular, it achieves state-of-the-art accuracies on DHA, SKIG and UTD-MHAD.
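To make the two-level estimation concrete, below is a minimal NumPy sketch of the idea, not the authors' exact pipeline: a Gaussian is fitted to the pixel-level features of each frame, each frame Gaussian is flattened into a vector through the matrix logarithm of a standard Gaussian-to-SPD embedding, and a second Gaussian is then fitted over the resulting frame-level descriptors. The function names, the regularization constant, and the choice of embedding are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import logm

def gaussian_descriptor(x, eps=1e-6):
    """Fit a single Gaussian to the rows of x (n samples x d dims) and
    flatten it into a vector via the log of an SPD embedding of (mu, cov).
    The embedding and eps are assumptions made for this sketch."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
    d = mu.size
    p = np.zeros((d + 1, d + 1))        # SPD matrix encoding (mu, cov)
    p[:d, :d] = cov + np.outer(mu, mu)
    p[:d, d] = p[d, :d] = mu
    p[d, d] = 1.0
    l = logm(p)                          # log-Euclidean mapping to a vector space
    iu = np.triu_indices(d + 1)          # the log is symmetric: keep upper part
    return l[iu].real

def video_descriptor(frames):
    """Hierarchical estimation: one Gaussian per frame over pixel-level
    features, then one Gaussian over the frame-level descriptors."""
    frame_descs = np.stack([gaussian_descriptor(f) for f in frames])
    return gaussian_descriptor(frame_descs)

# Toy usage: 20 frames, each with 500 pixel-level features of dimension 8.
rng = np.random.default_rng(0)
video = [rng.standard_normal((500, 8)) for _ in range(20)]
print(video_descriptor(video).shape)
```

In a spatio-temporal pyramid, this procedure would be applied to each pyramid cell and the per-cell descriptors concatenated, so the final representation also encodes where and when features occur.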
Acknowledgments
Portions of the research in this paper use the DHA video dataset collected by the Research Center for Information Technology Innovation (CITI), Academia Sinica.
Appendix: Example to Calculate Binary Patterns for Shape and Motion Cues
An example of computing \(LVP_{shape,\alpha,D}(G_r)\) with \(\alpha = 0^{\circ}\) and \(D = 1\) is given in Fig. 9, where the reference pixel \(G_r\) is marked in red with its depth value. The first of the 8 bits of \(LVP_{shape,\alpha,D}(G_r)\) is computed from \(I^{\prime}_{\alpha,D}(G_{1,r})\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r})\), \(I^{\prime}_{\alpha,D}(G_r)\) and \(I^{\prime}_{\alpha+45^{\circ},D}(G_r)\). Since \(I^{\prime}_{\alpha,D}(G_{1,r}) = 7\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r}) = 1\), \(I^{\prime}_{\alpha,D}(G_r) = -3\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_r) = 1\) and \(1 - \frac{1}{-3} \times 7 > 0\), the first bit of \(LVP_{shape,\alpha,D}(G_r)\) is set to 1. Repeating this comparison for the remaining neighbors yields the binary code \(LVP_{shape,\alpha,D}(G_r) = 11010100\), i.e. 212 in decimal.
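For readers who prefer code, the comparison rule above can be written as a small Python sketch; the helper name `lvp_bit` and its argument names are ours, not the paper's, and serve only to reproduce the worked numbers.

```python
def lvp_bit(d1_ref, d2_ref, d1_nbr, d2_nbr):
    """One LVP comparison bit: the neighbor's second-direction derivative
    is tested against the value predicted from the reference pixel's
    derivative ratio (bit = 1 if the difference is positive)."""
    return 1 if d2_nbr - (d2_ref / d1_ref) * d1_nbr > 0 else 0

# First bit of the shape example above (alpha = 0 deg, D = 1):
# reference derivatives (-3, 1), neighbor derivatives (7, 1).
assert lvp_bit(d1_ref=-3, d2_ref=1, d1_nbr=7, d2_nbr=1) == 1  # 1 - (1/-3)*7 > 0
```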
Another example, computing \(LVP_{motion,D_{1},D_{2}}(G_r)\) with \(D_1 = -1\) and \(D_2 = 1\), is given in Fig. 10. The first of the 8 bits of \(LVP_{motion,D_{1},D_{2}}(G_r)\) is computed from \(I^{\prime}_{D_{1}}(G_{1,r}) = 2\), \(I^{\prime}_{D_{2}}(G_{1,r}) = 1\), \(I^{\prime}_{D_{1}}(G_r) = 3\) and \(I^{\prime}_{D_{2}}(G_r) = -1\). Since \(1 - \frac{-1}{3} \times 2 > 0\), the first bit of \(LVP_{motion,D_{1},D_{2}}(G_r)\) is set to 1. The resulting binary code is \(LVP_{motion,D_{1},D_{2}}(G_r) = 10101101\), i.e. 173 in decimal.
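Under the same assumptions, the `lvp_bit` sketch above reproduces this first motion bit as well; only the derivative values change, since the comparison rule is shared between the shape and motion patterns.

```python
# First bit of the motion example (D1 = -1, D2 = 1):
# reference derivatives (3, -1), neighbor derivatives (2, 1).
assert lvp_bit(d1_ref=3, d2_ref=-1, d1_nbr=2, d2_nbr=1) == 1  # 1 - (-1/3)*2 > 0
```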