Abstract
Human action recognition is one of the most challenging tasks in computer vision, and the recent development of depth sensors has created new opportunities in this field of research. In this paper, a novel supervised spatio-temporal kernel descriptor (SSTKDes) is proposed that builds a discriminative and compact feature representation of actions from RGB-depth videos. To enhance the descriptive and discriminative ability of the descriptor, the extracted primary kernel-based features are transformed into a new space using a supervised training strategy, namely large margin nearest neighbor (LMNN). LMNN substantially reduces the error of a nearest neighbor classifier by minimizing intra-class variations and maximizing inter-class distances. Subsequently, the efficient match kernel (EMK) is used to abstract the mid-level kernel features for more efficient classification. The proposed approach is evaluated on five public benchmark datasets. The experimental evaluations demonstrate that the proposed method outperforms state-of-the-art methods.
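To make the LMNN step concrete, the following is a minimal sketch of the standard LMNN objective for a linear map, not the authors' implementation. The function name `lmnn_loss`, its parameters (`k`, `mu`, `margin`), and the toy data are illustrative assumptions. The "pull" term shrinks distances between each sample and its same-class target neighbors (intra-class variation), while the "push" term penalizes differently labeled samples that fall within a margin of those neighbors (inter-class distance).

```python
import numpy as np

def lmnn_loss(X, y, L, k=1, mu=0.5, margin=1.0):
    """Standard LMNN objective for a linear map L (illustrative sketch).

    pull: squared distances from each sample to its k target neighbors
    push: hinge penalty for differently labeled samples (impostors) that
          come within `margin` of a target-neighbor distance
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    L = np.asarray(L, dtype=float)
    n = len(X)

    # Target neighbors are fixed up front using Euclidean distance in the
    # input space (an assumption of this sketch).
    d_in = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    # Squared pairwise distances after applying the linear transform.
    Z = X @ L.T
    d = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)

    pull, push = 0.0, 0.0
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(y != y[i])
        targets = same[np.argsort(d_in[i, same])[:k]]
        for j in targets:
            pull += d[i, j]
            push += np.sum(np.maximum(0.0, margin + d[i, j] - d[i, diff]))
    return (1.0 - mu) * pull + mu * push

# Toy data (hypothetical): classes are separated along axis 0 but spread
# widely along axis 1, so nearest neighbors in the raw space are impostors.
X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

loss_identity = lmnn_loss(X, y, np.eye(2))
# Stretch the discriminative axis and shrink the noisy one.
loss_stretched = lmnn_loss(X, y, np.diag([3.0, 0.1]))
print(loss_identity, loss_stretched)
```

On this toy set, the map that stretches the discriminative axis yields a lower loss than the identity map. An actual LMNN solver would minimize this objective over `L` rather than compare two hand-picked candidates.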
Cite this article
Asadi-Aghbolaghi, M., Kasaei, S. Supervised spatio-temporal kernel descriptor for human action recognition from RGB-depth videos. Multimed Tools Appl 77, 14115–14135 (2018). https://doi.org/10.1007/s11042-017-5017-y
DOI: https://doi.org/10.1007/s11042-017-5017-y