Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition

Authors:

Juan J. Pantrigo,

Antonio S. Montemayor,

José F. Vélez

Pattern Recognition, Volume 76, Issue C

Pages 80 - 94

https://doi.org/10.1016/j.patcog.2017.10.033

Published: 01 April 2018

Abstract

Combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) recurrent network for skeleton-based human activity and hand gesture recognition.

Two-stage training strategy which first focuses on the CNN training and then adjusts the full method (CNN+LSTM).

A method for data augmentation in the context of spatiotemporal 3D data sequences.

An exhaustive experimental study on publicly available data benchmarks with respect to the most representative state-of-the-art methods.

Comparison among different CPU and GPU platforms.

In this work, we address human activity and hand gesture recognition problems using 3D data sequences obtained from full-body and hand skeletons, respectively. To this end, we propose a deep learning-based approach for temporal 3D pose recognition problems based on a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) recurrent network. We also present a two-stage training strategy which first focuses on CNN training and then adjusts the full method (CNN+LSTM). Experimental testing demonstrated that our training method obtains better results than a single-stage training strategy. Additionally, we propose a data augmentation method that has also been validated experimentally. Finally, we perform an extensive experimental study on publicly available data benchmarks. The results show that the proposed approach reaches state-of-the-art performance when compared to the methods identified in the literature. The best results were obtained for small datasets, where the proposed data augmentation strategy has greater impact.
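The abstract does not spell out which operations the data augmentation method applies, so the following is only a minimal sketch of the general idea for spatiotemporal 3D data sequences, not the authors' procedure. It assumes a skeleton sequence stored as a NumPy array of shape (frames, joints, 3) and applies three perturbations chosen here purely for illustration: temporal resampling by linear interpolation, a global scale factor, and a global 3D shift. The function name `augment_skeleton_sequence` and all parameter names and default ranges are hypothetical.

```python
import numpy as np


def augment_skeleton_sequence(seq, rng, scale_range=(0.9, 1.1),
                              shift_std=0.05, speed_range=(0.8, 1.2)):
    """Return a randomly perturbed copy of a (frames, joints, 3) sequence.

    All perturbations and default ranges are illustrative assumptions,
    not values taken from the paper.
    """
    seq = np.asarray(seq, dtype=float)
    n_frames = seq.shape[0]

    # 1) Temporal resampling: linearly interpolate to a new frame count,
    #    which simulates the same movement performed faster or slower.
    speed = rng.uniform(*speed_range)
    new_len = max(2, int(round(n_frames / speed)))
    t_new = np.linspace(0.0, n_frames - 1, new_len)
    lo = np.floor(t_new).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    w = (t_new - lo)[:, None, None]          # interpolation weight per frame
    out = (1.0 - w) * seq[lo] + w * seq[hi]

    # 2) Spatial scaling: one factor shared by every joint and frame.
    out = out * rng.uniform(*scale_range)

    # 3) Spatial shift: one 3D offset shared by every joint and frame.
    out = out + rng.normal(0.0, shift_std, size=3)
    return out


rng = np.random.default_rng(0)
original = rng.normal(size=(40, 20, 3))      # 40 frames, 20 joints, xyz
augmented = augment_skeleton_sequence(original, rng)
print(original.shape, augmented.shape)
```

Calling the function several times on each training sequence yields additional, slightly different samples, which is the kind of enlargement that matters most when the dataset is small.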

Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Information & Contributors

Information

Published In

Pattern Recognition Volume 76, Issue C

April 2018

669 pages

ISSN: 0031-3203

Copyright © Elsevier Ltd.

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 April 2018

Qualifiers

Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 74
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Singh R, Singh L (2025) Dyhand: dynamic hand gesture recognition using BiLSTM and soft attention methods. The Visual Computer: International Journal of Computer Graphics, 41:1, 41-51. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1007/s00371-024-03307-4

Shahi S, Mollyn V, Tymoszek Park C, Kang R, Liberman A, Levy O, Gong J, Bedri A, Laput G (2024) Vision-Based Hand Gesture Customization from a Single Demonstration. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-14. Online publication date: 13-Oct-2024
https://dl.acm.org/doi/10.1145/3654777.3676378

Liu J, Wang X, Wang C, Gao Y, Liu M (2024) Temporal Decoupling Graph Convolutional Network for Skeleton-Based Gesture Recognition. IEEE Transactions on Multimedia, 26, 811-823. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TMM.2023.3271811

Zhao D, Li H, Yan S (2024) Spatial–Temporal Synchronous Transformer for Skeleton-Based Hand Gesture Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 34:3, 1403-1412. Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1109/TCSVT.2023.3295084

Mehmood F, Chen E, Azeem Akbar M, Azam Zia M, Alsanad A, Abdullah Alhogail A, Li Y (2024) Advancements in Human Action Recognition Through 5G/6G Technology for Smart Cities: Fuzzy Integral-Based Fusion. IEEE Transactions on Consumer Electronics, 70:3, 5783-5795. Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1109/TCE.2024.3420936

Mocaër W, Anquetil E, Kulpa R (2024) Early gesture detection in untrimmed streams. Pattern Recognition, 156:C. Online publication date: 18-Nov-2024
https://dl.acm.org/doi/10.1016/j.patcog.2024.110733

Lotfipoor A, Patidar S, Jenkins D (2024) Deep neural network with empirical mode decomposition and Bayesian optimisation for residential load forecasting. Expert Systems with Applications: An International Journal, 237:PA. Online publication date: 27-Feb-2024
https://dl.acm.org/doi/10.1016/j.eswa.2023.121355

Mira A, Hellwich O (2024) Deep learning models beyond temporal frame-wise features for hand gesture video recognition. The Journal of Supercomputing, 80:9, 12430-12462. Online publication date: 1-Jun-2024
https://dl.acm.org/doi/10.1007/s11227-024-05910-7

He Y, Liang Z, He S, Wang Y, Yin M (2024) Viewpoint guided multi-stream neural network for skeleton action recognition. Multimedia Tools and Applications, 83:3, 6783-6802. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1007/s11042-023-15676-4

Goyal P, Rani R, Singh K (2024) A multilayered framework for diagnosis and classification of Alzheimer's disease using transfer learned Alexnet and LSTM. Neural Computing and Applications, 36:7, 3777-3801. Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1007/s00521-023-09301-6
