ConvST-LSTM-Net: convolutional spatiotemporal LSTM networks for skeleton-based human action recognition

Abhilasha Sharma¹ & Roshni Singh¹


Abstract

Human action recognition (HAR) focuses on perceiving and identifying the actions performed by humans in an image or video. HAR covers motion patterns and both normal and abnormal activities, such as standing, walking, sitting, running, playing, falling, and fighting. It has recently attracted researchers' attention, especially for 3D skeleton sequences. Human actions can be represented as sequences of motions of skeletal keyjoints, although not all keyjoints are informative. Various approaches have been applied to HAR, including LSTM, ConvLSTM, Conv-GRU, and ST-LSTM. So far, ST-LSTM approaches have shown strong performance on 3D skeleton sequence tasks, but the detection of irrelevant keyjoints produces noise that degrades model performance. The aim is therefore to improve the efficacy of the model by focusing only on informative keyjoint coordinates. To this end, this paper introduces a new class of spatiotemporal LSTM approaches, named ConvST-LSTM-Net (convolutional spatiotemporal long short-term memory network), for skeleton-based action recognition. The primary focus of the proposed model is to identify the informative keyjoints in each frame. Extensive experimental analysis shows that ConvST-LSTM-Net outperforms state-of-the-art models on several benchmark datasets for skeleton sequence data, viz. NTU RGB + D 60, UT-Kinetics, UP-Fall Detection, UCF101, and HMDB51.
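To make the idea of "informative keyjoints" concrete, the following is a minimal illustrative sketch, not the ConvST-LSTM-Net mechanism itself, which the abstract does not specify. It assumes a hand-crafted heuristic: a joint is treated as informative if it moves a lot across frames. The function names `joint_motion_energy` and `select_informative_joints`, the top-k rule, and the toy data are all hypothetical.

```python
import math


def joint_motion_energy(sequence):
    """Sum the frame-to-frame Euclidean displacement of each joint.

    sequence: list of frames; each frame is a list of (x, y, z) joint
    coordinates, with the same joint order in every frame.
    """
    num_joints = len(sequence[0])
    energy = [0.0] * num_joints
    # Walk over consecutive frame pairs and accumulate per-joint movement.
    for prev, curr in zip(sequence, sequence[1:]):
        for j in range(num_joints):
            energy[j] += math.dist(prev[j], curr[j])
    return energy


def select_informative_joints(sequence, k):
    """Return the indices of the k most mobile joints, in ascending order.

    Assumption: motion energy stands in for "informativeness"; a learned
    model would replace this heuristic.
    """
    energy = joint_motion_energy(sequence)
    ranked = sorted(range(len(energy)), key=lambda j: energy[j], reverse=True)
    return sorted(ranked[:k])


# Toy sequence: 3 frames, 3 joints. Joint 0 is static, joint 1 moves the most.
toy_sequence = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (0.0, 1.0, 0.1)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)],
]

informative = select_informative_joints(toy_sequence, k=2)
```

On the toy sequence the static joint is discarded and the two moving joints are kept. In a learned model, such a fixed ranking would presumably be replaced by weights trained jointly with the recognition network.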


Availability of data and materials

Not applicable.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Software Engineering, Delhi Technological University, Shahbad Daulatpur, Delhi, 110042, India
Abhilasha Sharma & Roshni Singh

Contributions

Roshni Singh carried out the related studies, participated in the sequence alignment, and drafted the manuscript, including the performance and statistical analysis. Dr. Abhilasha Sharma conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Roshni Singh.

Ethics declarations

Conflict of interest

The authors have no relevant financial or nonfinancial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or nonfinancial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interest in any material discussed in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, A., Singh, R. ConvST-LSTM-Net: convolutional spatiotemporal LSTM networks for skeleton-based human action recognition. Int J Multimed Info Retr 12, 34 (2023). https://doi.org/10.1007/s13735-023-00301-9

Received: 11 February 2023
Revised: 10 August 2023
Accepted: 24 September 2023
Published: 27 October 2023
DOI: https://doi.org/10.1007/s13735-023-00301-9
