Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition

Qin Cheng^1,2,
Jun Cheng^2,3,
Ziliang Ren^5,
Qieshi Zhang^2,3 &
Jianming Liu^4 (ORCID: orcid.org/0000-0001-9158-0909)

Abstract

Skeleton data convey significant information for action recognition because they are robust against cluttered backgrounds and illumination variation. In recent years, methods based on convolutional neural networks (CNN) or recurrent neural networks (RNN) have been inferior in recognition accuracy because of their limited ability to extract spatial–temporal features from skeleton data. A series of methods based on graph convolutional networks (GCN) have achieved remarkable performance and gradually become dominant. However, the computational cost of GCN-based methods is quite heavy, with several works even exceeding 100 GFLOPs. This runs contrary to the highly condensed nature of skeleton data. In this paper, a novel multi-scale spatial–temporal convolutional (MSST) module is proposed to exploit the implicit complementary advantages across spatial–temporal representations at different scales. Instead of converting skeleton data into pseudo-images, as some previous CNN-based methods do, or using complex graph convolution, we make full use of multi-scale convolutions on the temporal and spatial dimensions to capture comprehensive dependencies among skeleton joints. Building on the MSST module, a multi-scale spatial–temporal convolutional neural network (MSSTNet) is proposed to capture high-level spatial–temporal semantic features for action recognition. Unlike previous methods, which boost performance at the cost of computation, MSSTNet can be easily implemented with a light model size and fast inference. Moreover, MSSTNet is used in a four-stream framework to fuse data of different modalities, providing a notable improvement in recognition accuracy. On the NTU RGB+D 60, NTU RGB+D 120, UAV-Human and Northwestern-UCLA datasets, the proposed MSSTNet achieves competitive performance with much less computational cost than state-of-the-art methods.
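To make the idea of multi-scale convolution over the temporal and spatial dimensions more concrete, the following is a minimal sketch in plain Python. It is not the MSST module itself: the kernel sizes (1, 3, 5), the fixed averaging kernels standing in for learned weights, the zero padding, the toy clip, and the function names `conv1d_same` and `multi_scale` are all assumptions made for illustration. Treating neighbouring joint indices as spatial neighbours is likewise a simplification.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D correlation; for odd kernel sizes the output is centred on each input position."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def multi_scale(signal: Sequence[float],
                kernel_sizes: Sequence[int] = (1, 3, 5)) -> List[List[float]]:
    """Apply one convolution per scale and return the outputs side by side.

    Averaging kernels are placeholders (assumption); a trained network
    would learn these weights instead.
    """
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


# Hypothetical toy clip: 5 frames x 3 joints, one coordinate per joint.
clip = [[float(t + v) for v in range(3)] for t in range(5)]

# Temporal branch: the trajectory of joint 0 across all frames.
temporal = multi_scale([frame[0] for frame in clip])

# Spatial branch: all joints within frame 0.
spatial = multi_scale(clip[0])
```

Each branch yields one feature sequence per kernel size, so small kernels keep fine detail while larger kernels summarise a wider temporal or spatial neighbourhood. A full model would stack such outputs as channels and learn the kernels.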

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. U1913202, U1813205, U1713213, 61772508, 61262074), the CAS Key Technology Talent Program, the Shenzhen Technology Project (nos. JCYJ20180507182610734, JSGG20191129094012321), the National Natural Science Foundation of Guangdong Province (no. 2022A1515140119), and the Dongguan Science and Technology Special Commissioner Project (no. 20221800500362).

Author information

Authors and Affiliations

School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin, 541004, China
Qin Cheng
CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
Qin Cheng, Jun Cheng & Qieshi Zhang
The Chinese University of Hong Kong, Hong Kong, 999077, China
Jun Cheng & Qieshi Zhang
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
Jianming Liu
School of Computer Science and Technology, Dongguan University of Technology, Dongguan, 523808, China
Ziliang Ren

Corresponding author

Correspondence to Jianming Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, Q., Cheng, J., Ren, Z. et al. Multi-scale spatial–temporal convolutional neural network for skeleton-based action recognition. Pattern Anal Applic 26, 1303–1315 (2023). https://doi.org/10.1007/s10044-023-01156-w

Received: 12 September 2021
Accepted: 21 March 2023
Published: 12 May 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10044-023-01156-w
