Abstract
Existing zero-shot skeleton-based action recognition methods utilize projection networks to learn a shared latent space of skeleton features and semantic embeddings. The inherent imbalance in action recognition datasets, characterized by variable skeleton sequences yet constant class labels, presents significant challenges for alignment. To address this imbalance, we propose SA-DVAE (Semantic Alignment via Disentangled Variational Autoencoders), a method that first adopts feature disentanglement to separate skeleton features into two independent parts, one semantic-related and the other semantic-irrelevant, so that skeleton and semantic features can be better aligned. We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty. We conduct experiments on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and PKU-MMD, and our results show that SA-DVAE outperforms existing methods. The code is available at https://github.com/pha123661/SA-DVAE.
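To make the disentanglement idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a skeleton-side VAE whose latent code is split into a semantic-related part and a semantic-irrelevant part. The class and function names, the feature and latent dimensions, and the cross-covariance penalty (used here as a simple stand-in for the paper's total correlation term) are all illustrative assumptions.

```python
# Minimal sketch, assuming pooled skeleton features (e.g., from an ST-GCN backbone)
# as input. The cross-covariance penalty is a cheap proxy for a total correlation
# term; all names and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledSkeletonVAE(nn.Module):
    def __init__(self, feat_dim=256, z_sem=64, z_irr=64):
        super().__init__()
        self.z_sem = z_sem
        z_dim = z_sem + z_irr
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, z_dim)
        self.fc_logvar = nn.Linear(512, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                                     nn.Linear(512, feat_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        x_hat = self.decoder(z)
        z_s, z_n = z[:, :self.z_sem], z[:, self.z_sem:]  # semantic-related / irrelevant split
        return x_hat, mu, logvar, z_s, z_n


def vae_loss(x, x_hat, mu, logvar, z_s, z_n, beta=1.0, lam=0.1):
    recon = F.mse_loss(x_hat, x)                                    # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL divergence term
    # Penalize cross-covariance between the two latent parts to discourage
    # statistical dependence (a simplified surrogate for total correlation).
    zs = z_s - z_s.mean(0)
    zn = z_n - z_n.mean(0)
    cov = zs.t() @ zn / (x.size(0) - 1)
    return recon + beta * kld + lam * cov.pow(2).sum()


if __name__ == "__main__":
    model = DisentangledSkeletonVAE()
    feats = torch.randn(32, 256)  # a batch of pooled skeleton features
    out = model(feats)
    print(vae_loss(feats, *out).item())
```

In a full system, only the semantic-related part of the latent would be aligned with the text-side embedding (for example, through a second, text-modality VAE), while the irrelevant part absorbs instance-specific variation.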
Notes
1. Official website: https://rose1.ntu.edu.sg/dataset/actionRecognition/.
2. GitHub link: https://github.com/shahroudy/NTURGB-D.
Acknowledgments
This research was supported by the National Science and Technology Council of Taiwan under grant number 111-2622-8-002-028. The authors would like to thank the NSTC for its generous support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, S.-W., Wei, Z.-X., Chen, W.-J., Yu, Y.-H., Yang, C.-Y., Hsu, J.Y.-J. (2025). SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol. 15074. Springer, Cham. https://doi.org/10.1007/978-3-031-72640-8_25
DOI: https://doi.org/10.1007/978-3-031-72640-8_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72639-2
Online ISBN: 978-3-031-72640-8
eBook Packages: Computer Science, Computer Science (R0)