JointContrast: Skeleton-Based Interaction Recognition with New Representation and Contrastive Learning †
Figure 1. IE-Graph representation: a joint edge (black) links the joints within the same subject, and a subject edge (red) links the joints of the same kind between different subjects, which models the interactive information. For brevity, only a portion of the subject edges is shown in the figure.
Figure 2. The Joint Attention Module first pools the spatial-temporal feature map and then uses a linear layer and Softmax to obtain the attention weights, which are multiplied element-wise with the original feature map. As a result, each joint receives a different amount of attention in the feature map.
Figure 3. Our framework consists of two parts: pre-training and fine-tuning. In pre-training, two training samples are generated from each raw sequence by common data augmentation. With the help of the IE-Graph, we feed the samples to the network to extract joint features and update the parameters using the proposed contrastive loss. In fine-tuning, the pre-trained weights are used as the initialization and are further refined on the target downstream task; in addition, a prediction head is added to complete the recognition.
Figure 4. Interaction recognition accuracy on the NTU60 dataset: ST-TR, MS-G3D, and our JointContrast.
Figure 5. Interaction recognition accuracy of models with and without pre-training on the NTU60 dataset; "schedules" denotes the number of training epochs.
Figure 6. Affinity matrices between joints after pre-training, where the numbers denote the different joints.
Figure 7. Attention visualization for the handshake action, where the top ones correspond to the shallow layers of the network. The attention weights are normalized.
Abstract
1. Introduction
- We propose an innovative interaction information embedding graph representation (IE-Graph), which represents interaction information as subject edges and enables a better representation of interactions as well as better feature fusion;
- With the help of IE-Graph, the model can use graph convolution to represent the intra-subject spatial information and inter-subject interaction information in a uniform manner, which allows us to generalize the single-subject action recognition methods to interaction recognition easily;
- We propose a contrastive loss as well as an unsupervised pre-training framework for skeletal data;
- We perform experiments on three popular benchmarks, including SBU, NTU60, and NTU120, and the results show that JointContrast achieves competitive results compared with several popular baseline methods.
2. Related Works
2.1. Skeleton-Based Action Recognition
2.2. Interaction Recognition
2.3. Contrastive Learning
3. Methods
3.1. Preliminaries
3.2. Interaction Embedding Graph
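The body of this section is summarized by the Figure 1 caption: joint edges link the joints within one subject, while subject edges link joints of the same kind across subjects. A minimal sketch of how such an adjacency matrix could be assembled is shown below; the function name, the self-loops, and the symmetric normalization are our assumptions rather than the authors' exact construction.

```python
import numpy as np

def build_ie_graph(num_joints, intra_edges, num_subjects=2):
    """Assemble an IE-Graph adjacency for a multi-subject skeleton.

    intra_edges: list of (i, j) joint pairs forming the skeleton of ONE
    subject (e.g., the bone connections of a 25-joint NTU body). Joint j
    of subject m is indexed as m * num_joints + j in the output matrix.
    """
    size = num_subjects * num_joints
    A = np.zeros((size, size), dtype=np.float32)

    # Joint edges (black in Figure 1): adjacent joints within each subject.
    for m in range(num_subjects):
        off = m * num_joints
        for i, j in intra_edges:
            A[off + i, off + j] = A[off + j, off + i] = 1.0

    # Subject edges (red in Figure 1): joints of the same kind across
    # subjects, which embed the interaction information into the graph.
    for m1 in range(num_subjects):
        for m2 in range(m1 + 1, num_subjects):
            for j in range(num_joints):
                a, b = m1 * num_joints + j, m2 * num_joints + j
                A[a, b] = A[b, a] = 1.0

    # Self-loops plus symmetric normalization, as is common for GCNs.
    A += np.eye(size, dtype=np.float32)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

For the SBU skeleton (2 subjects, 15 joints each), this yields a 30 × 30 normalized adjacency that a standard spatial graph convolution can consume directly, which is how the IE-Graph lets single-subject methods handle interactions.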
3.3. Joint Attention Module
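The module is described in the Figure 2 caption. A minimal PyTorch sketch, assuming a (batch, channels, frames, joints) feature layout, is given below; the temporal mean pooling and the single-output linear layer are our simplifications of the unspecified pooling and projection.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Per-joint attention following the Figure 2 description: pool the
    spatial-temporal feature map, derive joint weights with a linear layer
    and Softmax, then rescale the original features joint by joint."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)

    def forward(self, x):                                     # x: (B, C, T, V)
        pooled = x.mean(dim=2)                                # temporal pooling -> (B, C, V)
        scores = self.fc(pooled.transpose(1, 2)).squeeze(-1)  # per-joint scores -> (B, V)
        weights = torch.softmax(scores, dim=-1)               # attention over the V joints
        return x * weights[:, None, None, :]                  # element-wise rescaling
```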
3.4. Pre-Training and Contrastive Loss
Algorithm 1 Joint contrast learning algorithm.
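The body of Algorithm 1 is not reproduced in this version of the text. Based on the pre-training description in the Figure 3 caption (two augmented views per sequence, joint features, a contrastive loss), a minimal InfoNCE-style sketch is given below; the temperature, the normalization, and the choice of negatives are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style sketch of a joint-level contrastive loss.

    z1, z2: (J, D) features of the same J joints extracted from the two
    augmented views of one sequence. The positive pair for joint i is
    (z1[i], z2[i]); all remaining joints serve as negatives.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (J, J) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over the two view directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```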
3.5. Backbone and Fine-Tuning
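The fine-tuning stage described in the Figure 3 caption (pre-trained weights as initialization, an added prediction head) can be sketched as follows; `RecognitionModel`, `feat_dim`, and the checkpoint path are illustrative placeholders, not names from the paper.

```python
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Pre-trained backbone plus a newly added prediction head for fine-tuning."""

    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes)   # the added prediction head

    def forward(self, x):
        feats = self.backbone(x)    # assumed to return a (B, feat_dim) embedding
        return self.head(feats)

# Fine-tuning: initialize the backbone from the pre-trained weights, then
# refine the whole model on the downstream interaction recognition task.
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))    # hypothetical path
# model = RecognitionModel(backbone, feat_dim=256, num_classes=11)  # e.g., NTU60 mutual actions
```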
4. Experiments
4.1. Datasets and Metrics
4.1.1. SBU Dataset
4.1.2. NTU RGB-D Dataset
4.1.3. Performance Metrics
4.2. Experimental Setup
4.3. Results and Discussion
4.3.1. SBU Dataset
4.3.2. NTU60 and NTU120 Datasets
4.3.3. Hyper-Parameters Analysis
4.4. Ablation Study
4.4.1. Pre-Training with Contrastive Learning
4.4.2. Interaction Embedding Graph
4.4.3. Joint Attention
4.4.4. Bones Information
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Computer Vision—ECCV 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297. [Google Scholar]
- Han, F.; Reily, B.; Hoff, W.; Zhang, H. Space-time representation of people based on 3D skeletal data: A review. Comput. Vis. Image Underst. 2017, 158, 85–105. [Google Scholar] [CrossRef] [Green Version]
- Perez, M.; Liu, J.; Kot, A.C. Interaction relational network for mutual action recognition. IEEE Trans. Multimed. 2021, 24, 366–376. [Google Scholar] [CrossRef]
- Yang, C.L.; Setyoko, A.; Tampubolon, H.; Hua, K.L. Pairwise adjacency matrix on spatial temporal graph convolution network for skeleton-based two-person interaction recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 2166–2170. [Google Scholar]
- Nguyen, X.S. GeomNet: A neural network based on Riemannian geometries of SPD matrix space and Cholesky space for 3D skeleton-based interaction recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13379–13389. [Google Scholar]
- Khaire, P.; Kumar, P. Deep learning and RGB-D based human action, human–human and human–object interaction recognition: A survey. J. Vis. Commun. Image Represent. 2022, 86, 103531. [Google Scholar] [CrossRef]
- Gao, F.; Xia, H.; Tang, Z. Attention Interactive Graph Convolutional Network for Skeleton-Based Human Interaction Recognition. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Kasprzak, W.; Piwowarski, P.; Do, V.K. A lightweight approach to two-person interaction classification in sparse image sequences. In Proceedings of the 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), Sofia, Bulgaria, 4–7 September 2022; pp. 181–190. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning representations by maximizing mutual information across views. Adv. Neural Inf. Process. Syst. 2019, 32, 15535–15545. [Google Scholar]
- Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 776–794. [Google Scholar]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
- Misra, I.; van der Maaten, L. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6707–6717. [Google Scholar]
- Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
- Liu, Y.; Zhang, H.; Xu, D.; He, K. Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl.-Based Syst. 2022, 240, 108146. [Google Scholar] [CrossRef]
- Ji, Y.; Cheng, H.; Zheng, Y.; Li, H. Learning contrastive feature distribution model for interaction recognition. J. Vis. Commun. Image Represent. 2015, 33, 340–349. [Google Scholar] [CrossRef]
- Yun, K.; Honorio, J.; Chattopadhyay, D.; Berg, T.L.; Samaras, D. Two-person interaction detection using body-pose features and multiple instance learning. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 28–35. [Google Scholar]
- Ouyed, O.; Allili, M.S. Group-of-features relevance in multinomial kernel logistic regression and application to human interaction recognition. Expert Syst. Appl. 2020, 148, 113247. [Google Scholar] [CrossRef]
- Ji, Y.; Ye, G.; Cheng, H. Interactive body part contrast mining for human interaction recognition. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
- Liu, B.; Ju, Z.; Liu, H. A structured multi-feature representation for recognizing human action and interaction. Neurocomputing 2018, 318, 287–296. [Google Scholar] [CrossRef] [Green Version]
- Li, M.; Leung, H. Multiview skeletal interaction recognition using active joint interaction graph. IEEE Trans. Multimed. 2016, 18, 2293–2302. [Google Scholar] [CrossRef]
- Ito, Y.; Morita, K.; Kong, Q.; Yoshinaga, T. Multi-Stream Adaptive Graph Convolutional Network Using Inter-and Intra-Body Graphs for Two-Person Interaction Recognition. IEEE Access 2021, 9, 110670–110682. [Google Scholar] [CrossRef]
- Pang, Y.; Ke, Q.; Rahmani, H.; Bailey, J.; Liu, J. IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 605–622. [Google Scholar]
- Jia, X.; Zhang, J.; Wang, Z.; Luo, Y.; Chen, F.; Xiao, J. JointContrast: Skeleton-Based Mutual Action Recognition with Contrastive Learning. In Proceedings of the PRICAI 2022: Trends in Artificial Intelligence: 19th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2022, Shanghai, China, 10–13 November 2022; Part III. pp. 478–489. [Google Scholar]
- Chiu, S.Y.; Wu, K.R.; Tseng, Y.C. Two-Person Mutual Action Recognition Using Joint Dynamics and Coordinate Transformation. In Proceedings of the CAIP 2021: The 1st International Conference on AI for People: Towards Sustainable AI, CAIP 2021, Bologna, Italy, 20–24 November 2021; p. 56. [Google Scholar]
- Yang, H.; Yan, D.; Zhang, L.; Sun, Y.; Li, D.; Maybank, S.J. Feedback graph convolutional network for skeleton-based action recognition. IEEE Trans. Image Process. 2021, 31, 164–175. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
- Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G.E. Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 2020, 33, 22243–22255. [Google Scholar]
- Singh, A.; Chakraborty, O.; Varshney, A.; Panda, R.; Feris, R.; Saenko, K.; Das, A. Semi-supervised action recognition with temporal contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10389–10399. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Qian, R.; Meng, T.; Gong, B.; Yang, M.H.; Wang, H.; Belongie, S.; Cui, Y. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6964–6974. [Google Scholar]
- Zamini, M.; Reza, H.; Rabiei, M. A Review of Knowledge Graph Completion. Information 2022, 13, 396. [Google Scholar] [CrossRef]
- Guo, L.; Wang, W.; Sun, Z.; Liu, C.; Hu, W. Decentralized knowledge graph representation learning. arXiv 2020, arXiv:2010.08114. [Google Scholar]
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [Green Version]
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
- Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871. [Google Scholar]
- Cho, S.; Maqbool, M.; Liu, F.; Foroosh, H. Self-attention network for skeleton-based human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 635–644. [Google Scholar]
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Proceedings of the International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; pp. 694–701. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
Dataset | T | M | N | ||
---|---|---|---|---|---|
SBU | 40 | 2 | 15 | 2 | 1 |
NTU60 & NTU120 | 300 | 2 | 25 | 2 | 1 |
Methods | acc |
---|---|
CFDM [21] | 89.4% |
ST-LSTM [1] | 93.3% |
Co-occurrence LSTM [43] | 90.4% |
VA-LSTM [44] | 97.2% |
2sGCA-LSTM [45] | 94.9% |
SGCConv [46] | 94.0% |
LSTM-IRN [4] | 98.2% |
JointContrast (ours) | 98.2% |
Methods | NTU60 CS acc (%) | NTU60 CV acc (%) | NTU120 CS acc (%) | NTU120 CV acc (%)
---|---|---|---|---
ST-LSTM [1] | 83.0 | 87.3 | 63.0 | 66.6 |
LSTM-IRN [4] | 90.5 | 93.5 | 77.7 | 79.6 |
AGC-LSTM [39] | 89.2 | 95.0 | 73.0 | 73.3 |
SAN [47] | 88.2 | 93.5 | - | - |
VACNN [48] | 88.9 | 94.7 | - | - |
ST-GCN [17] | 83.3 | 88.7 | - | - |
AS-GCN [18] | 87.6 | 95.2 | 82.9 | 83.7 |
ST-TR [49] | 90.8 | 96.5 | 85.7 | 87.1 |
2sshift-GCN [50] | 90.3 | 96.0 | 86.1 | 86.7 |
MS-G3D [51] | 91.7 | 96.1 | - | - |
2sKA-AGTN [20] | 90.4 | 96.1 | 86.7 | 88.2 |
JointContrast (ours) | 94.1 | 96.8 | 88.2 | 88.9 |
Methods | K1 | K2 | CS acc (%) |
---|---|---|---|
(a) | 2 | 1 | 93.3 |
(a) | 2 | 2 | 93.8 |
(b) | 3 | 1 | 94.1 |
(c) | 3 | 2 | 94.1 |
Methods | CS acc (%) | CV acc (%) | IE-Graph | Parameters (M)
---|---|---|---|---|
ST-GCN | 83.3 | 88.7 | w/o | 1.2 |
ST-GCN | 88.5 | 92.1 | w/ | 1.3 |
GC-LSTM | 89.2 | 95.0 | w/o | 1.17 |
GC-LSTM | 93.7 | 96.1 | w/ | 1.28 |
Methods | CS acc (%) | Decrease vs. Previous Row (%)
---|---|---|
GLIA(3-A) | 93.7 | - |
GLIA(2-A) | 93.1 | 0.6 |
GLIA(1-A) | 91.7 | 1.4 |
GLI(non-A) | 89.0 | 2.7 |
Methods | Joint | Bone | CS acc (%) |
---|---|---|---|
JointContrast (joints) | yes | - | 93.3 |
JointContrast (bones) | - | yes | 92.8 |
JointContrast (both) | yes | yes | 94.7 |
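The joint/bone ablation above follows the common two-stream practice of pairing raw joint coordinates with bone vectors. A minimal sketch of the usual bone construction (each joint minus its parent joint) is shown below; the paper's exact bone definition is not spelled out in this text, so treat this as an assumption.

```python
import numpy as np

def joints_to_bones(joints, parents):
    """Bone features as vectors from each joint's parent to the joint itself.

    joints : (T, V, 3) array of 3D joint coordinates for one subject.
    parents: length-V list with the parent index of every joint (the root
             may point to itself, giving a zero bone vector).
    """
    return joints - joints[:, parents, :]
```

In two-stream setups the joint and bone inputs are typically processed by separate streams whose scores are fused, which matches the "both" row of the table.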
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, J.; Jia, X.; Wang, Z.; Luo, Y.; Chen, F.; Yang, G.; Zhao, L. JointContrast: Skeleton-Based Interaction Recognition with New Representation and Contrastive Learning. Algorithms 2023, 16, 190. https://doi.org/10.3390/a16040190