Abstract
Human-centric content-based video retrieval has emerged as a prominent research area owing to its diverse applications. The task nevertheless poses several inherent challenges, including end-to-end image classification and data sampling. Although self-supervised learning methods have made significant progress on these challenges, open issues remain. One major concern is the generation of randomly sampled inverse-complementary pairs, which must be handled carefully to avoid false positives. Moreover, the common assumption that similarity between video clips is purely temporal neglects other factors, such as motion. To address these issues, this paper proposes a clustering-based multi-featured self-supervised learning model, CMS2L. The model constrains intra-class positive sampling so that looping clusters cannot introduce false labels during stage-wise training, and it employs a second stream with an expanded range of features to obtain a more comprehensive representation of actions. Experimental results on benchmark datasets demonstrate the superiority of the proposed model.
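The idea of restricting positive sampling to a clip's own cluster, so that random pairing cannot produce false positives, can be illustrated with a minimal sketch. This is not the paper's CMS2L implementation; the function name, inputs, and the use of NumPy here are illustrative assumptions.

```python
import numpy as np

def sample_positive_pairs(embeddings, cluster_ids, rng=None):
    """Illustrative cluster-constrained positive sampling (not the
    authors' exact method): each clip is paired only with another
    member of its OWN cluster, never with itself, so random pairing
    cannot cross cluster boundaries and yield false positives."""
    rng = np.random.default_rng(rng)
    pairs = []
    for i, c in enumerate(cluster_ids):
        # candidate positives: same-cluster clips, excluding the anchor
        candidates = np.flatnonzero(cluster_ids == c)
        candidates = candidates[candidates != i]
        if candidates.size:  # singleton clusters produce no positive pair
            pairs.append((i, int(rng.choice(candidates))))
    return pairs
```

In this sketch a clip in a singleton cluster simply contributes no positive pair, which is one simple way to avoid forcing a false label when the cluster assignment gives no reliable partner.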
Data Availability and Access
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. All the datasets used are publicly available.
Acknowledgements
This research was supported by the National Science Foundation of China (Nos. 62176221, 62276215, 62276216), Sichuan Science and Technology Program (No. MZGC20230073) and the Fundamental Research Funds for the Central Universities (No. 2682023ZT007).
Author information
Contributions
Muhammad Hafeez Javed: Conceptualization, Methodology, Software, Validation, Visualization, Writing - Original Draft, Resources. Zeng Yu: Methodology, Revising and Original Draft Preparation, Resources. Taha M. Rajeh: Writing - Review & Editing, Modeling, Supervision, Resources. Fahad Rafique: Writing - Review & Editing, Conducting Experiments. Tianrui Li: Supervision, Conceptualization, Methodology, Software, Writing - Review & Editing, Resources.
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and Informed Consent for Data Used
The manuscript has not been submitted to more than one journal for simultaneous consideration. The submitted work is original and has not been published elsewhere in any form or language (partially or in full). A single study has not been split into several parts to increase the number of submissions, whether to several journals or to one journal over time (i.e., "salami slicing/publishing"). Concurrent or secondary publication is sometimes justifiable, provided certain conditions are met; examples include translations or a manuscript intended for a different group of readers. Results are presented clearly, honestly, and without fabrication, falsification, or inappropriate data manipulation (including image-based manipulation). No data, text, or theories by others are presented as if they were the authors' own ("plagiarism").
Informed consent for data used
We have ensured that permissions were obtained for the use of software, questionnaires/(web) surveys, and scales in our studies (where appropriate). Research and non-research articles (e.g., Opinion, Review, and Commentary articles) cite appropriate and relevant literature in support of the claims made, without excessive or inappropriate self-citation or coordinated efforts among several authors to collectively self-cite. We have avoided untrue statements about any entity (whether an individual person or a company) and descriptions of their behavior or actions that could be seen as personal attacks or allegations. The research does not contain anything that could be misapplied to pose a threat to public health or national security. We have ensured that the author group, the Corresponding Author, and the order of authors were all correct at submission.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Javed, M.H., Yu, Z., Rajeh, T.M. et al. Clustering-based multi-featured self-supervised learning for human activities and video retrieval. Appl Intell 54, 6198–6212 (2024). https://doi.org/10.1007/s10489-024-05460-8