Abstract
Human-centric content-based video retrieval has emerged as a prominent research area owing to its diverse applications. The task nevertheless poses several inherent challenges, including end-to-end image classification and data sampling. Although self-supervised learning methods have made significant progress on these challenges, open issues remain. One major concern is the generation of randomly sampled inverse-complementary pairs, which must be handled carefully to avoid false positives. Moreover, the common assumption that similarity between video clips is purely temporal neglects other factors, such as motion. To address these issues, this paper proposes a clustering-based multi-featured self-supervised learning model, CMS2L. The model constrains intra-class positive sampling so that looping clusters cannot introduce false labels during stage-wise training, and it employs a second stream with an expanded range of features to obtain a more comprehensive representation of actions. Experimental results on benchmark datasets demonstrate the superiority of the proposed model.
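The idea of restricting positive sampling to a clip's own cluster, so that random pairing cannot produce false positives, can be illustrated with a minimal sketch. This is not the paper's CMS2L implementation; the function name, inputs, and the use of NumPy here are illustrative assumptions.

```python
import numpy as np

def sample_positive_pairs(embeddings, cluster_ids, rng=None):
    """Illustrative cluster-constrained positive sampling (not the
    authors' exact method): each clip is paired only with another
    member of its OWN cluster, never with itself, so random pairing
    cannot cross cluster boundaries and yield false positives."""
    rng = np.random.default_rng(rng)
    pairs = []
    for i, c in enumerate(cluster_ids):
        # candidate positives: same-cluster clips, excluding the anchor
        candidates = np.flatnonzero(cluster_ids == c)
        candidates = candidates[candidates != i]
        if candidates.size:  # singleton clusters produce no positive pair
            pairs.append((i, int(rng.choice(candidates))))
    return pairs
```

In this sketch a clip in a singleton cluster simply contributes no positive pair, which is one simple way to avoid forcing a false label when the cluster assignment gives no reliable partner.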
Data Availability and Access
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. All the datasets used are publicly available.
Acknowledgements
This research was supported by the National Science Foundation of China (Nos. 62176221, 62276215, 62276216), Sichuan Science and Technology Program (No. MZGC20230073) and the Fundamental Research Funds for the Central Universities (No. 2682023ZT007).
Author information
Contributions
Muhammad Hafeez Javed: Conceptualization, Methodology, Software, Validation, Visualization, Writing - Original Draft, Resources. Zeng Yu: Methodology, Revising and Original Draft Preparation, Resources. Taha M. Rajeh: Writing - Review & Editing, Modeling, Supervision, Resources. Fahad Rafique: Writing - Review & Editing, Conducting Experiments. Tianrui Li: Supervision, Conceptualization, Methodology, Software, Writing - Review & Editing, Resources.
Ethics declarations
Conflicts of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and Informed Consent for Data Used
The manuscript has not been submitted to more than one journal for simultaneous consideration. The submitted work is original and has not been published elsewhere in any form or language (partially or in full). A single study has not been split into several parts to increase the number of submissions, whether to several journals or to one journal over time (i.e., "salami slicing/publishing"). Concurrent or secondary publication is sometimes justifiable, provided certain conditions are met; examples include translations or a manuscript intended for a different group of readers. Results are presented clearly, honestly, and without fabrication, falsification, or inappropriate data manipulation (including image-based manipulation). No data, text, or theories by others are presented as if they were the authors' own ("plagiarism").
Informed consent for data used
We have ensured that permissions were obtained for the use of software, questionnaires/(web) surveys, and scales in our studies (where appropriate). Research and non-research articles (e.g., Opinion, Review, and Commentary articles) cite appropriate and relevant literature in support of the claims made, without excessive or inappropriate self-citation or coordinated efforts among several authors to collectively self-cite. We have avoided untrue statements about any entity (whether an individual person or a company) and descriptions of their behavior or actions that could be seen as personal attacks or allegations. The research does not contain anything that could be misapplied to pose a threat to public health or national security. We have ensured that the author group, the Corresponding Author, and the order of authors were all correct at submission.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Javed, M.H., Yu, Z., Rajeh, T.M. et al. Clustering-based multi-featured self-supervised learning for human activities and video retrieval. Appl Intell 54, 6198–6212 (2024). https://doi.org/10.1007/s10489-024-05460-8