Clustering-based multi-featured self-supervised learning for human activities and video retrieval

Published in: Applied Intelligence

Abstract

Human-centric content-based video retrieval has emerged as a prominent research area owing to its diverse applications. The task nevertheless presents several inherent challenges, including end-to-end image classification and data sampling. Although self-supervised learning methods have made significant progress on these challenges, some issues remain. One major concern is the generation of randomly sampled inverse-complementary pairs, which must be handled carefully to avoid false positives. Moreover, the common assumption that similarity between video clips is purely temporal neglects other factors, such as motion. To address these issues, this paper proposes a clustering-based multi-featured self-supervised learning model called CMS2L. The model introduces a fundamental improvement by fixing intra-class positive sampling, thereby avoiding the false labeling that looping clusters can cause during stage-wise training. It also employs a second stream with an expanded range of features to obtain a more comprehensive representation of actions. Experimental results on benchmark datasets demonstrate the superiority of the proposed model.
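The clustering-driven positive sampling described in the abstract can be sketched roughly as follows. This is a minimal, illustrative toy example, not the paper's implementation: the clustering method (plain k-means here), the embedding shapes, and all function names are assumptions. The key property it demonstrates is that drawing positives only from within an anchor's own cluster rules out the false positives that purely random pairing can produce.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Simple k-means over clip embeddings (an illustrative stand-in for
    # the clustering stage; the paper's exact method may differ).
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels

def intra_cluster_positives(labels, rng=None):
    # For each anchor clip, draw a positive from the SAME cluster
    # (never the anchor itself), so random sampling cannot pair
    # clips from different clusters as a false positive.
    rng = rng or np.random.default_rng(0)
    pairs = []
    for i, c in enumerate(labels):
        candidates = np.flatnonzero(labels == c)
        candidates = candidates[candidates != i]
        if len(candidates):
            pairs.append((i, int(rng.choice(candidates))))
    return pairs

# Toy embeddings: two well-separated groups of 8 "clips" each
emb = np.vstack([np.random.default_rng(1).normal(0, 0.1, (8, 4)),
                 np.random.default_rng(2).normal(5, 0.1, (8, 4))])
labels = kmeans(emb, k=2)
pairs = intra_cluster_positives(labels)
# Every sampled positive shares its anchor's cluster by construction
assert all(labels[i] == labels[j] for i, j in pairs)
```

The resulting pairs would then feed a contrastive objective, with the remaining clips serving as negatives; the design choice is that cluster membership, rather than random temporal proximity alone, decides what counts as a positive.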


The full text includes Figs. 1–6 and Algorithm 1.


Data Availability and Access

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. All datasets used are publicly available.


Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 62176221, 62276215, 62276216), the Sichuan Science and Technology Program (No. MZGC20230073), and the Fundamental Research Funds for the Central Universities (No. 2682023ZT007).

Author information

Authors and Affiliations

Authors

Contributions

Muhammad Hafeez Javed: Conceptualization, Methodology, Software, Validation, Visualization, Writing - Original draft preparation, Resources. Zeng Yu: Methodology, Revising and Original draft preparation, Resources. Taha M. Rajeh: Writing - Reviewing & Editing, Modeling, Supervision, Resources. Fahad Rafique: Writing - Reviewing & Editing, Conducting Experiments. Tianrui Li: Supervision, Conceptualization, Methodology, Software, Writing - Reviewing & Editing, Resources.

Corresponding author

Correspondence to Tianrui Li.

Ethics declarations

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and Informed Consent for Data Used

The manuscript has not been submitted to more than one journal for simultaneous consideration. The submitted work is original and has not been published elsewhere in any form or language (partially or in full). A single study has not been split into several parts to increase the quantity of submissions and submitted to various journals, or to one journal over time (i.e. "salami slicing"/publishing). Concurrent or secondary publication is sometimes justifiable, provided certain conditions are met; examples include translations or a manuscript intended for a different group of readers. Results are presented clearly, honestly, and without fabrication, falsification or inappropriate data manipulation (including image-based manipulation). No data, text, or theories by others are presented as if they were the authors' own ("plagiarism").

Informed consent for data used

We have ensured that permissions were obtained for the use of software, questionnaires/(web) surveys and scales in our studies (where appropriate). Research articles and non-research articles (e.g. Opinion, Review, and Commentary articles) cite appropriate and relevant literature in support of the claims made. There is no excessive or inappropriate self-citation, nor any coordinated effort among several authors to collectively self-cite. We have avoided untrue statements about any entity (whether an individual person or a company), and any descriptions of their behavior or actions that could be seen as personal attacks or allegations. The research does not contain anything that may be misapplied to pose a threat to public health or national security. We have ensured that the author group, the Corresponding Author, and the order of authors are all correct at submission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Javed, M.H., Yu, Z., Rajeh, T.M. et al. Clustering-based multi-featured self-supervised learning for human activities and video retrieval. Appl Intell 54, 6198–6212 (2024). https://doi.org/10.1007/s10489-024-05460-8
