
Spatial–temporal correlations learning and action-background jointed attention for weakly-supervised temporal action localization

  • Regular Paper
  • Multimedia Systems

Abstract

Weakly supervised temporal action localization (W-TAL) aims to detect and classify all action instances in an untrimmed video using only video-level labels. Because frame-level annotations are unavailable, two issues are key to accurate localization: learning the correlations between action snippets, and separating action from background. To mine the intrinsic spatial and temporal correlations embodied in the occurrences of actions in a video, and to distinguish action from background in the snippets, we propose a novel W-TAL method based on spatial–temporal correlations learning and action-background jointed attention. In this method, a graph convolution network and a 1-D temporal convolution network are constructed to learn the spatial and temporal features of the video, respectively; these features are then fused to learn a rich spatial–temporal correlative feature map, which gives a more complete feature representation for action localization. Next, unlike other methods, an action-background jointed attention mechanism is introduced to explicitly model background as well as action in a three-branch classification network. This classification network separates action from background more effectively and thereby promotes more accurate action localization. Experiments on THUMOS14 and ActivityNet1.3 show that our method outperforms state-of-the-art methods, especially at some of the higher t-IoU thresholds, which further validates its effectiveness.
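The t-IoU (temporal intersection over union) thresholds mentioned above measure how closely a predicted action segment overlaps a ground-truth segment. The following is a minimal sketch of that standard measure, assuming each segment is a (start, end) pair in seconds; the function names `temporal_iou` and `is_correct_detection` are hypothetical and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length; clamped at zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union length = sum of both lengths minus the shared part.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold):
    """A prediction counts as a hit when its t-IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# A prediction of 2-6 s against a ground truth of 4-8 s overlaps for 2 s
# out of a 6 s union, so it passes a 0.3 threshold but fails a 0.5 one.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
print(is_correct_detection((2.0, 6.0), (4.0, 8.0), 0.3))
print(is_correct_detection((2.0, 6.0), (4.0, 8.0), 0.5))
```

A higher threshold demands tighter segment boundaries, which is why results at high t-IoU thresholds are commonly read as a test of how precisely a method separates action from background.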



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61672268).

Author information

Corresponding author

Correspondence to Yongzhao Zhan.

Additional information

Communicated by B-K Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xia, H., Zhan, Y. & Cheng, K. Spatial–temporal correlations learning and action-background jointed attention for weakly-supervised temporal action localization. Multimedia Systems 28, 1529–1541 (2022). https://doi.org/10.1007/s00530-022-00912-y

  • DOI: https://doi.org/10.1007/s00530-022-00912-y
