DOI: 10.1145/3595916.3626423 · Research article

Feature Enhancement and Foreground-Background Separation for Weakly Supervised Temporal Action Localization

Published: 01 January 2024

Abstract

Weakly-supervised Temporal Action Localization (W-TAL) is a significant task in video understanding: it aims to recognize the category and pinpoint the temporal boundaries of action segments in untrimmed videos using only video-level labels. Because frame-level annotations are unavailable, recognizing the spatiotemporal relationships among action snippets and precisely separating the foreground from the background are two arduous challenges. In this paper, we propose a novel Feature Enhancement and Foreground-Background Separation (FE-FBS) method to address these problems. Specifically, we construct a spatiotemporal cooperation enhancement scheme that uses residual graph convolutions to capture the spatial and temporal dependencies between the current snippet and all others, generating snippet features with spatial and temporal correlations and thereby ensuring a complete feature representation for action localization. Furthermore, we propose a two-branch explicit foreground-background joint attention mechanism to guide foreground and background modeling, combined with an inverse enhancement strategy that strengthens action weights for a sharper foreground-background distinction, thereby improving the accuracy of action localization. Our method achieves accuracy rates of 66.7% on THUMOS'14 and 38.9% on ActivityNet v1.3, outperforming competing methods.
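The abstract does not include code, so the following is only a rough, hypothetical NumPy sketch of the two ideas it describes: a residual graph convolution over snippet features (with a similarity-based adjacency) and a two-branch foreground/background attention. All names (`residual_gcn`, `fg_bg_attention`) and the randomly initialized weights are invented for illustration and are not the authors' implementation.

```python
import numpy as np

def residual_gcn(H, W):
    """One residual graph-convolution pass over snippet features.

    H: (T, D) features for T video snippets; W: (D, D) projection.
    The adjacency is cosine similarity between snippets, softmax-normalized
    per row, so each snippet aggregates context from all others."""
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-8)
    A = Hn @ Hn.T                                          # (T, T) similarity
    A = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # row-wise softmax
    return H + np.maximum(A @ H @ W, 0.0)                  # residual + ReLU

def fg_bg_attention(H, w_f, w_b):
    """Two-branch attention: per-snippet foreground and background scores."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return sigmoid(H @ w_f), sigmoid(H @ w_b)

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
T, D = 6, 4
H = rng.standard_normal((T, D))
W = rng.standard_normal((D, D)) * 0.1
H_enh = residual_gcn(H, W)                       # context-enhanced features
a_f, a_b = fg_bg_attention(H_enh,
                           rng.standard_normal(D),
                           rng.standard_normal(D))
```

In this reading, `H_enh` plays the role of the spatiotemporally enhanced snippet features, and `a_f`/`a_b` are the explicit foreground and background attention weights that would drive the separation and the inverse enhancement step.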


Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
December 2023
745 pages
ISBN:9798400702051
DOI:10.1145/3595916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher: Association for Computing Machinery, New York, NY, United States



Author Tags

  1. Explicit attention modeling
  2. Inverse enhancement strategy
  3. Spatiotemporal cooperation enhancement
  4. Temporal action localization
  5. Weakly-supervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the National Natural Science Foundation of China

Conference

MMAsia '23: ACM Multimedia Asia
December 6–8, 2023, Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

