

Multiple Temporal Pooling Mechanisms for Weakly Supervised Temporal Action Localization

Published: 25 February 2023

Abstract

Recent temporal action localization methods learn in a weakly supervised manner to avoid the high cost of human labeling. Most are built on the Multiple Instance Learning (MIL) framework, in which temporal pooling is an indispensable component that typically relies on the guidance of snippet-level Class Activation Sequences (CAS). However, we observe that previous works generate the CAS with only a simple convolutional neural network, which fails to separate weakly discriminative foreground action segments from background ones and also ignores the relationships between different actions. To address this problem, we propose multiple temporal pooling mechanisms (MTP) that exploit the available information more fully. Specifically, through the design of a Foreground Variance Branch, a Dual Foreground Attention Branch, and a Hybrid Attention Fine-tuning Branch, MTP leverages complementary information from different aspects and generates distinct CASs to guide the learning of temporal pooling. Moreover, dedicated loss functions are designed to better optimize each branch, aiming to effectively distinguish actions from the background. Our method achieves excellent results on the THUMOS14 and ActivityNet1.2 datasets.
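As a rough illustration of the MIL setting the abstract describes (not the paper's exact formulation), CAS-guided temporal pooling is commonly realized as top-k pooling: per class, the video-level score is the average of the k highest-scoring snippets in the CAS. The function name and the `k_ratio` value below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def topk_temporal_pooling(cas, k_ratio=0.125):
    """Aggregate a snippet-level Class Activation Sequence (CAS) of shape
    (T, C) into video-level class probabilities by averaging, per class,
    the k highest-scoring snippets, with k = max(1, int(T * k_ratio))."""
    T, C = cas.shape
    k = max(1, int(T * k_ratio))
    # Sort snippet scores per class (ascending), reverse to descending,
    # and keep the top-k rows for each class column.
    topk = np.sort(cas, axis=0)[::-1][:k]    # shape (k, C)
    video_scores = topk.mean(axis=0)         # shape (C,)
    # Softmax over classes yields video-level probabilities, which the
    # MIL classification loss compares against the video-level label.
    e = np.exp(video_scores - video_scores.max())
    return e / e.sum()

# Toy example: a CAS with 8 snippets and 3 action classes.
cas = np.random.rand(8, 3)
probs = topk_temporal_pooling(cas)
```

Only video-level labels supervise `probs`; the snippet-level CAS is learned implicitly, which is why the quality of the CAS (the focus of MTP's multiple branches) directly determines localization accuracy.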


Cited By

  • (2024) Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6 (2024), 1–21. DOI: 10.1145/3643815. Online publication date: 8 March 2024.
  • (2024) Snippet-to-Prototype Contrastive Consensus Network for Weakly Supervised Temporal Action Localization. IEEE Transactions on Multimedia 26 (2024), 6717–6729. DOI: 10.1109/TMM.2024.3355628.



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
    May 2023
    514 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3582886
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 February 2023
    Online AM: 13 October 2022
    Accepted: 03 October 2022
    Revised: 16 August 2022
    Received: 27 April 2022
    Published in TOMM Volume 19, Issue 3


    Author Tags

    1. Weakly supervised temporal action localization
    2. multiple instance learning
    3. temporal pooling

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

