ConvTransformer Attention Network for temporal action detection

Published: 18 November 2024

Abstract

Boundary detection is a challenging problem in Temporal Action Detection (TAD). While transformer-based methods achieve satisfactory results by incorporating self-attention to model global dependencies for boundary detection, they face two key issues. First, they lack explicit learning of local relationships, which leads to imprecise boundaries when only subtle appearance changes occur between adjacent clips. Second, because self-attention tends to spread its focus across the entire input video, the features of multiple actions converge, producing imprecise, overlapping action predictions. To address these challenges, we introduce the ConvTransformer Attention Network (CTAN), a novel framework composed of two primary components: (1) the Temporal Attention Block (TAB), a temporal attention mechanism that emphasizes critical temporal positions enriched with essential action-related features, and (2) the ConvTransformer Block (CTB), a hybrid structure that captures nuanced appearance changes locally and action transitions globally. Equipped with these components, CTAN focuses on motion features between overlapping actions and precisely captures both local differences between adjacent clips and global action transitions. Extensive experiments on multiple datasets, including THUMOS14, MultiTHUMOS, and ActivityNet, confirm the effectiveness of CTAN.
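To make the two components concrete, here is a minimal, hypothetical sketch in PyTorch of the ideas described above: a temporal attention gate that re-weights clip positions (in the spirit of the TAB) and a hybrid block that pairs a local 1-D convolution with global self-attention (in the spirit of the CTB). The module names, layer sizes, and overall structure are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the TAB/CTB ideas; names and sizes are illustrative.
import torch
import torch.nn as nn


class TemporalAttentionGate(nn.Module):
    """Re-weights temporal positions so action-relevant clips dominate (TAB-like)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); gate: (batch, time, 1) with values in [0, 1]
        gate = torch.sigmoid(self.score(x))
        return x * gate


class ConvTransformerBlock(nn.Module):
    """Local depthwise conv branch plus global self-attention branch (CTB-like)."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        self.local = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local branch: subtle appearance changes between adjacent clips.
        local = self.local(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + local)
        # Global branch: action transitions across the whole video.
        glob, _ = self.attn(x, x, x, need_weights=False)
        return self.norm2(x + glob)


if __name__ == "__main__":
    feats = torch.randn(2, 128, 256)           # (batch, clips, feature dim)
    feats = TemporalAttentionGate(256)(feats)   # emphasize key temporal positions
    out = ConvTransformerBlock(256)(feats)      # local + global modeling
    print(out.shape)                            # torch.Size([2, 128, 256])
```

In this sketch the depthwise convolution supplies the explicit local modeling that plain self-attention lacks, while the attention branch keeps the global view of action transitions; a full detector would typically stack such blocks and add classification and boundary-regression heads, which the sketch omits.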

Published In

Knowledge-Based Systems, Volume 300, Issue C
Sep 2024
1714 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Temporal action detection
  2. Transformers
  3. Hybrid structure
  4. Temporal attention

Qualifiers

  • Research-article
