DOI: 10.1145/3581783.3612097

HAAN: Human Action Aware Network for Multi-label Temporal Action Detection

Published: 27 October 2023

Abstract

The task of multi-label temporal action detection aims to accurately detect dense action instances in untrimmed videos. Previous methods, which focus on modeling the appearance features of RGB frames, struggle to capture the fine details and subtle variations of human actions, leading to three critical issues: overlapping action confusion, intra-class appearance diversity, and background interference. These issues significantly undermine the accuracy and generalization of detection models. To tackle them, we propose incorporating the human skeleton into the feature design of the detection model. By utilizing multi-person skeletons, our method can accurately represent the various human actions in a scene, balance the salience of overlapping actions, and reduce the impact of changes in human appearance and background interference on action features. Concretely, we propose a novel two-stream human action aware network (HAAN) for multi-label temporal action detection based on the original RGB frames and the estimated skeleton frames. To leverage the complementary advantages of RGB and skeleton features, we design a cross-modality fusion module that allows the two features to guide each other and enhance their representation of human actions. On the popular MultiTHUMOS and Charades benchmarks, HAAN achieves state-of-the-art performance with 56.9% (+5.4%) and 32.1% (+3.3%) mean average precision (mAP), respectively, compared to the best available methods. Importantly, HAAN shows further improvements of +6.83%, +22.35%, and +2.56% on the challenging sample subsets corresponding to the three critical issues.
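The cross-modality fusion module described above can be pictured as a pair of cross-attention blocks in which each stream queries the other. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the class names, feature dimensions, the choice of multi-head attention as the guidance mechanism, and the concatenation-based fusion are all illustrative assumptions.

```python
# Minimal sketch of a two-stream detector with cross-modality fusion.
# Assumptions (not from the paper): cross-attention as the "guide each
# other" mechanism, 256-d features, concatenation as the final fusion.
import torch
import torch.nn as nn


class CrossModalityFusion(nn.Module):
    """Lets RGB and skeleton features attend to each other, then mixes them."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # RGB queries attend to skeleton keys/values, and vice versa.
        self.rgb_from_skel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.skel_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_skel = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, skel: torch.Tensor) -> torch.Tensor:
        # rgb, skel: (batch, time, dim) per-frame feature sequences.
        rgb_enh, _ = self.rgb_from_skel(rgb, skel, skel)   # RGB guided by skeleton
        skel_enh, _ = self.skel_from_rgb(skel, rgb, rgb)   # skeleton guided by RGB
        rgb = self.norm_rgb(rgb + rgb_enh)                 # residual + layer norm
        skel = self.norm_skel(skel + skel_enh)
        return torch.cat([rgb, skel], dim=-1)              # fused representation


class TwoStreamDetector(nn.Module):
    """Per-frame multi-label classification head over the fused features."""

    def __init__(self, dim: int = 256, num_classes: int = 65):
        super().__init__()
        self.fusion = CrossModalityFusion(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_feats, skel_feats):
        fused = self.fusion(rgb_feats, skel_feats)
        return self.classifier(fused)  # (batch, time, num_classes) logits


# Usage with dummy features standing in for RGB- and pose-backbone outputs.
rgb_feats = torch.randn(2, 100, 256)   # (batch, frames, dim)
skel_feats = torch.randn(2, 100, 256)
logits = TwoStreamDetector()(rgb_feats, skel_feats)
print(logits.shape)  # torch.Size([2, 100, 65]); MultiTHUMOS has 65 classes
```

Concatenating the two enhanced streams keeps both modalities visible to the multi-label classifier; whatever combination HAAN actually uses, the key property sketched here is that each stream's representation is refined by the other before classification.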


Cited By

  • (2024) Dual DETRs for Multi-Label Temporal Action Detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18559-18569. DOI: 10.1109/CVPR52733.2024.01756. Online publication date: 16-Jun-2024.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. modality fusion
    2. multi-label temporal action detection
    3. RGB
    4. skeleton
    5. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
