DOI: 10.1145/3581783.3612097

HAAN: Human Action Aware Network for Multi-label Temporal Action Detection

Published: 27 October 2023

Abstract

The task of multi-label temporal action detection aims to accurately detect dense action instances in untrimmed videos. Previous methods, which focus on modeling the appearance features of RGB frames, struggle to capture the fine details and subtle variations of human actions, leading to three critical issues: overlapping action confusion, intra-class appearance diversity, and background interference. These issues significantly undermine the accuracy and generalization of detection models. To tackle them, we propose incorporating the human skeleton into the feature design of the detection model. By utilizing multi-person skeletons, our method can accurately represent the various human actions in a scene, balance the salience of overlapping actions, and reduce the impact of changes in human appearance and background interference on action features. Concretely, we propose a novel two-stream human action aware network (HAAN) for multi-label temporal action detection based on the original RGB frames and the estimated skeleton frames. To leverage the complementary advantages of RGB and skeleton features, we design a cross-modality fusion module that allows the two features to guide each other and enhance their representation of human actions. On the popular MultiTHUMOS and Charades benchmarks, HAAN achieves state-of-the-art performance with 56.9% (+5.4%) and 32.1% (+3.3%) mean average precision (mAP), respectively, compared to the best available methods. Importantly, HAAN shows further improvements of +6.83%, +22.35%, and +2.56% on the challenging sample subsets corresponding to the three critical issues.
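The cross-modality fusion module described above can be pictured as a pair of cross-attention blocks in which each stream queries the other. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: the class names, feature dimensions, the choice of multi-head attention as the guidance mechanism, and the concatenation-based fusion are all illustrative assumptions.

```python
# Minimal sketch of a two-stream detector with cross-modality fusion.
# Assumptions (not from the paper): cross-attention as the "guide each
# other" mechanism, 256-d features, concatenation as the final fusion.
import torch
import torch.nn as nn


class CrossModalityFusion(nn.Module):
    """Lets RGB and skeleton features attend to each other, then mixes them."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # RGB queries attend to skeleton keys/values, and vice versa.
        self.rgb_from_skel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.skel_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_skel = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, skel: torch.Tensor) -> torch.Tensor:
        # rgb, skel: (batch, time, dim) per-frame feature sequences.
        rgb_enh, _ = self.rgb_from_skel(rgb, skel, skel)   # RGB guided by skeleton
        skel_enh, _ = self.skel_from_rgb(skel, rgb, rgb)   # skeleton guided by RGB
        rgb = self.norm_rgb(rgb + rgb_enh)                 # residual + layer norm
        skel = self.norm_skel(skel + skel_enh)
        return torch.cat([rgb, skel], dim=-1)              # fused representation


class TwoStreamDetector(nn.Module):
    """Per-frame multi-label classification head over the fused features."""

    def __init__(self, dim: int = 256, num_classes: int = 65):
        super().__init__()
        self.fusion = CrossModalityFusion(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_feats, skel_feats):
        fused = self.fusion(rgb_feats, skel_feats)
        return self.classifier(fused)  # (batch, time, num_classes) logits


# Usage with dummy features standing in for RGB- and pose-backbone outputs.
rgb_feats = torch.randn(2, 100, 256)   # (batch, frames, dim)
skel_feats = torch.randn(2, 100, 256)
logits = TwoStreamDetector()(rgb_feats, skel_feats)
print(logits.shape)  # torch.Size([2, 100, 65]); MultiTHUMOS has 65 classes
```

Concatenating the two enhanced streams keeps both modalities visible to the multi-label classifier; whatever combination HAAN actually uses, the key property sketched here is that each stream's representation is refined by the other before classification.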


Cited By

  • (2024) Dual DETRs for Multi-Label Temporal Action Detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18559-18569. DOI: 10.1109/CVPR52733.2024.01756. Online publication date: 16-Jun-2024.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. modality fusion
    2. multi-label temporal action detection
    3. RGB
    4. skeleton
    5. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
