DOI: 10.1145/3664647.3680723

Multi-Granularity Hand Action Detection

Published: 28 October 2024

Abstract

Detecting hand actions in videos is crucial for understanding video content and has diverse real-world applications. Existing approaches often focus on whole-body actions or coarse-grained action categories, lacking fine-grained hand-action localization information. To fill this gap, we introduce the FHA-Kitchens (Fine-Grained Hand Actions in Kitchen Scenes) dataset, providing both coarse- and fine-grained hand action categories along with localization annotations. This dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories. Evaluation of existing action detection methods on FHA-Kitchens reveals varying generalization capabilities across different granularities. To handle the multiple granularities of hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. It incorporates two new designs: Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising. Extensive experiments demonstrate MG-HAD's effectiveness for multi-granularity hand action detection, highlighting the significance of FHA-Kitchens for future research and real-world applications. The dataset and source code are available at MG-HAD.
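To make the dataset description concrete, the sketch below shows one way the two-granularity annotations could be represented in Python. The actual FHA-Kitchens schema is not specified on this page, so the class names, field names, and example labels (e.g. "cut", "cut-slice-carrot") are assumptions for illustration, not the released format.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class HandActionBox:
        """One annotated hand-action instance in a single frame (hypothetical schema)."""
        bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
        coarse_action: str  # broad category, e.g. a verb such as "cut" (assumed example)
        fine_action: str    # specific category, e.g. "cut-slice-carrot" (assumed example)

    @dataclass
    class FrameAnnotation:
        """All hand-action boxes for one annotated frame of a clip."""
        clip_id: str
        frame_index: int
        boxes: List[HandActionBox]

    # Minimal usage: a detector evaluated at both granularities would be scored
    # against coarse_action and fine_action separately (e.g. per-granularity mAP).
    frame = FrameAnnotation(
        clip_id="clip_0001",
        frame_index=42,
        boxes=[HandActionBox(bbox=(120.0, 80.0, 260.0, 210.0),
                             coarse_action="cut",
                             fine_action="cut-slice-carrot")],
    )
    print(len(frame.boxes))  # -> 1

Under a layout like this, "multi-granularity" evaluation amounts to scoring the same predicted boxes twice, once per label field, which matches the paper's framing of varying generalization across granularities.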



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11,719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset
  2. hand action detection
  3. multi-granularity

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
