DOI: 10.1145/3552458.3556449

Multi-level Multi-modal Feature Fusion for Action Recognition in Videos

Published: 10 October 2022

Abstract

Several multi-modal feature fusion approaches have been proposed in recent years to improve action recognition in videos. However, these approaches do not take full advantage of the multi-modal information in videos, because they are biased towards a single modality or treat modalities separately. To address this problem, we propose Multi-Level Multi-modal feature Fusion (MLMF) for action recognition in videos. MLMF projects each modality into a shared feature space and a modality-specific feature space, and uses the similarity between the two modalities' shared features to augment the features in the specific spaces. As a result, the fused features not only incorporate the unique characteristics of the two modalities but also explicitly emphasize their similarities. Moreover, the action segments in a video differ in length, so the model needs to consider feature ensembling at different levels for fine-grained action recognition. The optimal multi-level unified action feature representation is obtained by aggregating features across levels. Our approach is evaluated on the EPIC-KITCHENS-100 dataset and achieves encouraging action recognition results.
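
The abstract gives only a high-level description of the fusion; the sketch below is one plausible reading of it, not the paper's implementation. The names (MLMFFusion, multilevel_aggregate), the dimensions, and the cosine-similarity weighting and adaptive-pooling choices are all illustrative assumptions.

```python
# Hypothetical sketch of the shared/specific fusion and multi-level
# aggregation described in the abstract (PyTorch). Names and design
# choices are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMFFusion(nn.Module):
    """Fuse two modality features via shared and specific projections."""

    def __init__(self, dim_a: int, dim_b: int, d_model: int = 512):
        super().__init__()
        # Shared space: both modalities are projected here so that their
        # similarity can be measured in a common space.
        self.shared_a = nn.Linear(dim_a, d_model)
        self.shared_b = nn.Linear(dim_b, d_model)
        # Specific spaces: one per modality, preserving unique cues.
        self.specific_a = nn.Linear(dim_a, d_model)
        self.specific_b = nn.Linear(dim_b, d_model)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        sh_a, sh_b = self.shared_a(feat_a), self.shared_b(feat_b)
        sp_a, sp_b = self.specific_a(feat_a), self.specific_b(feat_b)
        # Per-sample similarity between the two shared representations.
        sim = F.cosine_similarity(sh_a, sh_b, dim=-1).unsqueeze(-1)
        # Augment each specific feature in proportion to that similarity,
        # so the fused vector keeps modality-specific cues while explicitly
        # emphasizing what the two modalities agree on.
        return torch.cat([sp_a + sim * sh_b, sp_b + sim * sh_a], dim=-1)

def multilevel_aggregate(clip_feats: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
    """Pool a (T, D) sequence of fused clip features at several temporal
    granularities and concatenate them: one possible form of the paper's
    multi-level ensembling for variable-length action segments."""
    pooled = []
    for k in levels:
        seg = F.adaptive_avg_pool1d(clip_feats.t().unsqueeze(0), k)  # (1, D, k)
        pooled.append(seg.flatten(1).squeeze(0))                     # (D * k,)
    return torch.cat(pooled)  # (D * sum(levels),)

# Toy usage: 8 clips with 1024-dim visual and 128-dim audio features.
visual, audio = torch.randn(8, 1024), torch.randn(8, 128)
fused = MLMFFusion(1024, 128)(visual, audio)   # (8, 1024)
video_repr = multilevel_aggregate(fused)       # (1024 * 7,)
```

In this reading, each modality keeps a specific projection for its unique cues, while the similarity of the shared projections gates how strongly the other modality's shared evidence is mixed in; pooling the fused clip features at several temporal granularities stands in for the paper's multi-level ensembling.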

Supplementary Material

MP4 File (HCMA22-21.mp4)
Presentation video

Cited By

  • (2024) CMAF: Cross-Modal Augmentation via Fusion for Underwater Acoustic Image Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-25. DOI: 10.1145/3636427. Online publication date: 11-Jan-2024

Published In

HCMA '22: Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis
October 2022
106 pages
ISBN: 9781450394925
DOI: 10.1145/3552458

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. action recognition
  2. multi-level ensembling
  3. multi-modal fusion

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

HCMA '22 Paper Acceptance Rate: 12 of 21 submissions, 57%;
Overall Acceptance Rate: 12 of 21 submissions, 57%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 43
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 20 Nov 2024
