DOI: 10.1145/3474085.3475344

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Published: 17 October 2021

Abstract

Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. As a result, the video-level prediction does not account for spatio-temporal features that capture how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation to adaptively aggregate long-range temporal information across adjacent snippets. The DSA module is an efficient plug-and-play module that can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We coin the final video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1, and ActivityNet) to demonstrate its superiority. The proposed DSA module significantly benefits a variety of video recognition models. For example, equipped with DSA modules, I3D ResNet-50 improves its top-1 accuracy on Kinetics-400 from 74.9% to 78.2%. Code is available at https://github.com/whwu95/DSANet.
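To make the dynamic-kernel idea concrete, below is a minimal PyTorch sketch of one way such a module could work. This is a hypothetical illustration, not the authors' released implementation: the module name, the bottleneck kernel generator, the per-video (rather than per-channel) kernel, and the (B, U, C) snippet-feature shape are all assumptions; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicSegmentAggregation(nn.Module):
    """Hypothetical DSA-style sketch: predict one small 1-D temporal kernel
    per video from a global snippet descriptor, then convolve it over the
    snippet axis so long-range aggregation adapts to each input video."""

    def __init__(self, channels: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        # kernel_size is assumed odd so "same" padding preserves U snippets.
        self.kernel_size = kernel_size
        # Bottleneck MLP mapping the global descriptor to kernel weights.
        self.kernel_gen = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, kernel_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, U, C) -- B videos, U snippets, C channels per snippet feature.
        b, u, c = x.shape
        desc = x.mean(dim=1)                              # (B, C) global descriptor
        kernel = self.kernel_gen(desc).softmax(dim=-1)    # (B, K), one kernel per video
        # Apply each video's kernel via a grouped 1-D convolution: fold batch
        # and channels into groups so every (video, channel) sequence is
        # filtered by that video's dynamic kernel.
        seq = x.transpose(1, 2).reshape(1, b * c, u)      # (1, B*C, U)
        w = kernel.view(b, 1, 1, self.kernel_size)
        w = w.expand(b, c, 1, self.kernel_size).reshape(b * c, 1, self.kernel_size)
        out = F.conv1d(seq, w, padding=self.kernel_size // 2, groups=b * c)
        return out.reshape(b, c, u).transpose(1, 2)       # (B, U, C)


# Usage: aggregate 8 snippet-level features per video, shape is preserved.
x = torch.randn(2, 8, 256)            # 2 videos, 8 snippets, 256-D features
dsa = DynamicSegmentAggregation(256)
y = dsa(x)                            # (2, 8, 256), temporally mixed
```

Because the kernel is produced from the input itself, the temporal aggregation weights vary per video, which is the property that distinguishes this scheme from averaging snippet predictions with fixed weights.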





    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. action recognition
    2. neural networks
    3. video representation learning

    Qualifiers

    • Research-article

    Conference

    MM '21: ACM Multimedia Conference
    October 20-24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months): 30
    • Downloads (Last 6 weeks): 4

    Reflects downloads up to 16 Nov 2024

    Cited By

    • (2024) Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3664654. Online publication date: 13-May-2024.
    • (2023) YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines, 11(7), 677. DOI: 10.3390/machines11070677. Online publication date: 23-Jun-2023.
    • (2023) A BERT-Based Joint Channel–Temporal Modeling for Action Recognition. IEEE Sensors Journal, 23(19), 23765-23779. DOI: 10.1109/JSEN.2023.3303912. Online publication date: 1-Oct-2023.
    • (2023) What Can Simple Arithmetic Operations Do for Temporal Modeling? 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 13666-13676. DOI: 10.1109/ICCV51070.2023.01261. Online publication date: 1-Oct-2023.
    • (2023) Audio-Visual Glance Network for Efficient Video Recognition. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 10116-10125. DOI: 10.1109/ICCV51070.2023.00931. Online publication date: 1-Oct-2023.
    • (2023) Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6620-6630. DOI: 10.1109/CVPR52729.2023.00640. Online publication date: Jun-2023.
    • (2023) Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective. International Journal of Computer Vision, 132(2), 392-409. DOI: 10.1007/s11263-023-01876-w. Online publication date: 7-Sep-2023.
    • (2023) MAF: Multimodal Auto Attention Fusion for Video Classification. Advances and Trends in Artificial Intelligence. Theory and Applications, 253-264. DOI: 10.1007/978-3-031-36819-6_22. Online publication date: 19-Jul-2023.
    • (2022) Multi-scale adaptive network for single image denoising. Proceedings of the 36th International Conference on Neural Information Processing Systems, 14099-14112. DOI: 10.5555/3600270.3601295. Online publication date: 28-Nov-2022.
    • (2022) MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning. Proceedings of the 30th ACM International Conference on Multimedia, 1348-1357. DOI: 10.1145/3503161.3547888. Online publication date: 10-Oct-2022.
