DOI: 10.1145/3474085.3475344

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Published: 17 October 2021

Abstract

Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. As a result, the video-level prediction does not account for spatio-temporal features that capture how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. More specifically, we generate a dynamic kernel for a convolutional operation to adaptively aggregate long-range temporal information across adjacent snippets. The DSA module is an efficient plug-and-play module that can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. We coin the final video architecture DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1, and ActivityNet) to demonstrate its superiority. The proposed DSA module significantly benefits a variety of video recognition models. For example, equipped with DSA modules, I3D ResNet-50 improves its top-1 accuracy on Kinetics-400 from 74.9% to 78.2%. Code is available at https://github.com/whwu95/DSANet.
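To make the dynamic-kernel idea concrete, below is a minimal PyTorch sketch of one way such a module could work. This is a hypothetical illustration, not the authors' released implementation: the module name, the bottleneck kernel generator, the per-video (rather than per-channel) kernel, and the (B, U, C) snippet-feature shape are all assumptions; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicSegmentAggregation(nn.Module):
    """Hypothetical DSA-style sketch: predict one small 1-D temporal kernel
    per video from a global snippet descriptor, then convolve it over the
    snippet axis so long-range aggregation adapts to each input video."""

    def __init__(self, channels: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        # kernel_size is assumed odd so "same" padding preserves U snippets.
        self.kernel_size = kernel_size
        # Bottleneck MLP mapping the global descriptor to kernel weights.
        self.kernel_gen = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, kernel_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, U, C) -- B videos, U snippets, C channels per snippet feature.
        b, u, c = x.shape
        desc = x.mean(dim=1)                              # (B, C) global descriptor
        kernel = self.kernel_gen(desc).softmax(dim=-1)    # (B, K), one kernel per video
        # Apply each video's kernel via a grouped 1-D convolution: fold batch
        # and channels into groups so every (video, channel) sequence is
        # filtered by that video's dynamic kernel.
        seq = x.transpose(1, 2).reshape(1, b * c, u)      # (1, B*C, U)
        w = kernel.view(b, 1, 1, self.kernel_size)
        w = w.expand(b, c, 1, self.kernel_size).reshape(b * c, 1, self.kernel_size)
        out = F.conv1d(seq, w, padding=self.kernel_size // 2, groups=b * c)
        return out.reshape(b, c, u).transpose(1, 2)       # (B, U, C)


# Usage: aggregate 8 snippet-level features per video, shape is preserved.
x = torch.randn(2, 8, 256)            # 2 videos, 8 snippets, 256-D features
dsa = DynamicSegmentAggregation(256)
y = dsa(x)                            # (2, 8, 256), temporally mixed
```

Because the kernel is produced from the input itself, the temporal aggregation weights vary per video, which is the property that distinguishes this scheme from averaging snippet predictions with fixed weights.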





    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. action recognition
    2. neural networks
    3. video representation learning

    Qualifiers

    • Research-article

    Conference

    MM '21: ACM Multimedia Conference
    October 20-24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months): 30
    • Downloads (Last 6 weeks): 4

    Reflects downloads up to 16 Nov 2024

    Cited By

    • (2024) Spatiotemporal Inconsistency Learning and Interactive Fusion for Deepfake Video Detection. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3664654. Online publication date: 13-May-2024.
    • (2023) YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines, 11(7), 677. DOI: 10.3390/machines11070677. Online publication date: 23-Jun-2023.
    • (2023) A BERT-Based Joint Channel–Temporal Modeling for Action Recognition. IEEE Sensors Journal, 23(19), 23765-23779. DOI: 10.1109/JSEN.2023.3303912. Online publication date: 1-Oct-2023.
    • (2023) What Can Simple Arithmetic Operations Do for Temporal Modeling? 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 13666-13676. DOI: 10.1109/ICCV51070.2023.01261. Online publication date: 1-Oct-2023.
    • (2023) Audio-Visual Glance Network for Efficient Video Recognition. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 10116-10125. DOI: 10.1109/ICCV51070.2023.00931. Online publication date: 1-Oct-2023.
    • (2023) Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6620-6630. DOI: 10.1109/CVPR52729.2023.00640. Online publication date: Jun-2023.
    • (2023) Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective. International Journal of Computer Vision, 132(2), 392-409. DOI: 10.1007/s11263-023-01876-w. Online publication date: 7-Sep-2023.
    • (2023) MAF: Multimodal Auto Attention Fusion for Video Classification. Advances and Trends in Artificial Intelligence. Theory and Applications, 253-264. DOI: 10.1007/978-3-031-36819-6_22. Online publication date: 19-Jul-2023.
    • (2022) Multi-scale adaptive network for single image denoising. Proceedings of the 36th International Conference on Neural Information Processing Systems, 14099-14112. DOI: 10.5555/3600270.3601295. Online publication date: 28-Nov-2022.
    • (2022) MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning. Proceedings of the 30th ACM International Conference on Multimedia, 1348-1357. DOI: 10.1145/3503161.3547888. Online publication date: 10-Oct-2022.
