DOI: 10.1145/3474085.3478329
short-paper

PyTorchVideo: A Deep Learning Library for Video Understanding

Published: 17 October 2021

Abstract

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.
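
As an illustration of the kind of clip transformation the abstract refers to, the sketch below implements uniform temporal subsampling: picking a fixed number of evenly spaced frames from a clip. It is written in plain Python and is not taken from the paper or from the library; the function names `uniform_temporal_indices` and `subsample_clip` are hypothetical, and the rounding rule (integer floor) is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def uniform_temporal_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return num_samples frame indices spread evenly over [0, num_frames - 1].

    Hypothetical helper for illustration only; not the PyTorchVideo API.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    # Indices repeat when the clip has fewer frames than requested samples.
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


def subsample_clip(frames: Sequence[T], num_samples: int) -> List[T]:
    """Select evenly spaced frames from a clip given as a sequence of frames."""
    return [frames[i] for i in uniform_temporal_indices(len(frames), num_samples)]


if __name__ == "__main__":
    clip = list("abcdefghij")  # stand-in for 10 decoded frames
    print(subsample_clip(clip, 4))
```

A fixed-length output such as this lets clips of different durations be batched together, which is one reason video data-loading pipelines commonly apply a step like it before spatial transforms.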

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. video representation learning
  2. video understanding

Qualifiers

  • Short-paper

Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 86
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 16 Nov 2024

Cited By

  • (2024) Video WeAther RecoGnition (VARG): An Intensity-Labeled Video Weather Recognition Dataset. Journal of Imaging 10:11 (281). DOI: 10.3390/jimaging10110281. Online publication date: 5-Nov-2024
  • (2024) FedFSLAR: A Federated Learning Framework for Few-shot Action Recognition. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW) (270-279). DOI: 10.1109/WACVW60836.2024.00035. Online publication date: 1-Jan-2024
  • (2024) Differentially Private Video Activity Recognition. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (6643-6653). DOI: 10.1109/WACV57701.2024.00652. Online publication date: 3-Jan-2024
  • (2024) Egocentric Action Recognition by Capturing Hand-Object Contact and Object State. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (6527-6537). DOI: 10.1109/WACV57701.2024.00641. Online publication date: 3-Jan-2024
  • (2024) VR.net: A Real-world Dataset for Virtual Reality Motion Sickness Research. IEEE Transactions on Visualization and Computer Graphics 30:5 (2330-2336). DOI: 10.1109/TVCG.2024.3372044. Online publication date: May-2024
  • (2024) Impact of Annotation Modality on Label Quality and Model Performance in the Automatic Assessment of Laughter In-the-Wild. IEEE Transactions on Affective Computing 15:2 (519-534). DOI: 10.1109/TAFFC.2023.3269003. Online publication date: Apr-2024
  • (2024) Action-Slot: Visual Action-Centric Representations for Multi-Label Atomic Activity Recognition in Traffic Scenes. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (18451-18461). DOI: 10.1109/CVPR52733.2024.01746. Online publication date: 16-Jun-2024
  • (2023) VOCALExplore: Pay-as-You-Go Video Data Exploration and Model Building. Proceedings of the VLDB Endowment 16:13 (4188-4201). DOI: 10.14778/3625054.3625057. Online publication date: 1-Sep-2023
  • (2023) Self-Similarity is all You Need for Fast and Light-Weight Generic Event Boundary Detection. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1-5). DOI: 10.1109/ICASSP49357.2023.10096176. Online publication date: 4-Jun-2023
  • (2023) Assessing the determinants of larval fish strike rates using computer vision. Ecological Informatics 77 (102195). DOI: 10.1016/j.ecoinf.2023.102195. Online publication date: Nov-2023
