Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Published: 17 October 2021

Abstract

Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, pre-training neural models for video and language understanding remains challenging, especially for Chinese video-language data, for the following reasons. First, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames but ignore other valuable semantic and structural information in video-language content, such as sequential order and spatiotemporal relationships. Second, there are conflicts between video-sentence alignment and other proxy tasks. Third, there is a lack of large-scale, high-quality Chinese video-language datasets (e.g., containing 10 million unique videos), which are fundamental to the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named Victor, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, Victor constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture more complex multimodal semantic and structural relationships from different perspectives. Victor is trained on a large-scale Chinese video-language dataset containing over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained Victor model to a series of downstream applications and demonstrate its superior performance compared with state-of-the-art pre-training methods such as VideoBERT and UniVL.
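
The abstract describes proxy tasks built under the contrastive learning paradigm over paired video and text. As a rough, illustrative sketch of that general idea (not the Victor implementation), the snippet below computes a symmetric InfoNCE-style contrastive loss over a batch of aligned video/text embeddings; all names (video_text_contrastive_loss, video_emb, text_emb, temperature) are hypothetical and assume PyTorch.

# Illustrative sketch only (assumes PyTorch): a generic symmetric InfoNCE loss
# over paired video/text embeddings, in the spirit of the contrastive proxy
# tasks described above. It is not the paper's actual implementation.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of aligned (video, text) embedding pairs."""
    # L2-normalize so dot products become cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares video i with text j.
    logits = video_emb @ text_emb.t() / temperature

    # The matched pair for each row/column lies on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example usage with random 256-dimensional embeddings for 8 pairs.
if __name__ == "__main__":
    video = torch.randn(8, 256)
    text = torch.randn(8, 256)
    print(video_text_contrastive_loss(video, text).item())

In this formulation the matched pairs sit on the diagonal of the similarity matrix, so each video is pulled toward its own description and pushed away from the other descriptions in the batch, and vice versa.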

      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. contrastive learning
      2. multimodal pre-training
      3. video and language analysis

      Qualifiers

      • Research-article

      Conference

      MM '21: ACM Multimedia Conference
      October 20 - 24, 2021
      Virtual Event, China

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
