Computer Science > Computer Vision and Pattern Recognition

arXiv:2212.04979 (cs)

[Submitted on 9 Dec 2022 (v1), last revised 15 Mar 2023 (this version, v3)]

Title: VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Authors: Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

Abstract: We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question-answering and video captioning.
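The central idea in the abstract, applying attentional pooling to flattened frame embeddings, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes a single attention head with no key/value projections, small toy dimensions, randomly initialised queries, and two poolers applied in parallel to the same tokens; the names `flatten_frames`, `attentional_pool`, `gen_queries`, and `con_query` are hypothetical and chosen only for illustration.

```python
import math
import random


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists [T][N][D] into one sequence [T*N][D]."""
    return [token for frame in frame_tokens for token in frame]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(queries, tokens):
    """Simplified single-head cross-attention pooling (an assumption, not CoCa's exact layer).

    queries: [Q][D] learned query vectors; tokens: [L][D] input embeddings.
    Returns [Q][D]: one attention-weighted average of the tokens per query.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for query in queries:
        scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


random.seed(0)
T, N, D = 4, 3, 8  # toy sizes: frames, tokens per frame, embedding width


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(D)]


# Stand-in for per-frame outputs of a pretrained image encoder.
video = [[rand_vec() for _ in range(N)] for _ in range(T)]
flat = flatten_frames(video)  # T*N tokens, treated as one long sequence

gen_queries = [rand_vec() for _ in range(5)]  # hypothetical generative-pooler queries
con_query = [rand_vec()]                      # hypothetical single contrastive query

gen_out = attentional_pool(gen_queries, flat)  # 5 vectors of width D
con_out = attentional_pool(con_query, flat)    # 1 vector of width D

print(len(flat), len(gen_out), len(con_out))
```

Under these assumptions the pooled output size depends only on the number of queries, not on the number of frames, which is one way to read the claim that the pooling layers adapt to video without an added cross-frame fusion module.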

Comments:	Tech report. arXiv v3: update text
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2212.04979 [cs.CV]
	(or arXiv:2212.04979v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.04979

Submission history

From: Shen Yan
[v1] Fri, 9 Dec 2022 16:39:09 UTC (398 KB)
[v2] Wed, 1 Feb 2023 22:32:31 UTC (410 KB)
[v3] Wed, 15 Mar 2023 06:48:23 UTC (570 KB)
