Computer Science > Computer Vision and Pattern Recognition

arXiv:2204.12293 (cs)

[Submitted on 26 Apr 2022]

Title:Contrastive Language-Action Pre-training for Temporal Localization

Authors:Mengmeng Xu, Erhan Gundogdu, Maksim Lapin, Bernard Ghanem, Michael Donoser, Loris Bazzani

View PDF

Abstract:Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and lack of temporal annotations at large-scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents to capture the relations between different action categories and the background context in a video clip which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training approach without freezing the video encoder which leverages language. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.

Comments:	18 pages, 4 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.12293 [cs.CV]
	(or arXiv:2204.12293v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.12293

Submission history

From: Erhan Gundogdu [view email]
[v1] Tue, 26 Apr 2022 13:17:50 UTC (1,932 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Contrastive Language-Action Pre-training for Temporal Localization

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Contrastive Language-Action Pre-training for Temporal Localization

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators