arXiv:2404.00801 (cs)

[Submitted on 31 Mar 2024 (v1), last revised 21 Jul 2024 (this version, v2)]

Title: $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Authors: Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen

Abstract: Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at this https URL.
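
The coarse-to-fine scheme described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function name `reversed_recurrent_pool`, the fixed `gate` value, and the parameter-free attention used for pooling and temporal refinement are all assumptions made for illustration, standing in for the learned $R^2$ Block. It only shows the control flow: visit frozen per-layer features from the last layer backwards, pool patches conditioned on the query, and refine the running per-frame state across time.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reversed_recurrent_pool(layer_feats, query, gate=0.5):
    """Illustrative only (hypothetical helper, not the paper's R^2 Block).

    layer_feats: (K, T, P, D) patch features from K frozen encoder layers,
                 T frames, P patches per frame, D channels.
    query:       (D,) pooled text-query embedding.
    Returns a (T, D) per-frame state after visiting layers last-to-first.
    """
    K, T, P, D = layer_feats.shape
    hidden = np.zeros((T, D))
    for k in range(K - 1, -1, -1):  # start from the last layer
        feats = layer_feats[k]  # (T, P, D)
        # Query-conditioned spatial pooling: weight patches by similarity.
        weights = softmax(feats @ query / np.sqrt(D), axis=-1)  # (T, P)
        pooled = (weights[..., None] * feats).sum(axis=1)  # (T, D)
        # Recurrent aggregation with a fixed gate (assumed, not learned).
        hidden = gate * hidden + (1.0 - gate) * pooled
        # Temporal refinement: plain self-attention across frames.
        attn = softmax(hidden @ hidden.T / np.sqrt(D), axis=-1)  # (T, T)
        hidden = 0.5 * (hidden + attn @ hidden)
    return hidden


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 6, 9, 8))  # 4 layers, 6 frames, 9 patches, 8 dims
query = rng.normal(size=8)
state = reversed_recurrent_pool(feats, query)
frame_scores = state @ query  # one relevance score per frame
```

In this sketch the per-frame scores could feed a highlight-detection or moment-retrieval head; the actual heads, losses, and learned parameters are described in the paper and its released code rather than here.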

Comments:	ECCV 2024 Camera Ready
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.00801 [cs.CV]
	(or arXiv:2404.00801v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.00801

Submission history

From: Ye Liu
[v1] Sun, 31 Mar 2024 21:17:48 UTC (20,356 KB)
[v2] Sun, 21 Jul 2024 16:17:07 UTC (9,747 KB)
