Computer Science > Computer Vision and Pattern Recognition

arXiv:2405.19465 (cs)

[Submitted on 29 May 2024]

Title: RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

Authors: Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li

Abstract: Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods rely on image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. In addition, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and then augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves performance superior or comparable to that of the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.
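
The two ingredients named in the abstract, a low-rank refinement of frozen per-frame features and a selection of the top responsive patches, can be illustrated with a generic sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the residual form `features + scale * (features @ down @ up)`, the zero-initialised `up` matrix, and every function and variable name below are assumptions chosen for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def low_rank_residual(features, down, up, scale=1.0):
    """Return features + scale * (features @ down @ up).

    `features` stands in for frozen backbone outputs and is never modified;
    only `down` (dim x rank) and `up` (rank x dim) would be trained.
    This residual form is an assumption, not the paper's stated design.
    """
    delta = matmul(matmul(features, down), up)
    return [[f + scale * d for f, d in zip(f_row, d_row)]
            for f_row, d_row in zip(features, delta)]


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def count_adapter_params(dim, rank):
    """Trainable parameters in one low-rank pair (down + up)."""
    return 2 * dim * rank


# Hypothetical toy sizes: 4 frames, 8-dimensional features, rank-2 adapter.
rng = random.Random(0)
dim, rank, frames = 8, 2, 4
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(frames)]
down = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(dim)]
up = [[0.0] * dim for _ in range(rank)]  # zero init: adapter starts as identity

refined = low_rank_residual(features, down, up)

# Hypothetical per-patch response scores; keep the two most responsive patches.
selected = top_k_indices([0.1, 0.9, 0.4, 0.7], 2)
```

With a feature dimension of 512 and rank 8, such a pair would hold `count_adapter_params(512, 8)` = 8192 trainable weights, against 262144 for a full 512 x 512 matrix, which is the sense in which low-rank adapters are parameter-efficient.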

Comments:	Accepted by ACL 2024 Findings
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.19465 [cs.CV]
	(or arXiv:2405.19465v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.19465

Submission history

From: Meng Cao
[v1] Wed, 29 May 2024 19:23:53 UTC (1,605 KB)
