Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.16050 (cs)

[Submitted on 25 Feb 2024 (v1), last revised 3 Oct 2024 (this version, v2)]

Title: Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge

Authors: Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, Zilong Zheng

Abstract: Despite progress in multimodal large language models (MLLMs), the challenge of interpreting long-form videos in response to linguistic queries persists, largely due to the inefficiency in temporal grounding and limited pre-trained context window size. In this work, we introduce Temporal Grounding Bridge (TGB), a novel framework that bootstraps MLLMs with advanced temporal grounding capabilities and broadens their contextual scope. Our framework significantly enhances the temporal capabilities of current MLLMs through three key innovations: an efficient multi-span temporal grounding algorithm applied to low-dimension temporal features projected from flow; a multimodal length extrapolation training paradigm that utilizes low-dimension temporal features to extend the training context window size; and a bootstrapping framework that bridges our model with pluggable MLLMs without requiring annotation. We validate TGB across seven video benchmarks and demonstrate substantial performance improvements compared with prior MLLMs. Notably, our model, initially trained on sequences of four frames, effectively handles sequences up to 16× longer without sacrificing performance, highlighting its scalability and effectiveness in real-world applications. Our code is publicly available at this https URL

Comments:	To appear at EMNLP 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2402.16050 [cs.CV]
	(or arXiv:2402.16050v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.16050

Submission history

From: Yuxuan Wang
[v1] Sun, 25 Feb 2024 10:27:46 UTC (26,753 KB)
[v2] Thu, 3 Oct 2024 09:24:56 UTC (28,657 KB)
