Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.15047 (cs)

[Submitted on 21 Jul 2024 (v1), last revised 23 Jul 2024 (this version, v2)]

Title: End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

Authors: Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao

Abstract: Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, requiring intricate interactions between the visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, relevant context of a video that VideoQA requires. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and the answer generator. Experimental results on three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new state of the art on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Through both quantitative and qualitative analyses, we also validate the effectiveness of each design choice.
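
The abstract does not give the scoring formulas or the sampling procedure, so the following is only an illustrative sketch of the general idea, not VidF4's actual method. It assumes frames and the question are already embedded as vectors, scores each frame by cosine relevance to the question minus a penalty for similarity to the other frames, and uses a temperature softmax as a smooth stand-in for hard frame selection. The function names, the `redundancy_weight` and `temperature` parameters, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_frames(frames, question, redundancy_weight=0.5):
    """Hypothetical score: question relevance minus mean similarity to other frames."""
    scores = []
    for i, f in enumerate(frames):
        relevance = cosine(f, question)
        others = [cosine(f, g) for j, g in enumerate(frames) if j != i]
        redundancy = sum(others) / len(others) if others else 0.0
        scores.append(relevance - redundancy_weight * redundancy)
    return scores

def soft_sampling_weights(scores, temperature=0.1):
    """Softmax over scores: a smooth relaxation of picking the top frames."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-D embeddings, made up purely for illustration.
question = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
scores = score_frames(frames, question)
weights = soft_sampling_weights(scores)
```

Because the weights are a smooth function of the scores, gradients from a downstream answer loss could in principle reach a learned scorer, which is the property that end-to-end training of a frame selector depends on. Lowering `temperature` pushes the weights toward a hard choice of the top-scoring frame.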

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2407.15047 [cs.CV]
	(or arXiv:2407.15047v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15047

Submission history

From: Jianxin Liang
[v1] Sun, 21 Jul 2024 04:09:37 UTC (16,709 KB)
[v2] Tue, 23 Jul 2024 14:56:22 UTC (16,709 KB)
