Computer Science > Computer Vision and Pattern Recognition

arXiv:2307.13250 (cs)

[Submitted on 25 Jul 2023]

Title: Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Authors: Yi Cheng, Hehe Fan, Dongyun Lin, Ying Sun, Mohan Kankanhalli, Joo-Hwee Lim


Abstract: The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the impact of spatial and temporal relation reasoning on each other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of KRST over multiple state-of-the-art methods.
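
The keyword-weighting step mentioned in the abstract can be illustrated with a generic softmax attention pooling over question tokens. This is only a minimal sketch under stated assumptions, not the authors' KRST implementation: the function name `attention_pool`, the scoring vector `score_vec`, and the toy token features are all hypothetical, and the abstract does not specify how the attention scores are actually computed.

```python
import math

def attention_pool(token_feats, score_vec):
    """Pool per-token features into one question vector using softmax attention.

    token_feats: list of per-token feature vectors (lists of floats).
    score_vec:   hypothetical learned vector; a token's score is its dot
                 product with this vector (an assumption for illustration).
    """
    # One scalar relevance score per token.
    scores = [sum(f * w for f, w in zip(feat, score_vec)) for feat in token_feats]
    # Numerically stable softmax over the token scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of token features: high-weight tokens dominate the result.
    dim = len(token_feats[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, token_feats))
              for d in range(dim)]
    return weights, pooled

# Toy question; the feature values are made up purely for illustration.
tokens = ["what", "is", "the", "man", "holding"]
feats = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [1.0, 0.5], [0.8, 1.0]]
score_vec = [2.0, 2.0]

weights, pooled = attention_pool(feats, score_vec)
for tok, w in zip(tokens, weights):
    print(f"{tok:8s} {w:.3f}")
```

In this toy setup the content words ("man", "holding") receive most of the attention mass, so the pooled question vector is dominated by them. How such a vector then guides graph construction is described in the paper itself and is not sketched here.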

Comments:	under review
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.13250 [cs.CV]
	(or arXiv:2307.13250v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.13250

Submission history

From: Yi Cheng
[v1] Tue, 25 Jul 2023 04:41:32 UTC (1,290 KB)
