Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.02574 (cs)

[Submitted on 4 Feb 2024]

Title: Spatio-temporal Prompting Network for Robust Video Feature Extraction

Authors: Guanxiong Sun, Chi Wang, Zhaoyu Zhang, Jiankang Deng, Stefanos Zafeiriou, Yang Hua

Abstract: Frame quality deterioration is one of the main challenges in the field of video understanding. To compensate for the information loss caused by deteriorated frames, recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each integration module is specifically tailored for its target task, making it difficult to generalise to multiple tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and accurate video features by dynamically adjusting the input features in the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information from neighbouring frames. These video prompts are then prepended to the patch embeddings of the current frame, forming the updated input for video feature extraction. Moreover, STPN is easy to generalise to various video tasks because it does not contain task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Code is available at this https URL.
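The prompting step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The abstract does not say how STPN computes its prompts, so the predictor here is a hypothetical single-head cross-attention in which K learnable queries attend to all patch tokens of the neighbouring frames. The function names, the shapes (T neighbouring frames, N patches per frame, embedding width D, K prompts) and the random weights are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def predict_video_prompts(neighbour_tokens, queries, w_k, w_v):
    """HYPOTHETICAL prompt predictor (assumption, not STPN's published design).

    neighbour_tokens: (T, N, D) patch embeddings of T neighbouring frames
    queries:          (K, D) learnable prompt queries
    w_k, w_v:         (D, D) key and value projections
    returns:          (K, D) video prompts summarising the neighbouring frames
    """
    t, n, d = neighbour_tokens.shape
    tokens = neighbour_tokens.reshape(t * n, d)      # pool space and time into one token set
    keys = tokens @ w_k                              # (T*N, D)
    values = tokens @ w_v                            # (T*N, D)
    attn = softmax(queries @ keys.T / np.sqrt(d))    # (K, T*N) attention weights
    return attn @ values                             # (K, D)


def prepend_prompts(prompts, current_patches):
    """Prepend K prompts to the N patch embeddings of the current frame."""
    return np.concatenate([prompts, current_patches], axis=0)  # (K+N, D)


# Usage with random stand-in data (illustrative sizes only).
rng = np.random.default_rng(0)
T, N, D, K = 2, 196, 64, 8
neighbours = rng.standard_normal((T, N, D))
current = rng.standard_normal((N, D))
queries = rng.standard_normal((K, D))
w_k = rng.standard_normal((D, D)) / np.sqrt(D)
w_v = rng.standard_normal((D, D)) / np.sqrt(D)

prompts = predict_video_prompts(neighbours, queries, w_k, w_v)
tokens = prepend_prompts(prompts, current)
print(tokens.shape)  # (204, 64): 8 prompts + 196 patch tokens
```

The backbone would then process the K + N tokens as usual. Because only the input sequence changes, the same mechanism can in principle sit in front of different task heads, which is the property the abstract credits for STPN's ability to generalise across tasks.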

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2402.02574 [cs.CV]
	(or arXiv:2402.02574v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.02574
Journal reference:	2023 International Conference on Computer Vision (ICCV) 13541-13551
Related DOI:	https://doi.org/10.1109/ICCV51070.2023.01250

Submission history

From: Guanxiong Sun
[v1] Sun, 4 Feb 2024 17:52:04 UTC (8,097 KB)