Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.14890 (cs)

[Submitted on 23 Apr 2024]

Title:DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Authors:Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang

Abstract:As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) recently gains increasing attention, with the development of vision-language pre-trainings. To enable generalization of arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill the research gap, this paper pioneers to evaluate existing methods by simulating multi-level noises of various types, and reveals their poor robustness. To tackle the noisy OVAR task, we further propose one novel DENOISER framework, covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via one decoding process, i.e., propose text candidates, then utilize inter-modal and intra-modal information to vote for the best. At the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternately iterate between generative and discriminative parts for progressive refinements. The denoised text classes help OVAR models classify visual samples more accurately; in return, classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.14890 [cs.CV]
	(or arXiv:2404.14890v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.14890

Submission history

From: Chen Ju [view email]
[v1] Tue, 23 Apr 2024 10:17:42 UTC (3,018 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators