Computer Science > Computer Vision and Pattern Recognition

arXiv:2311.12391 (cs)

[Submitted on 21 Nov 2023]

Title: From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation

Authors: Jiaxin Ge, Sanjay Subramanian, Trevor Darrell, Boyi Li

Abstract: Addressing the challenge of adapting pre-trained vision-language models for generating insightful explanations for visual reasoning tasks with limited annotations, we present ReVisE: a $\textbf{Re}$cursive $\textbf{Vis}$ual $\textbf{E}$xplanation algorithm. Our method iteratively computes visual features (conditioned on the text input), an answer, and an explanation, to improve the explanation quality step by step until the answer converges. We find that this multi-step approach guides the model to correct its own answers and outperforms single-step explanation generation. Furthermore, explanations generated by ReVisE also serve as valuable annotations for few-shot self-training. Our approach outperforms previous methods while utilizing merely 5% of the human-annotated explanations across 10 metrics, demonstrating up to a 4.2 and 1.3 increase in BLEU-1 score on the VCR and VQA-X datasets, underscoring the efficacy and data-efficiency of our method.
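
The recursive loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`recursive_explain`, `toy_encode`, `toy_generate`), the `max_steps` cap, and the way the previous answer and explanation are appended to the text input are all assumptions made for illustration. The only parts taken from the abstract are the three per-step outputs (text-conditioned visual features, an answer, an explanation) and the stopping rule (iterate until the answer converges).

```python
from typing import Any, Callable, Tuple


def recursive_explain(
    image: Any,
    question: str,
    encode: Callable[[Any, str], Any],
    generate: Callable[[Any, str], Tuple[str, str]],
    max_steps: int = 5,  # assumption: a safety cap, not stated in the abstract
) -> Tuple[str, str, int]:
    """Repeat encode -> generate until the answer stops changing."""
    text = question
    prev_answer = None
    answer, explanation = "", ""
    steps = 0
    for steps in range(1, max_steps + 1):
        # Visual features are conditioned on the current text input.
        features = encode(image, text)
        answer, explanation = generate(features, question)
        if answer == prev_answer:
            break  # the answer has converged
        prev_answer = answer
        # Assumption: the next step conditions on the previous answer and explanation.
        text = f"{question} {answer} {explanation}"
    return answer, explanation, steps


# Hypothetical stand-ins for a vision-language model, for demonstration only.
def toy_encode(image: Any, text: str) -> str:
    return text  # pretend the "features" are just the conditioning text


def toy_generate(features: str, question: str) -> Tuple[str, str]:
    # The toy model corrects itself once an earlier explanation is available.
    if "because" in features:
        return "dog", "because it has a leash"
    return "cat", "because it is small"


if __name__ == "__main__":
    print(recursive_explain(None, "What animal is this?", toy_encode, toy_generate))
```

With the toy stand-ins, the first step answers "cat", the second step revises the answer to "dog" after seeing the earlier explanation, and the third step repeats "dog", at which point the loop stops. A real system would replace `toy_encode` and `toy_generate` with calls into a pre-trained vision-language model.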

Comments:	EMNLP 2023 Main
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.12391 [cs.CV]
	(or arXiv:2311.12391v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.12391

Submission history

From: Jiaxin Ge
[v1] Tue, 21 Nov 2023 07:02:32 UTC (5,920 KB)
