Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.02662 (cs)

[Submitted on 5 Feb 2024]

Title:Image-Caption Encoding for Improving Zero-Shot Generalization

Authors:Eric Yang Yu, Christopher Liao, Sathvik Ravi, Theodoros Tsiligkaridis, Brian Kulis

View PDF

Abstract:Recent advances in vision-language models have combined contrastive approaches with generative methods to achieve state-of-the-art (SOTA) on downstream inference tasks like zero-shot image classification. However, a persistent issue of these models for image classification is their out-of-distribution (OOD) generalization capabilities. We first show that when an OOD data point is misclassified, the correct class can be typically found in the Top-K predicted classes. In order to steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method, a straightforward approach that directly enforces consistency between the image-conditioned and caption-conditioned predictions at evaluation time only. Intuitively, we take advantage of unique properties of the generated captions to guide our local search for the correct class label within the Top-K predicted classes. We show that our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets. Our code: this https URL

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2402.02662 [cs.CV]
	(or arXiv:2402.02662v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.02662

Submission history

From: Christopher Liao [view email]
[v1] Mon, 5 Feb 2024 01:14:07 UTC (2,784 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Image-Caption Encoding for Improving Zero-Shot Generalization

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Image-Caption Encoding for Improving Zero-Shot Generalization

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators