Computer Science > Computer Vision and Pattern Recognition

arXiv:2311.17647 (cs)

[Submitted on 29 Nov 2023 (v1), last revised 10 Jun 2024 (this version, v2)]

Title: Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

Authors: Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi

Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when the text instruction is presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following under both text-modality and visual-modality instructions.

Comments:	Github: this https URL, Model and Data: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2311.17647 [cs.CV]
	(or arXiv:2311.17647v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17647

Submission history

From: Yujie Lu
[v1] Wed, 29 Nov 2023 14:08:53 UTC (9,400 KB)
[v2] Mon, 10 Jun 2024 23:39:24 UTC (3,380 KB)
