Computer Science > Artificial Intelligence

arXiv:2409.15657 (cs)

[Submitted on 24 Sep 2024 (v1), last revised 30 Oct 2024 (this version, v4)]

Title: M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Authors: Taowen Wang, Yiyang Liu, James Chenhao Liang, Junhan Zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu

Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
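
The general idea of prompt tuning that the abstract refers to can be illustrated with a small sketch: a few learnable vectors are placed in front of each modality's token sequence while the backbone stays frozen. This is not the authors' implementation. The function names, prompt counts, embedding widths, and the choice to prepend prompts at the input only are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def init_prompts(num_prompts, dim, seed=0):
    """Create `num_prompts` prompt vectors of width `dim`.

    Hypothetical initialisation: small uniform random values.
    In a real setup these would be the only trainable parameters.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_prompts)]


def prepend_prompts(prompts, tokens):
    """Return a new sequence with the prompt vectors placed before the tokens."""
    if prompts and tokens and len(prompts[0]) != len(tokens[0]):
        raise ValueError("prompt and token widths must match")
    return prompts + tokens


# Assumed sizes, chosen only to keep the example small.
visual_prompts = init_prompts(num_prompts=4, dim=8, seed=1)    # for the vision encoder
textual_prompts = init_prompts(num_prompts=2, dim=16, seed=2)  # for the language model

# Stand-ins for frozen embeddings: 9 image patches and 5 text tokens.
patch_embeddings = [[0.0] * 8 for _ in range(9)]
word_embeddings = [[0.0] * 16 for _ in range(5)]

vision_input = prepend_prompts(visual_prompts, patch_embeddings)
language_input = prepend_prompts(textual_prompts, word_embeddings)

# Only the prompt vectors would be updated during finetuning.
trainable_params = sum(len(p) * len(p[0])
                       for p in (visual_prompts, textual_prompts))
```

In this toy setting the sequences grow by the number of prompts per modality, and the trainable parameter count depends only on the prompt shapes, not on the size of the frozen backbone.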

Comments:	EMNLP 2024
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2409.15657 [cs.AI]
	(or arXiv:2409.15657v4 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2409.15657

Submission history

From: Taowen Wang
[v1] Tue, 24 Sep 2024 01:40:24 UTC (2,268 KB)
[v2] Wed, 25 Sep 2024 03:24:39 UTC (2,268 KB)
[v3] Fri, 27 Sep 2024 16:24:50 UTC (2,143 KB)
[v4] Wed, 30 Oct 2024 04:38:52 UTC (2,144 KB)