Computer Science > Computer Vision and Pattern Recognition

arXiv:2312.09251 (cs)

[Submitted on 14 Dec 2023]

Title:VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Authors:Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

View PDF

Abstract:In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as a language model processes text. To accomplish this, we initially propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and reconstruct them accordingly. In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora utilizing a unified auto-regressive objective (i.e., next-token prediction). Upon completion of pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, text-to-image generation, and more. Additionally, the pre-trained model retrains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on our VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights shall be released.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.09251 [cs.CV]
	(or arXiv:2312.09251v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.09251

Submission history

From: Jinguo Zhu [view email]
[v1] Thu, 14 Dec 2023 18:59:43 UTC (5,860 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators