Abstract
This paper proposes a simple yet effective framework, called GiT, that handles various vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope so that it can serve as a powerful vision foundation model (VFM). Unlike language modeling, however, visual tasks typically require specialized modules, such as bounding-box heads for detection and pixel decoders for segmentation, which greatly hinders the application of powerful multi-layer Transformers in the vision domain. To solve this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning) through sparse perception (e.g., detection) to dense prediction (e.g., segmentation). With these designs, the entire model is composed solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark for generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training, an effect similar to that observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results on various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models are available at https://github.com/Haiyang-W/GiT.
H. Wang and H. Tang—Equal contribution.
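To make the universal language interface concrete, the sketch below shows, in PyTorch, how heterogeneous targets such as a bounding box and a caption could be serialized into one shared token vocabulary and predicted by a single stack of vanilla Transformer layers with the usual next-token objective. The bin count, vocabulary layout, helper names, and layer sizes are illustrative assumptions for exposition, not the paper's actual configuration.

```python
# Minimal, hypothetical sketch of a shared token interface for vision tasks
# (assumed sizes and token layout; not the paper's exact configuration).
import torch
import torch.nn as nn

NUM_BINS = 1000                      # assumed: coordinates quantized into 1000 bins
TEXT_VOCAB = 30000                   # assumed: placeholder text vocabulary size
VOCAB = TEXT_VOCAB + NUM_BINS        # shared vocabulary: text tokens + coordinate tokens


def box_to_tokens(box, img_w, img_h):
    """Serialize a bounding box (x1, y1, x2, y2) into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [TEXT_VOCAB + int(v * (NUM_BINS - 1)) for v in norm]


class PlainTransformerLM(nn.Module):
    """A single stack of vanilla Transformer layers consumes image patch embeddings
    followed by prompt/target tokens and predicts a token at every position."""

    def __init__(self, dim=256, depth=6, heads=8, patch_size=16):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, dim)
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, patches, tokens):
        # patches: (B, num_patches, 3*patch*patch); tokens: (B, seq_len) integer ids
        x = torch.cat([self.patch_proj(patches), self.token_emb(tokens)], dim=1)
        # In practice a causal mask would restrict attention over the token part.
        out = self.blocks(x)
        return self.head(out)[:, patches.shape[1]:]  # logits for the token positions


# Usage: a detection target and a caption live in the same output space.
img_patches = torch.randn(1, 14 * 14, 3 * 16 * 16)    # 224x224 image, 16x16 patches
det_tokens = torch.tensor([box_to_tokens((30, 40, 120, 200), 224, 224)])
cap_tokens = torch.tensor([[101, 2050, 102, 0]])       # dummy text token ids
model = PlainTransformerLM()
det_logits = model(img_patches, det_tokens)            # (1, 4, VOCAB)
cap_logits = model(img_patches, cap_tokens)            # (1, 4, VOCAB)
```

Because every task reduces to predicting tokens from one shared vocabulary, a single decoding loop and a single loss suffice, which is the architectural simplification the abstract emphasizes.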
Acknowledgement
Haiyang Wang would like to thank Mingxu Tao for helpful discussions about language modeling. We sincerely appreciate all the anonymous reviewers for their valuable suggestions.