Abstract
This paper proposes a simple yet effective framework, called GiT, that handles various vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope so that it can serve as a powerful vision foundation model (VFM). Unlike language modeling, however, visual tasks typically require specialized modules, such as bounding-box heads for detection and pixel decoders for segmentation, which greatly hinders the application of powerful multi-layer Transformers in the vision domain. To solve this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning) through sparse perception (e.g., detection) to dense prediction (e.g., segmentation). With these designs, the entire model is composed solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark for generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training, an effect similar to that observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results on various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models are available at https://github.com/Haiyang-W/GiT.
H. Wang and H. Tang—Equal contribution.
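To make the universal language interface concrete, the sketch below shows, in PyTorch, how heterogeneous targets such as a bounding box and a caption could be serialized into one shared token vocabulary and predicted by a single stack of vanilla Transformer layers with the usual next-token objective. The bin count, vocabulary layout, helper names, and layer sizes are illustrative assumptions for exposition, not the paper's actual configuration.

```python
# Minimal, hypothetical sketch of a shared token interface for vision tasks
# (assumed sizes and token layout; not the paper's exact configuration).
import torch
import torch.nn as nn

NUM_BINS = 1000                      # assumed: coordinates quantized into 1000 bins
TEXT_VOCAB = 30000                   # assumed: placeholder text vocabulary size
VOCAB = TEXT_VOCAB + NUM_BINS        # shared vocabulary: text tokens + coordinate tokens


def box_to_tokens(box, img_w, img_h):
    """Serialize a bounding box (x1, y1, x2, y2) into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [TEXT_VOCAB + int(v * (NUM_BINS - 1)) for v in norm]


class PlainTransformerLM(nn.Module):
    """A single stack of vanilla Transformer layers consumes image patch embeddings
    followed by prompt/target tokens and predicts a token at every position."""

    def __init__(self, dim=256, depth=6, heads=8, patch_size=16):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, dim)
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, patches, tokens):
        # patches: (B, num_patches, 3*patch*patch); tokens: (B, seq_len) integer ids
        x = torch.cat([self.patch_proj(patches), self.token_emb(tokens)], dim=1)
        # In practice a causal mask would restrict attention over the token part.
        out = self.blocks(x)
        return self.head(out)[:, patches.shape[1]:]  # logits for the token positions


# Usage: a detection target and a caption live in the same output space.
img_patches = torch.randn(1, 14 * 14, 3 * 16 * 16)    # 224x224 image, 16x16 patches
det_tokens = torch.tensor([box_to_tokens((30, 40, 120, 200), 224, 224)])
cap_tokens = torch.tensor([[101, 2050, 102, 0]])       # dummy text token ids
model = PlainTransformerLM()
det_logits = model(img_patches, det_tokens)            # (1, 4, VOCAB)
cap_logits = model(img_patches, cap_tokens)            # (1, 4, VOCAB)
```

Because every task reduces to predicting tokens from one shared vocabulary, a single decoding loop and a single loss suffice, which is the architectural simplification the abstract emphasizes.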
Acknowledgement
Haiyang Wang would like to thank Mingxu Tao for helpful discussions about language modeling. We sincerely appreciate all the anonymous reviewers for their valuable suggestions.