GiT: Towards Generalist Vision Transformer Through Universal Language Interface

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15087)

Included in the following conference series: European Conference on Computer Vision

Abstract

This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope so that it can serve as a powerful vision foundation model (VFM). Unlike language modeling, however, visual tasks typically require task-specific modules, such as bounding-box heads for detection and pixel decoders for segmentation, which greatly hinders the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning), through sparse perception (e.g., detection), to dense prediction (e.g., segmentation). With these designs, the entire model is composed solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements over isolated training, mirroring an effect observed in LLMs. Further enriching the training with 27 datasets, GiT achieves strong zero-shot results on various tasks. Owing to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models are available at https://github.com/Haiyang-W/GiT.

H. Wang and H. Tang—Equal contribution.
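To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of the general approach described in the abstract: a single plain transformer consumes image patch tokens together with task-prompt tokens and auto-regressively emits output tokens from one shared vocabulary, so that captioning, detection, and segmentation differ only in how the prompt is written and how the emitted token ids are interpreted. This is not the authors' implementation; every class and function name is illustrative, and details of the real model (box/mask tokenization, attention masking, training objectives) are omitted.

```python
# Illustrative sketch only: one plain transformer, one shared output vocabulary.
import torch
import torch.nn as nn

class ToyUniversalInterfaceViT(nn.Module):
    def __init__(self, vocab_size=1000, dim=256, depth=4, heads=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, vocab_size)  # shared head: words, box bins, class ids

    def forward(self, image, token_ids):
        # image: (B, 3, H, W); token_ids: (B, T) = task prompt + tokens generated so far
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.token_embed(token_ids)                          # (B, T, dim)
        x = self.blocks(torch.cat([patches, tokens], dim=1))          # one joint sequence
        return self.head(x[:, -1])                                    # next-token logits

@torch.no_grad()
def greedy_decode(model, image, prompt_ids, steps=8):
    """Auto-regressive decoding: tasks differ only in the prompt and in how the
    emitted token ids are read back (text, box coordinates, pixel labels)."""
    seq = prompt_ids
    for _ in range(steps):
        next_id = model(image, seq).argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_id], dim=1)
    return seq

model = ToyUniversalInterfaceViT()
image = torch.randn(1, 3, 224, 224)
prompt = torch.tensor([[1, 2]])              # e.g. a hypothetical "<detect>" prompt
print(greedy_decode(model, image, prompt))   # (1, 2 + 8) token ids
```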



Acknowledgement

Haiyang Wang would like to thank Mingxu Tao for helpful discussions about language modeling. We sincerely thank the anonymous reviewers for their valuable suggestions.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1546 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, H. et al. (2025). GiT: Towards Generalist Vision Transformer Through Universal Language Interface. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15087. Springer, Cham. https://doi.org/10.1007/978-3-031-73397-0_4

  • DOI: https://doi.org/10.1007/978-3-031-73397-0_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73396-3

  • Online ISBN: 978-3-031-73397-0

  • eBook Packages: Computer Science, Computer Science (R0)
