Abstract
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images at fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address these challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images of any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema that organizes slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2–3 orders of magnitude more data on 8 benchmarks. Notably, our model built on LLaVA-1.5\(_{336\times 336}\) supports images at 6 times higher resolution (i.e., 672 \(\times \) 1008) and achieves a 5.7-point accuracy improvement on TextVQA.
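To make the modularization idea in the abstract concrete, the sketch below shows one plausible way to choose a slice grid for a native-resolution image and split it into encoder-sized slices. It is only an illustration under assumptions: the scoring heuristic, the `max_slices` budget, and the function names (`choose_grid`, `slice_image`) are hypothetical and not taken from the paper, and the compression module and spatial schema are not reproduced here.

```python
# Hedged sketch of the image-modularization idea described in the abstract:
# pick a grid of variable-sized slices for a native-resolution image so that
# each slice stays close to the visual encoder's pretraining input shape.
# This is an illustration under stated assumptions, NOT the authors' code.
import math
from typing import List, Tuple

from PIL import Image  # any image library would do


def choose_grid(width: int, height: int,
                encoder_size: int = 336,
                max_slices: int = 6) -> Tuple[int, int]:
    """Pick (cols, rows) whose slices best match the encoder's square input.

    A simple proxy for the idea: prefer grids whose per-slice aspect ratio
    is closest to 1:1 (the encoder's pretraining shape), among grids that
    roughly cover the image at encoder resolution.  The 0.2 weight is an
    arbitrary choice for this sketch.
    """
    ideal = math.ceil((width * height) / (encoder_size ** 2))
    best, best_score = (1, 1), float("inf")
    for cols in range(1, max_slices + 1):
        for rows in range(1, max_slices + 1):
            n = cols * rows
            if n > max_slices:
                continue
            slice_ar = (width / cols) / (height / rows)
            # penalize aspect-ratio distortion and deviation from the
            # number of slices needed to cover the native resolution
            score = abs(math.log(slice_ar)) + 0.2 * abs(n - ideal)
            if score < best_score:
                best, best_score = (cols, rows), score
    return best


def slice_image(img: Image.Image,
                encoder_size: int = 336,
                max_slices: int = 6) -> List[Image.Image]:
    """Split a native-resolution image into variable-sized slices,
    each resized to the encoder's fixed input size."""
    cols, rows = choose_grid(img.width, img.height, encoder_size, max_slices)
    sw, sh = img.width / cols, img.height / rows
    slices = []
    for r in range(rows):
        for c in range(cols):
            box = (round(c * sw), round(r * sh),
                   round((c + 1) * sw), round((r + 1) * sh))
            slices.append(img.crop(box).resize((encoder_size, encoder_size)))
    return slices
```

For a 672 \(\times \) 1008 image with a 336 \(\times \) 336 encoder, this heuristic selects a 2 \(\times \) 3 grid of distortion-free 336 \(\times \) 336 slices, which matches the resolution example given in the abstract; other inputs would yield variable-sized slices with mild resizing.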
Notes
- 1.
- 2. Note that the issue differs from the overlapping sliding windows in CNNs, since the overlaps in GPT-4V are inconsistent across images of different resolutions.
- 3. Note that besides visual encoding strategies, model behaviors are also influenced by the accumulated training dynamics and RLHF. Therefore, the double/quadruple effect does not dominate the results. All results are from GPT-4V on 03-05-2024.
Acknowledgement
This work is supported by the National Science and Technology Major Project (2020AAA0106502) and the National Natural Science Foundation of China (No. 62236004).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, Z. et al. (2025). LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73009-2
Online ISBN: 978-3-031-73010-8
eBook Packages: Computer Science, Computer Science (R0)