
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings, and image scales. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training and introduce a weight mix strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularities, providing language models with more robust image representations. We further propose an efficient strategy to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications, with highlighted fine-grained visual recognition abilities such as region-level understanding, caption grounding, document layout detection, and human pose estimation. We hope our work may cast light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
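
As a worked illustration of the joint mixing described above, the following Python sketch shows (i) the weight mix idea as a parameter-wise linear interpolation between an LLM checkpoint trained on real-world data and one trained on synthetic data, and (ii) the image-scale mix as one downsampled global view paired with high-resolution sub-image crops. This is a minimal sketch under assumed defaults (a mixing ratio alpha of 0.5 and a 2x2 sub-image grid), not the released implementation; consult the repository linked above for the authors' actual code.

# Minimal sketch (not the authors' released code) of two ideas from the
# abstract. The mixing ratio `alpha` and the 2x2 sub-image grid are
# illustrative assumptions.
from typing import Dict, List

import torch
from PIL import Image
from torchvision import transforms


def mix_llm_weights(
    real_state: Dict[str, torch.Tensor],
    synth_state: Dict[str, torch.Tensor],
    alpha: float = 0.5,
) -> Dict[str, torch.Tensor]:
    """Interpolate two checkpoints of the same architecture parameter-wise."""
    assert real_state.keys() == synth_state.keys()
    return {
        name: alpha * real_state[name] + (1.0 - alpha) * synth_state[name]
        for name in real_state
    }


def mix_image_scales(image: Image.Image, base_size: int = 224) -> List[torch.Tensor]:
    """Return one downsampled global view plus a 2x2 grid of high-res crops."""
    to_tensor = transforms.Compose(
        [transforms.Resize((base_size, base_size)), transforms.ToTensor()]
    )
    views = [to_tensor(image)]  # low-resolution view of the whole image
    hi_res = image.resize((2 * base_size, 2 * base_size))
    for row in range(2):
        for col in range(2):
            box = (col * base_size, row * base_size,
                   (col + 1) * base_size, (row + 1) * base_size)
            views.append(transforms.ToTensor()(hi_res.crop(box)))
    return views

Each returned view can be encoded independently and the resulting embeddings concatenated as the LLM's visual tokens, one plausible way to realize the mixing of scales and embeddings that the abstract describes.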

Z. Lin, D. Liu, R. Zhang, P. Gao and L. Qiu—Equal contribution.

P. Gao, Y. Qiao and H. Li—Equal advisory.

P. Gao—Project leader.



Acknowledgements

This project is funded in part by the National Key R&D Program of China (Projects 2022ZD0161100 and 2022ZD0160102), by the National Natural Science Foundation of China (Grant No. 62206272), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd. under the Innovation and Technology Commission (ITC)'s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information


Corresponding author

Correspondence to Peng Gao.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3773 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, Z. et al. (2025). SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_3


  • DOI: https://doi.org/10.1007/978-3-031-73033-7_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73032-0

  • Online ISBN: 978-3-031-73033-7

  • eBook Packages: Computer Science, Computer Science (R0)
