
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings, and image scales. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training and introduce a weight mix strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularities, providing language models with more robust image representations. We further propose an efficient strategy to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications, with highlighted fine-grained visual recognition abilities such as region-level understanding, caption grounding, document layout detection, and human pose estimation. We hope our work may cast light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
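
As a worked illustration of the joint mixing described above, the following Python sketch shows (i) the weight mix idea as a parameter-wise linear interpolation between an LLM checkpoint trained on real-world data and one trained on synthetic data, and (ii) the image-scale mix as one downsampled global view paired with high-resolution sub-image crops. This is a minimal sketch under assumed defaults (a mixing ratio alpha of 0.5 and a 2x2 sub-image grid), not the released implementation; consult the repository linked above for the authors' actual code.

# Minimal sketch (not the authors' released code) of two ideas from the
# abstract. The mixing ratio `alpha` and the 2x2 sub-image grid are
# illustrative assumptions.
from typing import Dict, List

import torch
from PIL import Image
from torchvision import transforms


def mix_llm_weights(
    real_state: Dict[str, torch.Tensor],
    synth_state: Dict[str, torch.Tensor],
    alpha: float = 0.5,
) -> Dict[str, torch.Tensor]:
    """Interpolate two checkpoints of the same architecture parameter-wise."""
    assert real_state.keys() == synth_state.keys()
    return {
        name: alpha * real_state[name] + (1.0 - alpha) * synth_state[name]
        for name in real_state
    }


def mix_image_scales(image: Image.Image, base_size: int = 224) -> List[torch.Tensor]:
    """Return one downsampled global view plus a 2x2 grid of high-res crops."""
    to_tensor = transforms.Compose(
        [transforms.Resize((base_size, base_size)), transforms.ToTensor()]
    )
    views = [to_tensor(image)]  # low-resolution view of the whole image
    hi_res = image.resize((2 * base_size, 2 * base_size))
    for row in range(2):
        for col in range(2):
            box = (col * base_size, row * base_size,
                   (col + 1) * base_size, (row + 1) * base_size)
            views.append(transforms.ToTensor()(hi_res.crop(box)))
    return views

Each returned view can be encoded independently and the resulting embeddings concatenated as the LLM's visual tokens, one plausible way to realize the mixing of scales and embeddings that the abstract describes.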

Z. Lin, D. Liu, R. Zhang, P. Gao and L. Qiu—Equal contribution.

P. Gao, Y. Qiao and H. Li—Equal advisory.

P. Gao—Project leader.



Acknowledgements

This project is funded in part by the National Key R&D Program of China (Projects 2022ZD0161100 and 2022ZD0160102), by the National Natural Science Foundation of China (Grant No. 62206272), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd. under the Innovation and Technology Commission (ITC)'s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information


Corresponding author

Correspondence to Peng Gao.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3773 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, Z. et al. (2025). SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15120. Springer, Cham. https://doi.org/10.1007/978-3-031-73033-7_3


  • DOI: https://doi.org/10.1007/978-3-031-73033-7_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73032-0

  • Online ISBN: 978-3-031-73033-7

  • eBook Packages: Computer Science, Computer Science (R0)
