CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15134)

Abstract

Despite recent successes, Large Vision-Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained vision-language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method relies neither on paid-for APIs nor on additional training data or the deployment of external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, rank them by their CLIP image-text similarities, and filter them with a robust rule-based approach to obtain positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant reductions in hallucinations over the baseline models. We also observe better zero-shot classification performance, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
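
To make the pipeline sketched in the abstract concrete, the following Python snippet shows one way such preference data could be built: each candidate caption is scored against its image with CLIP, the highest- and lowest-scoring candidates become the chosen/rejected pair, and the pair feeds the standard DPO objective. This is a minimal illustrative sketch, not the authors' released code: the CLIP checkpoint, the margin threshold standing in for the paper's rule-based filtering, and all helper names are assumptions.

    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    # Hypothetical checkpoint choice; the paper's exact CLIP variant may differ.
    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def clip_scores(image, captions):
        """Image-text similarity for each candidate caption (higher = better match)."""
        inputs = proc(text=captions, images=image, return_tensors="pt",
                      padding=True, truncation=True)
        # logits_per_image has shape (1, num_captions)
        return clip(**inputs).logits_per_image.squeeze(0)

    def build_preference_pair(image, captions, margin=1.0):
        """Pick the best/worst CLIP-ranked captions as a (chosen, rejected) pair.

        The margin threshold is only a stand-in for the paper's rule-based filtering.
        """
        scores = clip_scores(image, captions)
        best, worst = int(scores.argmax()), int(scores.argmin())
        if scores[best] - scores[worst] < margin:
            return None  # ambiguous pair, discard
        return {"chosen": captions[best], "rejected": captions[worst]}

    def dpo_loss(policy_chosen_lp, policy_rejected_lp,
                 ref_chosen_lp, ref_rejected_lp, beta=0.1):
        """Standard DPO objective on summed log-probs of chosen/rejected responses."""
        pi_diff = policy_chosen_lp - policy_rejected_lp
        ref_diff = ref_chosen_lp - ref_rejected_lp
        return -F.logsigmoid(beta * (pi_diff - ref_diff)).mean()

In an actual training loop, the log-probabilities passed to dpo_loss would come from the LVLM being fine-tuned (the policy) and a frozen copy of it (the reference), summed over the response tokens, as in standard DPO training.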




Author information


Corresponding author

Correspondence to Yassine Ouali.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ouali, Y., Bulat, A., Martinez, B., Tzimiropoulos, G. (2025). CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_23


  • DOI: https://doi.org/10.1007/978-3-031-73116-7_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73115-0

  • Online ISBN: 978-3-031-73116-7

  • eBook Packages: Computer Science, Computer Science (R0)
