Abstract
Despite recent successes, Large Vision-Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior work tackling LVLM hallucinations, our method does not rely on paid-for APIs, additional training data, or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, rank them by their CLIP image-text similarity, and filter them with a robust rule-based approach to obtain positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant reductions in hallucinations over the baseline models. We also observe better zero-shot classification performance, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
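To make the data-construction step concrete, the following is a minimal sketch (not the authors' released code) of how candidate captions for an image could be ranked by CLIP image-text similarity and turned into a chosen/rejected pair for DPO. The checkpoint name, score margin, and length-ratio filter are illustrative assumptions rather than the paper's exact rule-based filtering.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"  # assumed checkpoint; any CLIP-style model works
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def clip_scores(image: Image.Image, captions: list[str]) -> list[float]:
    """Cosine similarity between the image embedding and each caption embedding."""
    inputs = processor(text=captions, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    # Normalize embeddings (a no-op if the model already returns unit vectors).
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)   # (1, d)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)     # (N, d)
    return (txt @ img.T).squeeze(-1).tolist()                              # one score per caption

def make_dpo_pair(image: Image.Image, captions: list[str], margin: float = 0.02):
    """Return a (chosen, rejected) caption pair, or None if the ranking is unreliable."""
    scores = clip_scores(image, captions)
    ranked = sorted(zip(scores, captions), reverse=True)
    (best_s, chosen), (worst_s, rejected) = ranked[0], ranked[-1]
    # Illustrative rule-based filtering: demand a minimum score gap and roughly
    # comparable lengths so the pair differs in content rather than verbosity.
    if best_s - worst_s < margin:
        return None
    ratio = len(rejected.split()) / max(len(chosen.split()), 1)
    if not 0.5 < ratio < 2.0:
        return None
    return chosen, rejected
```

Under these assumptions, each surviving pair would then serve as the preferred and dispreferred response in a standard DPO objective computed against a frozen reference model.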