BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLM endows VLMs with strong reasoning ability. This combination lets VLMs achieve high performance on a wide range of benchmarks without fine-tuning, exhibiting zero- or few-shot capability. However, recent studies show that VLMs are vulnerable to hallucination. This undesirable behavior degrades reliability and credibility, leaving users unable to fully trust the output of VLMs. To enhance trustworthiness and better tackle the hallucination of VLMs, we curate a new evaluation dataset, called the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is to manipulate visual scene information with image editing models and to design the metrics based on scene changes. This allows us to clearly assess whether VLMs correctly understand a given scene by observing their ability to perceive changes. We also visualize image-wise object relationships through our two-axis view: vision and text. Upon evaluating VLMs with our dataset, we observe that our metrics reveal different aspects of VLM hallucination that have not been reported before. Project page: https://beafbench.github.io/
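
The abstract names the four metrics but does not define them, so the following is only an illustrative sketch of how before/after answers might be scored. It assumes yes/no questions of the form "Is there a <object>?" asked on both the original and the edited image, and it assumes one plausible reading of the metric names: TU when the model tracks the removal, IG when it is wrong both times, SB when its answer does not change despite the edit, and ID when its answer flips for an object the edit did not touch. The names `AnswerPair`, `classify`, and `rates` are hypothetical, and the paper's actual definitions may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerPair:
    """Yes/no answers to one 'Is there a <object>?' question on both images."""
    before: bool          # model's answer on the original image
    after: bool           # model's answer on the edited image
    object_removed: bool  # True if the question targets the removed object


def classify(pair):
    """Label one answer pair (assumed reading of the metric names)."""
    if pair.object_removed:
        # Ground truth here: present before (yes), absent after (no).
        if pair.before and not pair.after:
            return "true_understanding"
        if not pair.before and pair.after:
            return "ignorance"
        return "stubbornness"  # same answer on both images despite the edit
    # Object untouched by the edit: the answer should stay the same.
    # "consistent" only means unchanged, not necessarily correct.
    return "indecision" if pair.before != pair.after else "consistent"


def rates(pairs):
    """Return TU/IG/SB as fractions of removed-object questions and
    ID as a fraction of untouched-object questions."""
    pairs = list(pairs)
    counts = Counter(classify(p) for p in pairs)
    n_removed = sum(p.object_removed for p in pairs)
    n_kept = len(pairs) - n_removed

    def ratio(count, total):
        return count / total if total else 0.0

    return {
        "TU": ratio(counts["true_understanding"], n_removed),
        "IG": ratio(counts["ignorance"], n_removed),
        "SB": ratio(counts["stubbornness"], n_removed),
        "ID": ratio(counts["indecision"], n_kept),
    }
```

Scoring pairs rather than single answers is the point of the before/after design: a model that always answers "yes" would look correct on the original image alone, but it is exposed once the object is removed.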

Authors contributed equally to this work.

Acknowledgments

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub; No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities; No. RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).

Author information

Corresponding author

Correspondence to Tae-Hyun Oh.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ye-Bin, M., Hyeon-Woo, N., Choi, W., Oh, TH. (2025). BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15069. Springer, Cham. https://doi.org/10.1007/978-3-031-73247-8_14

  • DOI: https://doi.org/10.1007/978-3-031-73247-8_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73246-1

  • Online ISBN: 978-3-031-73247-8

  • eBook Packages: Computer Science, Computer Science (R0)
