Abstract
Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLM endows VLMs with strong reasoning ability. As a result, VLMs achieve high performance on a wide range of benchmarks without fine-tuning, exhibiting zero- or few-shot capability. However, recent studies show that VLMs are vulnerable to hallucination. This undesirable behavior degrades their reliability and credibility, so users cannot fully trust VLM outputs. To enhance trustworthiness and better tackle hallucination in VLMs, we curate a new evaluation dataset, the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is to manipulate visual scene information with image editing models and to design the metrics around the resulting scene changes. This allows us to clearly assess whether VLMs correctly understand a given scene by observing their ability to perceive the changes. We also visualize image-wise object relationships through our two-axis view: vision and text. Evaluating VLMs on our dataset, we observe that our metrics reveal aspects of VLM hallucination that have not been reported before. Project page: https://beafbench.github.io/
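To make the before/after idea concrete, the sketch below shows how change-aware metrics of this kind could be tallied from paired yes/no answers on an original image and its edited counterpart. This is a minimal illustration, not the authors' released evaluation code: the data fields and the simplified TU/IG/SB/ID definitions are assumptions for exposition; the exact definitions are given in the paper.

```python
# Hypothetical sketch of before/after consistency scoring for a VLM.
# Field names and the simplified metric definitions are assumptions.
from dataclasses import dataclass

@dataclass
class QAPair:
    """One object-existence question asked on the original and the edited image."""
    answer_before: str   # VLM answer on the original image ("yes"/"no")
    answer_after: str    # VLM answer on the edited image ("yes"/"no")
    gt_before: str       # ground-truth answer before editing
    gt_after: str        # ground-truth answer after editing (flips if the object was removed)
    object_removed: bool # True if this question targets the removed object

def beaf_style_metrics(pairs):
    """Return simplified TU/IG/SB/ID rates over a list of QAPair items."""
    removed = [p for p in pairs if p.object_removed]
    kept = [p for p in pairs if not p.object_removed]

    def same(a, b):
        return a.strip().lower() == b.strip().lower()

    # True Understanding: correct both before AND after the scene change.
    tu = sum(same(p.answer_before, p.gt_before) and same(p.answer_after, p.gt_after) for p in removed)
    # IGnorance: wrong both before and after.
    ig = sum(not same(p.answer_before, p.gt_before) and not same(p.answer_after, p.gt_after) for p in removed)
    # StuBbornness: keeps the same answer even though the scene changed.
    sb = sum(same(p.answer_before, p.answer_after) for p in removed)
    # InDecision: flips its answer on objects that were NOT changed.
    ind = sum(not same(p.answer_before, p.answer_after) for p in kept)

    n_removed, n_kept = max(len(removed), 1), max(len(kept), 1)
    return {"TU": tu / n_removed, "IG": ig / n_removed,
            "SB": sb / n_removed, "ID": ind / n_kept}

# Toy usage: one question about the removed object, one about an untouched object.
example = [
    QAPair("yes", "no", "yes", "no", object_removed=True),    # counts toward TU
    QAPair("yes", "yes", "yes", "yes", object_removed=False), # consistent, so not ID
]
print(beaf_style_metrics(example))
```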
Authors contributed equally to this work.
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub; No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities; No. RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ye-Bin, M., Hyeon-Woo, N., Choi, W., Oh, TH. (2025). BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15069. Springer, Cham. https://doi.org/10.1007/978-3-031-73247-8_14
DOI: https://doi.org/10.1007/978-3-031-73247-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73246-1
Online ISBN: 978-3-031-73247-8
eBook Packages: Computer Science, Computer Science (R0)