Abstract
Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLM endows VLMs with strong reasoning ability. As a result, VLMs achieve high performance on a wide range of benchmarks without fine-tuning, exhibiting zero- or few-shot capability. However, recent studies show that VLMs are vulnerable to hallucination. This undesirable behavior degrades their reliability and credibility, so users cannot fully trust VLM outputs. To enhance trustworthiness and better tackle hallucination in VLMs, we curate a new evaluation dataset, the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is to manipulate visual scene information with image editing models and to design the metrics around the resulting scene changes. This allows us to clearly assess whether VLMs correctly understand a given scene by observing their ability to perceive the changes. We also visualize image-wise object relationships through our two-axis view: vision and text. Evaluating VLMs on our dataset, we observe that our metrics reveal aspects of VLM hallucination that have not been reported before. Project page: https://beafbench.github.io/
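To make the before/after idea concrete, the sketch below shows how change-aware metrics of this kind could be tallied from paired yes/no answers on an original image and its edited counterpart. This is a minimal illustration, not the authors' released evaluation code: the data fields and the simplified TU/IG/SB/ID definitions are assumptions for exposition; the exact definitions are given in the paper.

```python
# Hypothetical sketch of before/after consistency scoring for a VLM.
# Field names and the simplified metric definitions are assumptions.
from dataclasses import dataclass

@dataclass
class QAPair:
    """One object-existence question asked on the original and the edited image."""
    answer_before: str   # VLM answer on the original image ("yes"/"no")
    answer_after: str    # VLM answer on the edited image ("yes"/"no")
    gt_before: str       # ground-truth answer before editing
    gt_after: str        # ground-truth answer after editing (flips if the object was removed)
    object_removed: bool # True if this question targets the removed object

def beaf_style_metrics(pairs):
    """Return simplified TU/IG/SB/ID rates over a list of QAPair items."""
    removed = [p for p in pairs if p.object_removed]
    kept = [p for p in pairs if not p.object_removed]

    def same(a, b):
        return a.strip().lower() == b.strip().lower()

    # True Understanding: correct both before AND after the scene change.
    tu = sum(same(p.answer_before, p.gt_before) and same(p.answer_after, p.gt_after) for p in removed)
    # IGnorance: wrong both before and after.
    ig = sum(not same(p.answer_before, p.gt_before) and not same(p.answer_after, p.gt_after) for p in removed)
    # StuBbornness: keeps the same answer even though the scene changed.
    sb = sum(same(p.answer_before, p.answer_after) for p in removed)
    # InDecision: flips its answer on objects that were NOT changed.
    ind = sum(not same(p.answer_before, p.answer_after) for p in kept)

    n_removed, n_kept = max(len(removed), 1), max(len(kept), 1)
    return {"TU": tu / n_removed, "IG": ig / n_removed,
            "SB": sb / n_removed, "ID": ind / n_kept}

# Toy usage: one question about the removed object, one about an untouched object.
example = [
    QAPair("yes", "no", "yes", "no", object_removed=True),    # counts toward TU
    QAPair("yes", "yes", "yes", "yes", object_removed=False), # consistent, so not ID
]
print(beaf_style_metrics(example))
```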
Authors contributed equally to this work.
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub; No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities; No. RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ye-Bin, M., Hyeon-Woo, N., Choi, W., Oh, TH. (2025). BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15069. Springer, Cham. https://doi.org/10.1007/978-3-031-73247-8_14
DOI: https://doi.org/10.1007/978-3-031-73247-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73246-1
Online ISBN: 978-3-031-73247-8
eBook Packages: Computer Science, Computer Science (R0)