Abstract
Despite the remarkable success of large vision-language models (LVLMs) on various tasks, their susceptibility to knowledge bias inherited from training data hinders their ability to generalize to new scenarios and limits their real-world applicability. To address this challenge, we propose the Counterfactual Bias-Robust Reasoning (CoBRa) dataset, a novel collection of VQA examples designed to evaluate and mitigate knowledge bias in LVLMs. These examples encourage counterfactual thinking by providing edited knowledge graphs and image content, with detailed annotations of the reasoning processes to facilitate a comprehensive understanding of each example. Building on the dataset, we introduce a Chain of Counterfactual Thought (CoCT) method that learns bias-robust reasoning processes and provides in-context examples demonstrating how existing reasoning generalizes to counterfactual scenarios. This enables LVLMs to reason explicitly, step by step, rather than relying on biased knowledge, leading to more generalizable solutions. Our extensive evaluation demonstrates that CoCT outperforms existing approaches on tasks requiring reasoning under knowledge bias. Our work is available at https://github.com/SuperJohnZhang/CoBRa.
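To make the prompting idea concrete, below is a minimal Python sketch, not the authors' released implementation, of how a CoCT-style prompt could be assembled: each in-context demonstration pairs counterfactually edited facts with an annotated reasoning chain, and the query instructs the model to reason step by step from the edited facts rather than from its prior (possibly biased) knowledge. All names (CounterfactualExample, build_coct_prompt) and the example content are hypothetical.

# Hypothetical sketch of CoCT prompt construction; not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class CounterfactualExample:
    question: str
    edited_facts: List[str]      # counterfactual knowledge-graph triples, verbalized
    reasoning_steps: List[str]   # annotated bias-robust reasoning chain
    answer: str

def build_coct_prompt(examples: List[CounterfactualExample],
                      query: str,
                      query_facts: List[str]) -> str:
    """Concatenate in-context counterfactual demonstrations with the query,
    asking the model to reason only from the edited facts."""
    parts = []
    for ex in examples:
        parts.append("Edited facts: " + "; ".join(ex.edited_facts))
        parts.append("Question: " + ex.question)
        parts.append("Reasoning: " + " ".join(ex.reasoning_steps))
        parts.append("Answer: " + ex.answer + "\n")
    parts.append("Edited facts: " + "; ".join(query_facts))
    parts.append("Question: " + query)
    parts.append("Reasoning (use only the edited facts, step by step):")
    return "\n".join(parts)

if __name__ == "__main__":
    demo = CounterfactualExample(
        question="What season is shown in the image?",
        edited_facts=["the trees in the image are covered in snow",
                      "in this counterfactual world, snow falls only in summer"],
        reasoning_steps=["The image shows snow-covered trees.",
                         "Under the edited fact, snow implies summer, not winter.",
                         "So the season is summer."],
        answer="summer",
    )
    print(build_coct_prompt([demo],
                            "What season is shown?",
                            ["the ground is covered in snow",
                             "snow falls only in summer here"]))

In practice, the edited facts would be verbalized from the CoBRa counterfactual knowledge graphs and the image description supplied by the LVLM's visual encoder; the sketch only illustrates how the demonstrations and query are laid out in the prompt.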
Acknowledgment
This work is supported by NSF Grants 2143197 and 2227450.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y., Jiang, M., Zhao, Q. (2025). Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_19
DOI: https://doi.org/10.1007/978-3-031-73242-3_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73241-6
Online ISBN: 978-3-031-73242-3
eBook Packages: Computer Science, Computer Science (R0)