Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Despite the remarkable success of large vision-language models (LVLMs) on various tasks, their susceptibility to knowledge bias inherited from training data hinders their ability to generalize to new scenarios and limits their real-world applicability. To address this challenge, we propose the Counterfactual Bias-Robust Reasoning (CoBRa) dataset, a novel collection of VQA examples designed to evaluate and mitigate knowledge bias in LVLMs. These examples encourage counterfactual thinking by providing edited knowledge graphs and image content, with detailed annotations of the reasoning processes to facilitate a comprehensive understanding of each example. Building on the dataset, we introduce a Chain of Counterfactual Thought (CoCT) method that learns bias-robust reasoning processes and provides in-context examples demonstrating how existing reasoning generalizes to counterfactual scenarios. This enables LVLMs to reason explicitly, step by step, rather than relying on biased knowledge, leading to more generalizable solutions. Our extensive evaluation demonstrates that CoCT outperforms existing approaches on tasks requiring reasoning under knowledge bias. Our work is available at https://github.com/SuperJohnZhang/CoBRa.
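
To make the prompting pattern the abstract describes more concrete, the sketch below is a minimal illustration, not the authors' implementation (which lives in the paper and the linked repository). It assumes a hypothetical `CoCTExample` record and `build_coct_prompt` helper: each in-context demonstration pairs an edited (counterfactual) knowledge fact with an explicit step-by-step rationale, so the model is asked to reason from the stated facts rather than from its biased prior. All names and example content are illustrative assumptions.

```python
# Illustrative sketch only: NOT the authors' CoCT implementation.
# It assembles a text prompt in which each in-context demonstration pairs a
# counterfactual knowledge edit with an explicit reasoning chain, so the
# model is steered to reason from the stated facts rather than from
# knowledge bias. All names and example content are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class CoCTExample:
    """One in-context demonstration of counterfactual reasoning."""
    question: str               # VQA-style question about an image
    edited_fact: str            # counterfactual edit to the knowledge graph
    rationale_steps: List[str]  # explicit reasoning chain under the edit
    answer: str


def build_coct_prompt(demos: List[CoCTExample],
                      question: str, edited_fact: str) -> str:
    """Concatenate demonstrations and the query into a single prompt."""
    parts = []
    for d in demos:
        steps = "\n".join(f"  Step {i + 1}: {s}"
                          for i, s in enumerate(d.rationale_steps))
        parts.append(
            f"Question: {d.question}\n"
            f"Counterfactual fact: {d.edited_fact}\n"
            f"Reasoning:\n{steps}\n"
            f"Answer: {d.answer}\n"
        )
    # The query repeats the same structure and asks for step-by-step
    # reasoning grounded only in the facts given above.
    parts.append(
        f"Question: {question}\n"
        f"Counterfactual fact: {edited_fact}\n"
        "Reasoning, step by step, using only the facts stated above:"
    )
    return "\n".join(parts)


if __name__ == "__main__":
    demo = CoCTExample(
        question="What would this zebra eat?",
        edited_fact="In this world, zebras are carnivores.",
        rationale_steps=[
            "The image shows a zebra.",
            "The edited fact states zebras are carnivores, overriding the usual prior.",
            "Carnivores eat meat.",
        ],
        answer="Meat.",
    )
    print(build_coct_prompt([demo],
                            "What would this giraffe drink?",
                            "In this world, giraffes drink only seawater."))
```

The design point the sketch tries to capture is that the demonstration's rationale explicitly cites the edited fact at the step where it overrides the model's prior, which is what makes the reasoning chain transferable to new counterfactual scenarios.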



Acknowledgment

This work is supported by NSF Grants 2143197 and 2227450.

Author information

Corresponding author

Correspondence to Yifeng Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 11,098 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Jiang, M., Zhao, Q. (2025). Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15066. Springer, Cham. https://doi.org/10.1007/978-3-031-73242-3_19

  • DOI: https://doi.org/10.1007/978-3-031-73242-3_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73241-6

  • Online ISBN: 978-3-031-73242-3

  • eBook Packages: Computer Science, Computer Science (R0)
