Abstract
Large pre-trained Vision-Language Models (VLMs) have become ubiquitous foundational components of other models and downstream tasks. Although powerful, our empirical results reveal that such models may fail to identify fine-grained concepts: their explanations with respect to fine-grained concepts are entangled and mislocalized. To address this issue, we propose to DisEntAngle and Localize (DEAL) the concept-level explanations of VLMs without human annotations. The key idea is to encourage concept-level explanations to be distinct from one another while remaining consistent with the category-level explanation. We conduct extensive experiments and ablation studies on a wide range of benchmark datasets and vision-language models. Our empirical results demonstrate that the proposed method significantly improves the model's concept-level explanations in terms of disentanglability and localizability. Surprisingly, the improved explainability also reduces the model's reliance on spurious correlations, which further benefits prediction accuracy.
Our source code and pretrained weights are available at https://github.com/tangli-udel/DEAL.
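To make the key idea above concrete, the following is a minimal, hypothetical sketch of such a disentangle-and-localize regularizer in PyTorch. It is not the authors' implementation: how the explanation heatmaps are obtained (e.g., a Grad-CAM-style attribution over concept and category prompts), the names (`deal_regularizer`, `concept_maps`, `category_map`, `lambda_d`, `lambda_c`), and the specific distance measures are all illustrative assumptions.

```python
# Hypothetical sketch, assuming per-concept and category-level explanation
# heatmaps are already computed (e.g., Grad-CAM over text prompts).
import torch
import torch.nn.functional as F

def deal_regularizer(concept_maps: torch.Tensor,   # (K, H, W), one heatmap per concept prompt
                     category_map: torch.Tensor,   # (H, W), heatmap for the category prompt
                     lambda_d: float = 1.0,
                     lambda_c: float = 1.0) -> torch.Tensor:
    K = concept_maps.shape[0]
    flat = F.normalize(concept_maps.flatten(1), dim=1)   # (K, H*W), unit-norm rows

    # Disentanglement: penalize pairwise overlap (cosine similarity)
    # between explanations of different concepts.
    sim = flat @ flat.t()                                 # (K, K)
    off_diag = sim - torch.diag(sim.diag())               # zero out self-similarity
    loss_disentangle = off_diag.sum() / (K * (K - 1))

    # Consistency: the aggregate of concept explanations should stay close
    # to the category-level explanation.
    union = concept_maps.mean(dim=0)                      # (H, W)
    loss_consistency = F.mse_loss(union, category_map)

    return lambda_d * loss_disentangle + lambda_c * loss_consistency

# Example usage with random heatmaps (4 concepts on a 14x14 attribution grid):
if __name__ == "__main__":
    maps = torch.rand(4, 14, 14)
    cat_map = torch.rand(14, 14)
    print(deal_regularizer(maps, cat_map))
```

In a training loop, this term would be added to the usual contrastive or classification loss, so that the concept-level explanations are shaped without any extra human annotation.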
Acknowledgment
This work is supported by the National Science Foundation through the Faculty Early Career Development Program (NSF CAREER) Award NSF-IIS-2340074 and the Department of Defense under the Defense Established Program to Stimulate Competitive Research (DoD DEPSCoR) Award.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, T., Ma, M., Peng, X. (2025). DEAL: Disentangle and Localize Concept-Level Explanations for VLMs. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_22
DOI: https://doi.org/10.1007/978-3-031-72933-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1
eBook Packages: Computer Science; Computer Science (R0)