
Context-Aware Robust Fine-Tuning

International Journal of Computer Vision

Abstract

Contrastive language-image pre-trained (CLIP) models have the zero-shot ability to classify an image as belonging to “\(\mathtt {[CLASS]}\)” by using the similarity between the image and the prompt sentence “a \(\mathtt {[CONTEXT]}\) of \(\mathtt {[CLASS]}\)”. Thanks to exhaustive text cues in “\(\mathtt {[CONTEXT]}\)”, the CLIP model is aware of different contexts, e.g. background, style, and viewpoint, and exhibits unprecedented robustness against a wide range of distribution shifts. However, recent works find that further fine-tuning of CLIP models improves accuracy but sacrifices robustness on downstream tasks. We conduct an empirical investigation showing that fine-tuning corrupts the context-aware ability of pre-trained CLIP features. To solve this problem, we propose Context-Aware Robust Fine-tuning (CAR-FT), which regularizes the model during fine-tuning to retain context information. Specifically, we use zero-shot prompt weights to obtain the context distribution contained in an image. By minimizing the Kullback–Leibler divergence (KLD) between the context distributions induced by the original and fine-tuned CLIP models, CAR-FT transfers the context-aware ability of CLIP to downstream tasks and achieves both higher in-distribution (ID) and out-of-distribution (OOD) accuracy. Experimental results show that CAR-FT achieves superior robustness on five OOD test datasets of ImageNet, while also bringing accuracy gains on nine downstream tasks. Additionally, CAR-FT surpasses previous domain generalization (DG) methods and reaches 78.5% average accuracy on the DomainBed benchmark, establishing a new state of the art.
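The regularizer described above can be summarized in a short PyTorch-style sketch. This is a minimal illustration under assumed interfaces, not the authors' released implementation: names such as frozen_clip, tuned_clip, classifier, and context_text_feats (the zero-shot prompt embeddings of the context phrases, e.g. “a photo of”, “a sketch of”) are placeholders, and hyperparameters like the temperature and KL weight are illustrative.

```python
# Minimal sketch of a CAR-FT-style objective, assuming CLIP-like encoders that
# expose encode_image(). All object names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def context_log_probs(image_feats, context_text_feats, temperature=0.01):
    """Log-softmax over similarities between image features and context prompt embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    context_text_feats = F.normalize(context_text_feats, dim=-1)
    logits = image_feats @ context_text_feats.t() / temperature
    return F.log_softmax(logits, dim=-1)

def car_ft_loss(images, labels, tuned_clip, frozen_clip, classifier,
                context_text_feats, kl_weight=1.0):
    # Standard fine-tuning loss on the downstream classification task.
    tuned_feats = tuned_clip.encode_image(images)
    ce = F.cross_entropy(classifier(tuned_feats), labels)

    # Context distributions induced by the original (frozen) and fine-tuned encoders.
    with torch.no_grad():
        frozen_feats = frozen_clip.encode_image(images)
        p_orig = context_log_probs(frozen_feats, context_text_feats).exp()
    log_p_tuned = context_log_probs(tuned_feats, context_text_feats)

    # KL(p_orig || p_tuned): penalize drift of the fine-tuned context distribution.
    kl = F.kl_div(log_p_tuned, p_orig, reduction="batchmean")
    return ce + kl_weight * kl
```

In words: the frozen pre-trained encoder defines a reference context distribution for each image, and the fine-tuned encoder is penalized whenever its own context distribution drifts away from that reference, while the usual cross-entropy term drives downstream accuracy.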



Data Availability

  • ImageNet: https://www.image-net.org/
  • DomainBed: https://github.com/facebookresearch/DomainBed
  • Flowers: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  • Aircraft: https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  • CIFAR: https://www.cs.toronto.edu/~kriz/cifar.html
  • Pets: https://www.robots.ox.ac.uk/~vgg/data/pets/
  • Cars: https://ai.stanford.edu/~jkrause/cars/car_dataset.html
  • SUN397: https://vision.princeton.edu/projects/2010/SUN/
  • DTD: https://www.robots.ox.ac.uk/~vgg/data/dtd/
  • CLIP: https://github.com/openai/CLIP
  • OpenCLIP: https://github.com/mlfoundations/open_clip

Code Availability

We plan to open-source the code for the community in the future.

Notes

  1. https://github.com/openai/CLIP/blob/main/data/prompts.md.

  2. https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb.


Acknowledgements

This research is supported in part by the National Key Research and Development Program of China under Grant No. 2020AAA0140000.


Author information

Contributions

All authors contributed to the research conception and design. Methodology: [Xiaofeng Mao]; Material preparation: [Xiaofeng Mao], [Yuefeng Chen]; Formal analysis and investigation: [Rong Zhang], [Hui Xue], [Zhao Li]; Writing - original draft preparation: [Xiaofeng Mao]; Writing - review and editing: [Xiaofeng Mao], [Yuefeng Chen], [Xiaojun Jia].

Corresponding author

Correspondence to Xiaofeng Mao.

Ethics declarations

Financial interests

Xiaojun Jia received a Ph.D. stipend from the Institute of Information Engineering, Chinese Academy of Sciences. Xiaofeng Mao, Yuefeng Chen, Rong Zhang, and Hui Xue received salaries from the Alibaba Group. Zhao Li received a salary from Zhejiang University.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Communicated by Oliver Zendel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 157 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mao, X., Chen, Y., Jia, X. et al. Context-Aware Robust Fine-Tuning. Int J Comput Vis 132, 1685–1700 (2024). https://doi.org/10.1007/s11263-023-01951-2

