Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Jihai Zhang¹³,
Xiang Lan¹⁴,
Xiaoye Qu¹⁵,
Yu Cheng¹³,
Mengling Feng¹⁴ &
Bryan Hooi¹⁴

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15141)

Included in the following conference series: European Conference on Computer Vision

Abstract

Self-supervised contrastive learning has proven effective at deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon in which the trained model captures only a limited portion of the information in the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, which degrades downstream task performance, particularly on tasks requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning, which inherently captures a single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are selected exclusively from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves previously well-learned features through cross-stage representation integration, which combines features from all stages to form the final representations. Our comprehensive evaluation demonstrates MCL’s effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements of up to threefold on specific attributes in the recently proposed MMVP benchmark. Code is available at https://github.com/MajorDavidZhang/MCL.git.
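
The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible reading of the abstract. The function names `negative_mask`, `masked_info_nce`, and `integrate` are hypothetical, cluster assignments are assumed to be given (the abstract does not say how clusters are obtained), "the cluster it was assigned to in preceding stages" is read as sharing a cluster in every preceding stage, an InfoNCE-style loss is assumed, and concatenation is assumed as the integration operator.

```python
import math


def negative_mask(assignments, n):
    """Feature-aware negative sampling (assumed reading of the abstract).

    assignments: list of per-stage cluster-id lists, one per preceding stage,
    each of length n. mask[i][j] is True when sample j may serve as a negative
    for anchor i: j != i and j shares i's cluster in every preceding stage.
    With no preceding stages, every other sample is an allowed negative.
    """
    return [[i != j and all(stage[i] == stage[j] for stage in assignments)
             for j in range(n)] for i in range(n)]


def masked_info_nce(view_a, view_b, mask, temperature=0.5):
    """InfoNCE-style loss restricted to the allowed negatives.

    view_a, view_b: lists of embedding vectors for two views of each sample.
    The positive for anchor i is view_b[i]; negatives are view_b[j] where
    mask[i][j] is True. The loss form and temperature are assumptions.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    total = 0.0
    for i, anchor in enumerate(view_a):
        pos = dot(anchor, view_b[i]) / temperature
        logits = [pos] + [dot(anchor, view_b[j]) / temperature
                          for j in range(len(view_b)) if mask[i][j]]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - pos
    return total / len(view_a)


def integrate(stage_embeddings):
    """Cross-stage integration, assumed here to be per-sample concatenation."""
    return [sum((list(stage[i]) for stage in stage_embeddings), [])
            for i in range(len(stage_embeddings[0]))]


# Usage: two preceding stages over four samples.
previous = [[0, 0, 1, 1], [0, 1, 1, 1]]
mask = negative_mask(previous, 4)
views = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
loss = masked_info_nce(views, views, mask)
final = integrate([views, [[0.5], [0.1], [0.2], [0.9]]])
```

In this reading, each new stage narrows the negatives to samples the earlier stages could not tell apart, so the loss can only be reduced by picking up features those stages missed.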

J. Zhang and X. Lan—Equal Contribution

Xiang Lan and Mengling Feng are affiliated with Saw Swee Hock School of Public Health and Institute of Data Science, NUS. Part of this work was done during Jihai Zhang’s internship at Shanghai AI Laboratory.

Notes

1. More detailed dataset description and settings can be found in Sect. 5.

Acknowledgement

This project was funded by the National Research Foundation Singapore under the AI Singapore Programme (Award Numbers: AISG-GC-2019-001-2B and AISG2-TC-2022-004).

Author information

Authors and Affiliations

The Chinese University of Hong Kong, Sha Tin, Hong Kong
Jihai Zhang & Yu Cheng
National University of Singapore, Singapore, Singapore
Xiang Lan, Mengling Feng & Bryan Hooi
Shanghai AI Laboratory, Shanghai, China
Xiaoye Qu

Corresponding authors

Correspondence to Mengling Feng or Bryan Hooi.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 391 KB)

About this paper

Cite this paper

Zhang, J., Lan, X., Qu, X., Cheng, Y., Feng, M., Hooi, B. (2025). Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_3

DOI: https://doi.org/10.1007/978-3-031-73010-8_3
Published: 10 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73009-2
Online ISBN: 978-3-031-73010-8
eBook Packages: Computer Science, Computer Science (R0)
