Abstract
Strong image search models can be learned for a specific domain, i.e., a specific set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even when those tasks cover very different specialized domains. It should also be able to benefit from unlabeled images drawn from these various retrieval tasks. This is the more practical scenario that we consider in this paper. We address it with a proposed approach that starts from a strong pretrained model and adapts it to tackle multiple retrieval tasks concurrently, using only unlabeled images from the different task domains. We extend the pretrained model with multiple independently trained sets of adaptors that use pseudo-label sets of different sizes, effectively mimicking different pseudo-granularities. We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers, which we guide by propagating pseudo-granularity attentions across neighbors in the feature space. Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised model improves the zero-shot performance of a state-of-the-art self-supervised learning model and, in some cases, reaches or even surpasses a task-label-aware oracle that selects the most fitting pseudo-granularity per task.
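The core idea of mimicking different pseudo-granularities can be illustrated with a minimal sketch (not the authors' implementation): cluster unlabeled features from a pretrained backbone with k-means at several vocabulary sizes, and treat each clustering as one pseudo-label set that could supervise one adaptor set. All function names and the vocabulary sizes below are illustrative assumptions.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    # Plain Lloyd's k-means: returns one pseudo-label per feature vector.
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature to its nearest center (squared L2 distance).
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute each center as the mean of its assigned features.
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels

def multi_granularity_pseudo_labels(feats, vocab_sizes=(8, 64, 512)):
    # One pseudo-label set per vocabulary size: small vocabularies mimic
    # coarse class labels, large ones mimic fine-grained or instance labels.
    return {k: kmeans(feats, k) for k in vocab_sizes}

# Toy usage: 1000 random 16-D stand-ins for pretrained backbone features.
feats = np.random.default_rng(1).normal(size=(1000, 16))
pseudo = multi_granularity_pseudo_labels(feats, vocab_sizes=(4, 16))
```

In this sketch, each entry of `pseudo` would play the role of a distinct pseudo-granularity for training one adaptor set; the paper's contribution lies in reconciling those adaptor sets afterwards with attention-guided fusion layers.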
Notes
- 1.
We use the term classes to refer to sets of images with the same label, whether the latter represents object instances or fine-grained classes.
- 2.
We chose DINO over the CLIP [45] model because the training set of CLIP is not public; the data in MRT might therefore be part of its 400M image-text training pairs.
References
Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. Tech. rep., Stanford (2006)
Avrithis, Y., Kalantidis, Y.: Approximate gaussian mixtures for large scale vocabularies. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 15–28. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_2
Avrithis, Y., Kalantidis, Y., Anagnostopoulos, E., Emiris, I.Z.: Web-scale image clustering revisited. In: Proceedings of ICCV (2015)
Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
Boudiaf, M., et al.: A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 548–564. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_33
Cao, B., Araujo, A., Sim, J.: Unifying deep local and global features for image search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 726–743. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_43
Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 139–156. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_9
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: Proceedings of NeurIPS (2020)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of ICCV (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of ICML (2020)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Proceedings of CVPR (2005)
Csurka, G. (ed.): Domain Adaptation in Computer Vision Applications. ACVPR, Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58347-1
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of CVPR (2009)
Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of CVPR (2019)
Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of ICLR (2021)
Elezi, I., Vascon, S., Torcinovich, A., Pelillo, M., Leal-Taixé, L.: The group loss for deep metric learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 277–294. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_17
Fehervari, I., Ravichandran, A., Appalaraju, S.: Unbiased evaluation of deep metric learning algorithms. arXiv preprint arXiv:1911.12528 (2019)
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: Proceedings of ICLR (2018)
Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: Learning global representations for image search. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 241–257. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_15
Goyal, P., et al.: Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988 (2021)
Grill, J.B., et al.: Bootstrap your own latent: A new approach to self-supervised learning. In: Proceedings of NeurIPS (2020)
Gu, G., Ko, B.: Symmetrical synthesis for deep metric learning. In: Proceedings of AAAI (2020)
Gu, G., Ko, B., Kim, H.G.: Proxy synthesis: Learning with synthetic classes for deep metric learning. In: Proceedings of AAAI (2021)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of CVPR (2020)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (gelus). arXiv:1606.08415 (2016)
Hershey, J.R., Chen, Z., Le Roux, J., Watanabe, S.: Deep clustering: Discriminative embeddings for segmentation and separation. In: Proceedings of ICASSP (2016)
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: Proceedings of ICML (2019)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of ICML (2021)
Kalantidis, Y., Lassance, C., Almazán, J., Larlus, D.: TLDR: Twin learning for dimensionality reduction. In: TMLR (2022)
Ko, B., Gu, G.: Embedding expansion: Augmentation in embedding space for deep metric learning. In: Proceedings of CVPR (2020)
Ko, B., Gu, G., Kim, H.G.: Learning with memory-based virtual classes for deep metric learning. In: Proceedings of ICCV (2021)
Krause, J., Deng, J., Stark, M., Li, F.F.: Collecting a large-scale dataset of fine-grained cars. In: Proceedings of ICCV-W (2013)
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of CVPR (2017)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv:1306.5151 (2013)
McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 104–169 (1989)
Musgrave, K., Belongie, S., Lim, S.-N.: A metric learning reality check. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 681–699. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_41
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of ICCVGIP (2008)
Noroozi, M., Favaro, P.: Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of CVPR (2016)
Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., Gurevych, I.: AdapterFusion: Non-destructive task composition for transfer learning. In: Proceedings of EACL (2021)
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of CVPR (2007)
Philip, J., Berard, A., Gallé, M., Besacier, L.: Monolingual adapters for zero-shot neural machine translation. In: Proceedings of EMNLP (2020)
Puigcerver, J., et al.: Scalable transfer learning with expert models. In: Proceedings of ICLR (2021)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Proceedings of ICML (2021)
Rebuffi, S.A., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: Proceedings of NeurIPS (2017)
Revaud, J., Almazán, J., Rezende, R., de Souza, C.: Learning with average precision: Training image retrieval with a listwise loss. In: Proceedings of ICCV (2019)
Riquelme, C., et al.: Scaling vision with sparse mixture of experts. In: Proceedings of NeurIPS (2021)
Sariyildiz, M.B., Kalantidis, Y., Larlus, D., Alahari, K.: Concept generalization in visual representation learning. In: Proceedings of ICCV (2021)
Seidenschwarz, J., Elezi, I., Leal-Taixé, L.: Learning intra-batch connections for deep metric learning. In: Proceedings of ICML (2021)
Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: Proceedings of CVPR-W (2014)
Shazeer, N., et al.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In: Proceedings of ICLR (2017)
Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Proceedings of NeurIPS (2016)
Tian, Y., Henaff, O.J., van den Oord, A.: Divide and contrast: Self-supervised learning from uncurated data. In: Proceedings of ICCV (2021)
Vaswani, A., et al.: Attention is all you need. In: Proceedings of NeurIPS (2017)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. rep., California Institute of Technology (2011)
Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Signal Process. Lett. 25, 926–930 (2018)
Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: L2 hypersphere embedding for face verification. In: Proceedings of ACM Multimedia (2017)
Wang, R., et al.: K-adapter: Infusing knowledge into pre-trained models with adapters. In: ACL/IJCNLP (Findings) (2021)
Yan, X., Misra, I., Gupta, A., Ghadiyaram, D., Mahajan, D.: Clusterfit: Improving generalization of visual representations. In: Proceedings of CVPR (2020)
You, Y., Gitman, I., Ginsburg, B.: Large batch training of convolutional networks. In: Proceedings of ICLR (2018)
Yuksel, S.E., Wilson, J.N., Gader, P.D.: Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193 (2012)
Zamir, A., Sax, A., Shen, W., Guibas, L., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: Proceedings of CVPR (2018)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: Proceedings of ICML (2021)
Zhai, A., Wu, H.Y.: Classification is a strong baseline for deep metric learning. In: Proceedings of BMVC (2019)
Zhai, X., et al.: LiT: Zero-shot transfer with locked-image text tuning. In: Proceedings of CVPR (2022)
Acknowledgements
MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Almazán, J., Ko, B., Gu, G., Larlus, D., Kalantidis, Y. (2022). Granularity-Aware Adaptation for Image Retrieval Over Multiple Tasks. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19780-2
Online ISBN: 978-3-031-19781-9
eBook Packages: Computer Science (R0)