Abstract
Recently, fine-grained image retrieval (FGIR) has attracted increasing attention in computer vision. Most advanced retrieval algorithms in this field focus on the construction of loss functions and the design of hard sample mining strategies. In this paper, we improve FGIR performance from another perspective and propose an attention mechanism and context information constraints-based image retrieval (AMCICIR) method. It first applies an attention learning mechanism to gradually refine object localization and extract useful local features from coarse to fine. It then uses an improved graph convolutional network (GCN), whose adjacency matrix is dynamically adjusted according to the current features and the model's retrieval performance during learning, to model the internal semantic interactions of the learned local features and thus obtain a more discriminative, fine-grained image representation. Finally, extensive experiments are conducted on two fine-grained image datasets, CUB-200-2011 and Cars-196, and the results show that the AMCICIR algorithm outperforms previous state-of-the-art works remarkably.
Data Availability
The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/veeralakrishna/200-bird-species-with-11788-images and https://www.kaggle.com/datasets/ryanholbrook/cars196.
References
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology
Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops, pp 554–561
Khosla A, Jayadevaprakash N, Yao B, Li F-F (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In: Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol 2. Citeseer
Zhang X, Wang S, Li Z, Ma S (2017) Landmark image retrieval by jointing feature refinement and multimodal classifier learning. IEEE Trans Cybern 48(6):1682–1695
D’Innocente A, Garg N, Zhang Y, Bazzani L, Donoser M (2021) Localized triplet loss for fine-grained fashion image retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3910–3915
Dubey SR, Roy SK, Chakraborty S, Mukherjee S, Chaudhuri BB (2020) Local bit-plane decoded convolutional neural network features for biomedical image retrieval. Neural Comput Appl 32(11):7539–7551
Radenović F, Tolias G, Chum O (2018) Fine-tuning CNN image retrieval with no human annotation. IEEE Trans Pattern Anal Mach Intell 41(7):1655–1668
Kim S, Seo M, Laptev I, Cho M, Kwak S (2019) Deep metric learning beyond binary supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2288–2297
Zeng X, Liu S, Wang X, Zhang Y, Chen K, Li D (2021) Hard decorrelated centralized loss for fine-grained image retrieval. Neurocomputing 453:26–37
Wang X, Han X, Huang W, Dong D, Scott MR (2019) Multi-similarity loss with general pair weighting for deep metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5022–5030
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Wang W, Cui Y, Li G, Jiang C, Deng S (2020) A self-attention-based destruction and construction learning fine-grained image classification method for retail product recognition. Neural Comput Appl 32(18):14613–14622
Sa L, Yu C, Ma X, Zhao X, Xie T (2022) Attentive fine-grained recognition for cross-domain few-shot classification. Neural Comput Appl 34(6):4733–4746
Lin H, Song Y, Zeng Z, Wang W, Wang J (2021) Aggregating object features based on attention weights for fine-grained image retrieval. In: 2020 25th international conference on pattern recognition (ICPR). IEEE, pp 2838–2844
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer, pp 213–229
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2020) An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, pp 6000–6010
Hu T, Qi H, Huang Q, Lu Y (2019) See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891
Ranjan N, Mundada K, Phaltane K, Ahmad S (2016) A survey on techniques in NLP. Int J Comput Appl 134(8):6–9
Zhang Y, Yu X, Cui Z, Wu S, Wen Z, Wang L (2020) Every document owns its structure: inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826
Liu X, You X, Zhang X, Wu J, Lv P (2020) Tensor graph convolutional networks for text classification. In: Proceedings of the AAAI conference on artificial intelligence, pp 8409–8416
Tu M, Wang G, Huang J, Tang Y, He X, Zhou B (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. arXiv preprint arXiv:1905.07374
Visin F, Ciccone M, Romero A, Kastner K, Cho K, Bengio Y, Matteucci M, Courville A (2016) Reseg: a recurrent neural network-based model for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 41–48
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
Yuan Y, Chen X, Wang J (2019) Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065
Zhou B, Liu X, Liu Y, Huang Y, Liò P, Wang Y (2021) Spectral transform forms scalable transformer. arXiv preprint arXiv:2111.07602
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: Proceedings of the 30th international conference on neural information processing systems, pp 3844–3852
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: 32nd AAAI conference on artificial intelligence
Chen Z, Li S, Yang B, Li Q, Liu H (2021) Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 1113–1122
Gao J, Zhang T, Xu C (2019) I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8303–8311
Hu T, Xu J, Huang C, Qi H, Huang Q, Lu Y (2018) Weakly supervised bilinear attention network for fine-grained visual classification. arXiv preprint arXiv:1808.02152
Cao G, Zhu Y, Lu X (2021) Fine-grained image retrieval via multiple part-level feature ensemble. In: 2021 IEEE international conference on multimedia and expo (ICME). IEEE, pp 1–6
Oh Song H, Xiang Y, Jegelka S, Savarese S (2016) Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4004–4012
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 2. IEEE, pp 1735–1742
Hu J, Lu J, Tan Y-P (2014) Discriminative deep metric learning for face verification in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1875–1882
Sohn K (2016) Improved deep metric learning with multi-class n-pair loss objective. In: Advances in neural information processing systems, pp 1857–1865
Movshovitz-Attias Y, Toshev A, Leung TK, Ioffe S, Singh S (2017) No fuss distance metric learning using proxies. In: Proceedings of the IEEE international conference on computer vision, pp 360–368
Wu C-Y, Manmatha R, Smola AJ, Krahenbuhl P (2017) Sampling matters in deep embedding learning. In: Proceedings of the IEEE international conference on computer vision, pp 2840–2848
Roth K, Brattoli B, Ommer B (2019) Mic: mining interclass characteristics for improved metric learning. In: Proceedings of the IEEE international conference on computer vision, pp 8000–8009
Yuan Y, Yang K, Zhang C (2017) Hard-aware deeply cascaded embedding. In: Proceedings of the IEEE international conference on computer vision, pp 814–823
Opitz M, Waltner G, Possegger H, Bischof H (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE Trans Pattern Anal Mach Intell
Kim W, Goyal B, Chawla K, Lee J, Kwon K (2018) Attention-based ensemble for deep metric learning. In: Proceedings of the European conference on computer vision, pp 736–751
Ge W (2018) Deep metric learning with hierarchical triplet loss. In: Proceedings of the European conference on computer vision, pp 269–285
Zheng X, Ji R, Sun X, Zhang B, Wu Y, Huang F (2019) Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 9291–9298
Zeng X, Zhang Y, Wang X, Chen K, Li D, Yang W (2020) Fine-grained image retrieval via piecewise cross entropy loss. Image Vis Comput 93:103820
Kim S, Kim D, Cho M, Kwak S (2020) Proxy anchor loss for deep metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3238–3247
Seidenschwarz JD, Elezi I, Leal-Taixé L (2021) Learning intra-batch connections for deep metric learning. In: International conference on machine learning. PMLR, pp 9410–9421
Wei X-S, Luo J-H, Wu J, Zhou Z-H (2017) Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans Image Process 26(6):2868–2881
Zheng X, Ji R, Sun X, Wu Y, Huang F, Yang Y (2018) Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In: IJCAI, pp 1226–1233
Acknowledgements
This work was supported by the Natural Science Foundation of China under the Grant 62071171.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
A.1 Relaxing \(D_{AS}\) to \(D'_{AS}\)
The reasons for relaxing \(D_{AS}\) to \(D'_{AS}\) are as follows:
(1) There is a strong correlation between \(\max\) and log-sum-exp:
$$\exp \left( \max_k d(x^1_k, x^2_k) \right) \le \sum_{k=1}^{K} \exp \left( d(x^1_k, x^2_k) \right) \le K \exp \left( \max_k d(x^1_k, x^2_k) \right)$$ (A1)

$$\Rightarrow \max_k \{ d(x^1_k, x^2_k) \} \le \log \sum_{k=1}^{K} \exp \left( d(x^1_k, x^2_k) \right) \le \max_k \{ d(x^1_k, x^2_k) \} + \log K$$ (A2)

$$\Rightarrow D_{AS}(X_1, X_2) \le D'_{AS}(X_1, X_2) \le D_{AS}(X_1, X_2) + \log K$$ (A3)

Therefore, \(D'_{AS}(X_1, X_2)\) can be used to approximate \(D_{AS}(X_1, X_2)\).
(2) Compared with \(D_{AS}(X_1, X_2)\), \(D'_{AS}(X_1, X_2)\) is a smooth function, which makes the gradient computation during neural network training more robust.
(3) In the gradient backpropagation of model training, \(D'_{AS}\) provides richer gradient information than \(D_{AS}\):
$$\frac{\partial D_{AS}(X_1, X_2)}{\partial d(x^1_k, x^2_k)} = \begin{cases} 0, & d(x^1_k, x^2_k) \ne \max_i d(x^1_i, x^2_i) \\ 1, & d(x^1_k, x^2_k) = \max_i d(x^1_i, x^2_i) \end{cases}$$ (A4)

$$\frac{\partial D'_{AS}(X_1, X_2)}{\partial d(x^1_k, x^2_k)} = \frac{\exp(d(x^1_k, x^2_k))}{\sum_{i=1}^{K} \exp(d(x^1_i, x^2_i))}$$ (A5)

During training, \(D_{AS}\) updates only the parameters of the part feature with the largest distance between the two images and ignores all others, which hampers the learning of the retrieval network. \(D'_{AS}\) updates the parameters of all part features and therefore trains faster than \(D_{AS}\).
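The bounds in Eq. (A2) are easy to check numerically. Below is a minimal NumPy sketch (our illustration, not from the paper; the part distances \(d(x^1_k, x^2_k)\) are random placeholders) verifying that the log-sum-exp of the part distances always lies between their maximum and the maximum plus \(\log K\):

```python
import numpy as np

K = 8                                 # number of part features (placeholder value)
rng = np.random.default_rng(0)
d = rng.uniform(0.0, 2.0, size=K)     # stand-in for the part distances d(x^1_k, x^2_k)

d_max = d.max()                                   # the hard max used by D_AS
d_lse = d_max + np.log(np.exp(d - d_max).sum())   # stable log-sum-exp, as in D'_AS

# Eq. (A2): max <= log-sum-exp <= max + log K
assert d_max <= d_lse <= d_max + np.log(K)
print(f"max={d_max:.4f}  logsumexp={d_lse:.4f}  max+logK={d_max + np.log(K):.4f}")
```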
A.2 AMCICIR using different loss functions and hard sample mining strategies
We also explore the effectiveness of the AMCICIR method with different loss functions and different hard sample mining strategies. In this section, we evaluate the following four setups of the AMCICIR method. AMCICIR\(_T\): training the AMCIC model with the triplet loss function. AMCICIR\(_N\): training the AMCIC model with the N-pair loss function. AMCICIR\(_T^S\): training the AMCIC model with the triplet loss function and the softhard sample mining strategy. AMCICIR\(_M^D\): training the AMCIC model with the margin loss function and the distance sample mining strategy. The backbone network of all four algorithms is ResNet50. Table 4 shows their performance on CUB-200-2011. We can see that adding a loss function and a hard sample mining strategy further improves the retrieval ability of our method: the more effective the loss function and the mining strategy are, the better the retrieval results. In particular, AMCICIR\(_M^D\), which combines the margin loss function with the distance sample mining strategy, reaches a Recall@1–2–4–8 of 70.9–81.4–89.1–93.6 on CUB-200-2011, outperforming previous state-of-the-art works remarkably.
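For concreteness, the sketch below shows one standard form of such a loss/mining combination: a triplet loss with in-batch hard mining, in NumPy. This is generic deep metric learning practice rather than the authors' exact implementation; the margin value and the mining rule (farthest positive, closest negative) are illustrative assumptions.

```python
import numpy as np

def pairwise_dist(emb):
    """Euclidean distance matrix between all embeddings in a batch."""
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.sqrt(np.maximum(d2, 0.0))

def triplet_loss_hard(emb, labels, margin=0.2):
    """Triplet loss with in-batch hard mining: for each anchor, take the
    farthest positive and the closest negative (illustrative rule)."""
    d = pairwise_dist(emb)
    idx = np.arange(len(labels))
    losses = []
    for i in idx:
        pos_mask = (labels == labels[i]) & (idx != i)
        neg_mask = labels != labels[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                  # anchor needs >= 1 positive and 1 negative
        hardest_pos = d[i][pos_mask].max()
        hardest_neg = d[i][neg_mask].min()
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0

# Toy usage: a batch of 8 embeddings from 2 classes
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(triplet_loss_hard(emb, labels))
```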
A.3 The reason for choosing the reward function Eq. (10)
In this paper, the choice of the reward function (\(\gamma^{(t)} = \frac{1}{2}\ln \frac{r^{(t)}}{1-r^{(t)}}\)) is inspired by the AdaBoost algorithm. In AdaBoost, the weight of the m-th weak classifier is computed as
$$\alpha_m = \frac{1}{2}\ln \frac{1-e_m}{e_m}$$
where \(e_m\) represents the classification error rate on the training dataset. When \(e_m \le \frac{1}{2}\), we have \(\alpha_m \ge 0\), and \(\alpha_m\) increases as \(e_m\) decreases. Therefore, a weak classifier with a smaller classification error plays a more important role in the final classifier. Analogously, for Eq. (10) in our paper, when \(r^{(t)} \ge \frac{1}{2}\), we have \(\gamma^{(t)} \ge 0\), and \(\gamma^{(t)}\) increases as \(r^{(t)}\) increases. This means that a higher recall@1 has a more dominant effect on the model training, and vice versa.
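A direct Python transcription of Eq. (10) makes the analogy concrete (the clamp `eps` is our addition, to keep the logarithm finite as \(r^{(t)}\) approaches 0 or 1; it is not specified in the paper):

```python
import math

def reward(recall_at_1, eps=1e-7):
    """gamma^(t) = 0.5 * ln(r / (1 - r)) from Eq. (10); eps clamps r away
    from 0 and 1 (our addition, not specified in the paper)."""
    r = min(max(recall_at_1, eps), 1.0 - eps)
    return 0.5 * math.log(r / (1.0 - r))

# gamma is 0 at r = 0.5, positive above and negative below -- exactly the
# AdaBoost weight alpha_m = 0.5 * ln((1 - e_m) / e_m) with e_m = 1 - r.
for r in (0.3, 0.5, 0.7, 0.9):
    print(f"r = {r:.1f} -> gamma = {reward(r):+.4f}")
```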
A.4 The recall of AMCICIR at different training iterations
In our experiments, we also performed an ablation study on the batch size, setting it to 8, 16, 24 and 32, respectively. We found that as the batch size increases, the model converges faster and oscillates less. In addition, the FGIR papers we compare with, which use the same datasets as ours, often use large batch sizes; for example, the batch sizes of HTL [47], MPFE [33] and DGCRL [48] are 650, 80 and 60, respectively. Generally speaking, within a reasonable range, a large batch size helps model training. However, the experiments in this paper were carried out on two Tesla P100 GPUs; due to this hardware limitation, the batch size could be set to at most 32. Analyzing the decline curve of the loss function during training, we found no obvious overfitting, so the batch size in our experiments is set to 32. Figure 11 shows the trend of the recall of our AMCICIR method at different training iterations on CUB-200-2011.
During training, at each iteration AMCICIR calculates recall@1 and uses it to dynamically update the adjacency matrix of the GCN, which completes one iteration. At the same time, on the test set, we evaluate the model obtained after each epoch (about 184 training iterations) and compute recall@1, recall@2, recall@4 and recall@8. From Fig. 11, we can see that, on both the training set and the test set, the retrieval performance of the model improves gradually as the number of training iterations increases, and the initial oscillations flatten out.
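The per-iteration update described above might look roughly as follows. The paper's exact update rule is not reproduced here, so this sketch makes an illustrative assumption: the adjacency matrix is moved toward the cosine similarity of the current node features, with a step size scaled by the reward \(\gamma^{(t)}\) so that higher recall@1 drives a larger update.

```python
import numpy as np

def cosine_adjacency(feats):
    """Adjacency candidate from cosine similarity of the current node
    features (assumed form; the paper builds A from learned features)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def update_adjacency(a_prev, feats, gamma, lr=0.1):
    """Illustrative dynamic update: blend the previous adjacency with the
    current feature similarity, with the mix weight squashed by a sigmoid
    of the reward gamma. This is a sketch, not the paper's exact rule."""
    step = lr / (1.0 + np.exp(-gamma))   # in (0, lr)
    return (1.0 - step) * a_prev + step * cosine_adjacency(feats)

# Toy usage: 6 graph nodes (part features) of dimension 32
rng = np.random.default_rng(0)
a = np.eye(6)
feats = rng.normal(size=(6, 32))
gamma = 0.5 * np.log(0.7 / 0.3)          # reward at recall@1 = 0.7
a = update_adjacency(a, feats, gamma)
print(a.round(3))
```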
A.5 Comparison using more evaluation protocols
To demonstrate the superiority of our AMCICIR method from multiple perspectives, we also use another evaluation protocol, Precision@K, to assess its performance more comprehensively. For fairness, we only compare our method with previous state-of-the-art FGIR methods that have published the corresponding precision results or source code, namely SCDA [52], CRL [53] and DGCRL [48]. In this experiment, we compare these FGIR methods using Recall@K (with K = 1, 2, 4 and 8) and Precision@K (with K = 1, 5 and 10). The settings of this experiment remain the same as in Sect. 4 (Experiments and Results), and the backbone network of all methods is ResNet50. Table 5 shows the performance of each method on CUB-200-2011. The experimental results show that our method outperforms the other previous state-of-the-art FGIR methods in terms of both recall and precision, which further proves its superiority.
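For reference, both protocols follow the usual definitions: Recall@K counts the queries with at least one same-class image among the top K retrieved results, while Precision@K averages the fraction of the top K results sharing the query's class. A minimal NumPy sketch (our illustration, with placeholder embeddings and labels):

```python
import numpy as np

def retrieval_metrics(emb, labels, ks_recall=(1, 2, 4, 8), ks_prec=(1, 5, 10)):
    """Recall@K and Precision@K under leave-one-out retrieval."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine similarity
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)            # never retrieve the query itself
    ranks = np.argsort(-sim, axis=1)          # neighbors sorted by similarity
    hits = labels[ranks] == labels[:, None]   # per-rank relevance matrix
    recall = {k: float(hits[:, :k].any(axis=1).mean()) for k in ks_recall}
    precision = {k: float(hits[:, :k].mean()) for k in ks_prec}
    return recall, precision

# Toy usage with random placeholder embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
labels = rng.integers(0, 10, size=100)
print(retrieval_metrics(emb, labels))
```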
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Ma, J. Fine-grained image retrieval by combining attention mechanism and context information. Neural Comput & Applic 35, 1881–1897 (2023). https://doi.org/10.1007/s00521-022-07873-3