
Fine-grained image retrieval by combining attention mechanism and context information

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Recently, fine-grained image retrieval (FGIR) has become a hot topic in computer vision. Most advanced retrieval algorithms in this field focus mainly on the construction of loss functions and the design of hard sample mining strategies. In this paper, we improve the performance of FGIR from another perspective and propose an attention mechanism and context information constraints-based image retrieval (AMCICIR) method. It first applies an attention learning mechanism to gradually refine object localization and extract useful local features from coarse to fine. Then, it uses an improved graph convolutional network (GCN), in which the adjacency matrix is dynamically adjusted according to the current features and the model's retrieval performance during learning, to model the internal semantic interactions among the learned local features and thus obtain a more discriminative, fine-grained image representation. Finally, extensive experiments are conducted on two fine-grained image datasets, CUB-200-2011 and Cars-196, and the results show that the AMCICIR algorithm outperforms previous state-of-the-art works remarkably.




Data Availability

The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/veeralakrishna/200-bird-species-with-11788-images and https://www.kaggle.com/datasets/ryanholbrook/cars196.

References

  1. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology

  2. Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops, pp 554–561

  3. Khosla A, Jayadevaprakash N, Yao B, Li F-F (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In: Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol 2. Citeseer

  4. Zhang X, Wang S, Li Z, Ma S (2017) Landmark image retrieval by jointing feature refinement and multimodal classifier learning. IEEE Trans Cybern 48(6):1682–1695

  5. D’Innocente A, Garg N, Zhang Y, Bazzani L, Donoser M (2021) Localized triplet loss for fine-grained fashion image retrieval. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3910–3915

  6. Dubey SR, Roy SK, Chakraborty S, Mukherjee S, Chaudhuri BB (2020) Local bit-plane decoded convolutional neural network features for biomedical image retrieval. Neural Comput Appl 32(11):7539–7551

  7. Radenović F, Tolias G, Chum O (2018) Fine-tuning CNN image retrieval with no human annotation. IEEE Trans Pattern Anal Mach Intell 41(7):1655–1668

  8. Kim S, Seo M, Laptev I, Cho M, Kwak S (2019) Deep metric learning beyond binary supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2288–2297

  9. Zeng X, Liu S, Wang X, Zhang Y, Chen K, Li D (2021) Hard decorrelated centralized loss for fine-grained image retrieval. Neurocomputing 453:26–37

  10. Wang X, Han X, Huang W, Dong D, Scott MR (2019) Multi-similarity loss with general pair weighting for deep metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5022–5030

  11. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141

  12. Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19

  13. Wang W, Cui Y, Li G, Jiang C, Deng S (2020) A self-attention-based destruction and construction learning fine-grained image classification method for retail product recognition. Neural Comput Appl 32(18):14613–14622

  14. Sa L, Yu C, Ma X, Zhao X, Xie T (2022) Attentive fine-grained recognition for cross-domain few-shot classification. Neural Comput Appl 34(6):4733–4746

  15. Lin H, Song Y, Zeng Z, Wang W, Wang J (2021) Aggregating object features based on attention weights for fine-grained image retrieval. In: 2020 25th international conference on pattern recognition (ICPR). IEEE, pp 2838–2844

  16. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer, pp 213–229

  17. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2020) An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  18. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, pp 6000–6010

  19. Hu T, Qi H, Huang Q, Lu Y (2019) See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891

  20. Ranjan N, Mundada K, Phaltane K, Ahmad S (2016) A survey on techniques in NLP. Int J Comput Appl 134(8):6–9

  21. Zhang Y, Yu X, Cui Z, Wu S, Wen Z, Wang L (2020) Every document owns its structure: inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826

  22. Liu X, You X, Zhang X, Wu J, Lv P (2020) Tensor graph convolutional networks for text classification. In: Proceedings of the AAAI conference on artificial intelligence, pp 8409–8416

  23. Tu M, Wang G, Huang J, Tang Y, He X, Zhou B (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. arXiv preprint arXiv:1905.07374

  24. Visin F, Ciccone M, Romero A, Kastner K, Cho K, Bengio Y, Matteucci M, Courville A (2016) Reseg: a recurrent neural network-based model for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 41–48

  25. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848

  26. Yuan Y, Chen X, Wang J (2019) Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065

  27. Zhou B, Liu X, Liu Y, Huang Y, Liò P, Wang Y (2021) Spectral transform forms scalable transformer. arXiv preprint arXiv:2111.07602

  28. Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: Proceedings of the 30th international conference on neural information processing systems, pp 3844–3852

  29. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: 32nd AAAI conference on artificial intelligence

  30. Chen Z, Li S, Yang B, Li Q, Liu H (2021) Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 1113–1122

  31. Gao J, Zhang T, Xu C (2019) I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8303–8311

  32. Hu T, Xu J, Huang C, Qi H, Huang Q, Lu Y (2018) Weakly supervised bilinear attention network for fine-grained visual classification. arXiv preprint arXiv:1808.02152

  33. Cao G, Zhu Y, Lu X (2021) Fine-grained image retrieval via multiple part-level feature ensemble. In: 2021 IEEE international conference on multimedia and expo (ICME). IEEE, pp 1–6

  34. Oh Song H, Xiang Y, Jegelka S, Savarese S (2016) Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4004–4012

  35. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2014) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  36. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

  37. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252

  38. Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 2. IEEE, pp 1735–1742

  39. Hu J, Lu J, Tan Y-P (2014) Discriminative deep metric learning for face verification in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1875–1882

  40. Sohn K (2016) Improved deep metric learning with multi-class n-pair loss objective. In: Advances in neural information processing systems, pp 1857–1865

  41. Movshovitz-Attias Y, Toshev A, Leung TK, Ioffe S, Singh S (2017) No fuss distance metric learning using proxies. In: Proceedings of the IEEE international conference on computer vision, pp 360–368

  42. Wu C-Y, Manmatha R, Smola AJ, Krahenbuhl P (2017) Sampling matters in deep embedding learning. In: Proceedings of the IEEE international conference on computer vision, pp 2840–2848

  43. Roth K, Brattoli B, Ommer B (2019) Mic: mining interclass characteristics for improved metric learning. In: Proceedings of the IEEE international conference on computer vision, pp 8000–8009

  44. Yuan Y, Yang K, Zhang C (2017) Hard-aware deeply cascaded embedding. In: Proceedings of the IEEE international conference on computer vision, pp 814–823

  45. Opitz M, Waltner G, Possegger H, Bischof H (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE Trans Pattern Anal Mach Intell

  46. Kim W, Goyal B, Chawla K, Lee J, Kwon K (2018) Attention-based ensemble for deep metric learning. In: Proceedings of the European conference on computer vision, pp 736–751

  47. Ge W (2018) Deep metric learning with hierarchical triplet loss. In: Proceedings of the European conference on computer vision, pp 269–285

  48. Zheng X, Ji R, Sun X, Zhang B, Wu Y, Huang F (2019) Towards optimal fine-grained retrieval via decorrelated centralized loss with normalize-scale layer. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 9291–9298

  49. Zeng X, Zhang Y, Wang X, Chen K, Li D, Yang W (2020) Fine-grained image retrieval via piecewise cross entropy loss. Image Vis Comput 93:103820

  50. Kim S, Kim D, Cho M, Kwak S (2020) Proxy anchor loss for deep metric learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3238–3247

  51. Seidenschwarz JD, Elezi I, Leal-Taixé L (2021) Learning intra-batch connections for deep metric learning. In: International conference on machine learning. PMLR, pp 9410–9421

  52. Wei X-S, Luo J-H, Wu J, Zhou Z-H (2017) Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans Image Process 26(6):2868–2881

  53. Zheng X, Ji R, Sun X, Wu Y, Huang F, Yang Y (2018) Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In: IJCAI, pp 1226–1233

Acknowledgements

This work was supported by the Natural Science Foundation of China under the Grant 62071171.

Author information

Corresponding author

Correspondence to Jinwen Ma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

A.1 Relaxing \(D_{AS}\) to \(D'_{AS}\)

The reasons for relaxing \(D_{AS}\) to \(D'_{AS}\) are as follows:

  (1) There is a close relationship between \(\max\) and log-sum-exp:

    $$\begin{aligned} \exp \left({\max _k(d(x^1_k, x^2_k))}\right)&\le \sum _{k=1}^{K}\exp (d(x^1_k, x^2_k)) \nonumber \\&\le K\ \exp (\max _k(d(x^1_k, x^2_k))) \end{aligned}$$
    (A1)
    $$\begin{aligned}&\Rightarrow \max _k \{d(x^1_k, x^2_k)\} \le \log \sum _{k=1}^{K}\exp (d(x^1_k, x^2_k)) \nonumber \\&\le \max _k \{d(x^1_k, x^2_k)\} + \log \ K \end{aligned}$$
    (A2)
    $$\begin{aligned}&\Rightarrow D_{AS}(X_1, X_2) \le D'_{AS}(X_1, X_2) \nonumber \\&\le D_{AS}(X_1, X_2) + \log \ K \end{aligned}$$
    (A3)

    Therefore, \(D'_{AS}(X_1, X_2)\) can be used to approximate \(D_{AS}(X_1, X_2)\).

  (2) Compared with \(D_{AS}(X_1, X_2)\), \(D'_{AS}(X_1, X_2)\) is a smooth function, so the gradient computation during neural network training is more robust.

  (3) During gradient back-propagation in model training, \(D'_{AS}\) provides richer gradient information than \(D_{AS}\):

    $$\begin{aligned}&\frac{\partial }{\partial d(x^1_k, x^2_k)} D_{AS}(X_1, X_2) \nonumber \\&\quad = {\left\{ \begin{array}{ll} 0, & d(x^1_k, x^2_k) \ne \max _k(d(x^1_k, x^2_k)) \\ 1, & d(x^1_k, x^2_k) = \max _k(d(x^1_k, x^2_k)) \end{array}\right. } \end{aligned}$$
    (A4)
    $$\begin{aligned}&\frac{\partial }{\partial d(x^1_k, x^2_k)} D'_{AS}(X_1, X_2) = \frac{\exp (d(x^1_k, x^2_k))}{\sum _{i=1}^{K}\exp (d(x^1_i, x^2_i))} \end{aligned}$$
    (A5)

    During model training, \(D_{AS}\) only updates the parameters associated with the part feature that has the largest distance between the two images and ignores all other part features, which hampers the learning of the retrieval network. \(D'_{AS}\) updates the parameters of all part features, so it trains the model faster than \(D_{AS}\). A small numerical sketch of this relaxation is given below.
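
A minimal NumPy sketch of the relaxation (not the authors' code; the part-feature distances \(d(x^1_k, x^2_k)\) are assumed to be precomputed) illustrates the bound of Eqs. (A1)–(A3) and the softmax-shaped gradient of Eq. (A5):

```python
import numpy as np

def d_as(dists):
    """Hard aligned distance D_AS: the largest part-feature distance."""
    return np.max(dists)

def d_as_relaxed(dists):
    """Relaxed distance D'_AS: log-sum-exp, a smooth surrogate for the max."""
    m = np.max(dists)                          # subtract the max for numerical stability
    return m + np.log(np.sum(np.exp(dists - m)))

def d_as_relaxed_grad(dists):
    """Gradient of D'_AS w.r.t. each part distance (Eq. A5): a softmax weighting,
    so every part feature receives a non-zero update, unlike D_AS (Eq. A4)."""
    e = np.exp(dists - np.max(dists))
    return e / e.sum()

# Toy check of D_AS <= D'_AS <= D_AS + log K (Eqs. A1-A3) with hypothetical distances
dists = np.array([0.8, 1.2, 0.5, 1.1])         # K = 4 part-feature distances
K = len(dists)
assert d_as(dists) <= d_as_relaxed(dists) <= d_as(dists) + np.log(K)
print(d_as(dists), d_as_relaxed(dists), d_as_relaxed_grad(dists))
```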

A.2 AMCICIR using different loss functions and hard sample mining strategies

We also explore the effectiveness of the AMCICIR method with different loss functions and different hard sample mining strategies. In this section, we consider the following four setups for the AMCICIR method. AMCICIR\(_T\): training the AMCIC model with the triplet loss function. AMCICIR\(_N\): training the AMCIC model with the N-pair loss function. AMCICIR\(_T^S\): training the AMCIC model with the triplet loss function and the softhard hard-mining strategy. AMCICIR\(_M^D\): training the AMCIC model with the margin loss function and the distance hard-mining strategy. The backbone network of all four algorithms is ResNet50. Table 4 shows their performance on CUB-200-2011. We can see that, with the addition of a suitable loss function and hard sample mining strategy, the retrieval ability of our method is further improved: the more effective the loss function and the hard sample mining strategy, the better the retrieval performance of the algorithm. In particular, AMCICIR\(_M^D\), which combines the margin loss function with the distance hard-mining strategy, reaches Recall@1–2–4–8 of 70.9–81.4–89.1–93.6 on CUB-200-2011, which outperforms previous state-of-the-art works remarkably.

Table 4 Recall@K (%) of AMCICIR with different losses and hard mining strategies on CUB-200-2011
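
For reference, the sketch below shows one standard formulation of the margin loss and distance-weighted negative sampling of [42], as used in the AMCICIR\(_M^D\) setup. It is a PyTorch sketch under the usual assumption of L2-normalized embeddings, not the authors' implementation, and \(\alpha = 0.2\), \(\beta = 1.2\) are the defaults reported in [42]:

```python
import torch
import torch.nn.functional as F

def margin_loss(anchor, positive, negative, alpha=0.2, beta=1.2):
    """Margin loss of [42]: positive pairs should be closer than beta - alpha,
    negative pairs farther than beta + alpha."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return (F.relu(d_ap - (beta - alpha)) + F.relu((beta + alpha) - d_an)).mean()

def sample_negatives(anchor, candidates, dim=128, cutoff=0.5):
    """Distance-weighted sampling of [42]: draw one negative per anchor with
    probability inversely proportional to the density of pairwise distances on
    the unit hypersphere (dim is the embedding dimension). In practice the
    candidates should be restricted to samples of different classes."""
    d = torch.cdist(anchor, candidates).clamp(min=cutoff, max=1.99)
    log_w = (2.0 - dim) * torch.log(d) - ((dim - 3.0) / 2.0) * torch.log(1.0 - 0.25 * d * d)
    idx = torch.multinomial(torch.softmax(log_w, dim=1), 1).squeeze(1)
    return candidates[idx]
```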

A.3 The reason for choosing the reward function Eq. (10)

In this paper, we are inspired by the AdaBoost algorithm when choosing the reward function \(\gamma ^{(t)} = \frac{1}{2}\ln \frac{r^{(t)}}{1-r^{(t)}}\). In the AdaBoost algorithm, the weight of the m-th weak classifier is computed as follows:

$$\begin{aligned} \alpha _{m} = \frac{1}{2}\ln \frac{1-e_m}{e_m} \end{aligned}$$
(6)

where \(e_m\) represents the classification error rate on the training dataset. When \(e_m \le \frac{1}{2}\), we have \(\alpha _{m} \ge 0\), and \(\alpha _{m}\) increases as \(e_m\) decreases. Therefore, a weak classifier with a smaller classification error plays a more important role in the final classifier. For Eq. (10) in our paper, when \(r^{(t)} \ge \frac{1}{2}\), we have \(\gamma ^{(t)} \ge 0\), and \(\gamma ^{(t)}\) increases as \(r^{(t)}\) increases. This means that a higher recall@1 has a stronger influence on the model training, and vice versa.
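
A short sketch of this reward function, directly following Eq. (10) (the clipping of \(r^{(t)}\) away from 0 and 1 is a numerical safeguard we add for illustration, not part of the paper):

```python
import math

def reward(recall_at_1, eps=1e-7):
    """AdaBoost-style reward gamma^(t) = 0.5 * ln(r / (1 - r)) from Eq. (10)."""
    r = min(max(recall_at_1, eps), 1.0 - eps)   # guard against log(0) and division by zero
    return 0.5 * math.log(r / (1.0 - r))

# r = 0.5 gives 0; higher recall@1 gives a larger positive reward, lower gives a negative one
print(reward(0.3), reward(0.5), reward(0.7))    # ~ -0.42, 0.0, ~ 0.42
```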

A.4 The recall of AMCICIR at different training iterations

We also performed ablation experiments on the batch size, setting it to 8, 16, 24 and 32, respectively. We found that as the batch size increases, the model converges faster and oscillates less. In addition, the FGIR papers that use the same datasets as ours often use large batch sizes; for example, the batch sizes of HTL [47], MPFE [33] and DGCRL [48] are 650, 80 and 60, respectively. Generally speaking, within a reasonable range, large batch sizes are helpful for model training. However, the experiments in this paper were carried out on two Tesla P100 GPUs, so the batch size could not be set very large and was limited to 32 at most. During training, we analyzed the descent curve of the loss function and found no obvious overfitting. Therefore, the batch size in our experiments is set to 32. The following figure shows the recall of our AMCICIR method at different training iterations on CUB-200-2011.

Fig. 11 The recall of AMCICIR on CUB-200-2011

During training, at each iteration AMCICIR computes the recall@1 and uses it to dynamically update the adjacency matrix of the GCN. In addition, on the test set, we evaluate the model obtained after each epoch (about 184 training iterations) and compute recall@1, recall@2, recall@4 and recall@8. From Fig. 11, we can see that, on both the training set and the test set, the retrieval performance of the model gradually improves as the number of training iterations increases, and the initial oscillation gradually flattens out.
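
The sketch below illustrates how such a per-iteration adjacency update could look. It is schematic only: the cosine-similarity affinity and the sigmoid blending weight derived from the reward \(\gamma^{(t)}\) are assumptions for illustration, and the exact update rule is the one defined in the paper.

```python
import torch
import torch.nn.functional as F

def feature_affinity(parts):
    """Cosine-similarity affinity between the K local part features (K x D)."""
    z = F.normalize(parts, dim=1)
    return (z @ z.t()).clamp(min=0)

def update_adjacency(a_prev, parts, recall_at_1, eps=1e-7):
    """Blend the previous adjacency with the current feature affinity; the blending
    weight grows with the reward gamma^(t) = 0.5*ln(r/(1-r)) computed from recall@1."""
    r = min(max(float(recall_at_1), eps), 1.0 - eps)
    gamma = 0.5 * torch.log(torch.tensor(r / (1.0 - r)))
    w = torch.sigmoid(gamma)                              # map the reward into (0, 1)
    a = w * feature_affinity(parts) + (1.0 - w) * a_prev  # higher recall trusts current features more
    return a / a.sum(dim=1, keepdim=True).clamp(min=eps)  # row-normalize for the GCN layer
```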

A.5 Comparison using more evaluation protocols

To demonstrate the superiority of our AMCICIR method from multiple perspectives, we also use another evaluation protocol, Precision@K, to display the performance of the method more comprehensively. For fairness, we only compare our method with previous state-of-the-art FGIR methods that have published the corresponding precision results or source code, namely SCDA [52], CRL [53] and DGCRL [48]. In this experiment, we compare these FGIR methods using Recall@K (K = 1, 2, 4 and 8) and Precision@K (K = 1, 5 and 10). The settings of this experiment remain the same as in Sect. 4 (Experiments and Results), and the backbone network of all methods is ResNet50. Table 5 shows the performance of each method on CUB-200-2011. The experimental results show that our method outperforms the other previous state-of-the-art FGIR methods in terms of both recall and precision, which further proves its superiority.

Table 5 Recall@K (%) and Precision@K (%) on CUB-200-2011 compared with state-of-the-art methods
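
For completeness, a sketch of how Recall@K and Precision@K are typically computed on a test set is given below (a standard NumPy implementation, not the authors' evaluation code):

```python
import numpy as np

def recall_precision_at_k(embeddings, labels, ks_recall=(1, 2, 4, 8), ks_prec=(1, 5, 10)):
    """Recall@K: fraction of queries with at least one same-class image among the
    top-K retrieved neighbors. Precision@K: average fraction of the top-K neighbors
    that share the query's class. The query image itself is excluded."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)                    # exclude self-matches
    order = np.argsort(-sims, axis=1)                  # neighbors sorted by similarity
    same = labels[order] == labels[:, None]            # same-class indicator per neighbor
    recalls = {k: float(np.mean(same[:, :k].any(axis=1))) for k in ks_recall}
    precisions = {k: float(np.mean(same[:, :k].mean(axis=1))) for k in ks_prec}
    return recalls, precisions
```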

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, X., Ma, J. Fine-grained image retrieval by combining attention mechanism and context information. Neural Comput & Applic 35, 1881–1897 (2023). https://doi.org/10.1007/s00521-022-07873-3
