Deep Learning-Based Image Retrieval System with Clustering on Attention-Based Representations

  • Original Research
  • Published in SN Computer Science

Abstract

In the modern era of digital photography and smartphones, millions of images are generated every day, capturing precious moments and events of our lives. As these collections grow, managing and accessing them becomes a daunting task, and we lose track of our images unless they are properly organized. There is a pressing need for a tool that can fetch images based on a word or a description. In this paper, we build a solution that retrieves relevant images from a pool, given a textual description, by looking at the content of the images. The model is based on a deep neural network architecture that attends to the relevant parts of the image. The algorithm takes a sentence or a word as input and returns the top images relevant to that caption. We obtain representations of the sentence and of the image in a high-dimensional latent space, which allows us to compare the two and compute a similarity score that determines relevance. We have conducted various experiments to improve the image and caption representations in the latent space for better correlation, e.g., bidirectional sequence models for stronger textual representations and various baseline convolutional stacks for stronger image representations. We have also incorporated a self-attention mechanism that focuses on only the relevant parts of the image and the sentence, further strengthening the correlation between the two modalities.
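
As a concrete illustration of the retrieval pipeline described above, the sketch below maps a caption and a pool of images into a shared latent space and ranks the images by cosine similarity to the query. It is a minimal sketch only: the ResNet-50 backbone, bidirectional GRU, embedding size, and helper names are illustrative assumptions and do not reproduce the paper's exact encoders, attention modules, or clustering stage.

```python
# Minimal sketch of caption-to-image retrieval in a shared latent space.
# Encoder choices (ResNet-50 backbone, bidirectional GRU, embedding size)
# are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

EMBED_DIM = 512  # dimension of the shared latent space (assumed)

class ImageEncoder(nn.Module):
    """CNN backbone followed by a linear projection into the latent space."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        backbone = models.resnet50(weights=None)        # any conv stack could be swapped in
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                           # images: (B, 3, H, W)
        feats = self.features(images).flatten(1)         # (B, 2048) pooled features
        return F.normalize(self.project(feats), dim=-1)  # unit-norm embeddings

class TextEncoder(nn.Module):
    """Bidirectional GRU over word embeddings, projected into the latent space."""
    def __init__(self, vocab_size, embed_dim=EMBED_DIM, word_dim=300, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        self.project = nn.Linear(2 * hidden, embed_dim)

    def forward(self, token_ids):                        # token_ids: (B, T)
        _, h = self.rnn(self.embed(token_ids))           # h: (2, B, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)              # concatenate both directions
        return F.normalize(self.project(h), dim=-1)

def rank_images(caption_emb, image_embs, top_k=5):
    """Cosine similarity between one caption and a pool of image embeddings."""
    sims = image_embs @ caption_emb                      # (N,), both sides are unit-norm
    return sims.topk(min(top_k, image_embs.size(0))).indices

# Toy usage with random data, just to show the shapes involved.
if __name__ == "__main__":
    img_enc, txt_enc = ImageEncoder(), TextEncoder(vocab_size=10000)
    images = torch.randn(8, 3, 224, 224)
    caption = torch.randint(0, 10000, (1, 12))
    with torch.no_grad():
        top = rank_images(txt_enc(caption)[0], img_enc(images))
    print("Top matching image indices:", top.tolist())
```

In practice the two encoders would be trained jointly with a ranking loss so that matching caption-image pairs end up close together in the shared space, and retrieval then reduces to the nearest-neighbour lookup shown above.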



Funding

The authors declare that they have not received any funding.

Author information


Corresponding author

Correspondence to Sumanth S. Rao.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Applications of Cloud Computing, Data Analytics and Building Secure Networks” guest edited by Rajnish Sharma, Pao-Ann Hsiung and Sagar Juneja.


About this article


Cite this article

Rao, S.S., Ikram, S. & Ramesh, P. Deep Learning-Based Image Retrieval System with Clustering on Attention-Based Representations. SN COMPUT. SCI. 2, 179 (2021). https://doi.org/10.1007/s42979-021-00563-2

