Abstract
In the modern era of digital photography, and with the advent of smartphones, millions of images are generated every day, capturing precious moments and events of our lives. As this digital collection grows, managing and accessing the images becomes a daunting task, and we lose track of them unless they are properly organized. What is needed is a tool that can fetch images from a single word or a description. In this paper, we build a solution that retrieves relevant images from a pool, given a textual description, by looking at the content of the images. The model is based on a deep neural network architecture that attends to the relevant parts of an image. The algorithm takes a sentence or a word as input and returns the top images relevant to that caption. We embed both the sentence and the image in a common higher-dimensional latent space, which lets us compare the two and score their similarity to decide relevance. We have conducted various experiments to improve the representations of the image and the caption in the latent space for better correlation, e.g., using bidirectional sequence models for stronger textual representations and various convolution-based baseline stacks for stronger image representations. We have also incorporated a self-attention mechanism to focus on only the relevant parts of the image and the sentence, thereby enhancing the correlation between the two spaces.
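To make the retrieval pipeline concrete, below is a minimal PyTorch sketch of the joint-embedding idea described above. It is an illustrative assumption rather than the paper's exact architecture: the class name JointEmbeddingRetriever, the layer dimensions, and the lightweight attention-pooling layer (standing in for the self-attention mechanism) are all hypothetical, and the image tower consumes pre-extracted CNN features (e.g., from a ResNet or Inception backbone) instead of raw pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingRetriever(nn.Module):
    """Two-tower sketch: a bidirectional GRU text encoder with attention
    pooling, and a linear projection over pre-extracted CNN image
    features, both mapped into one shared latent space."""

    def __init__(self, vocab_size=10000, embed_dim=300,
                 image_feat_dim=2048, latent_dim=512):
        super().__init__()
        # Text tower: word embeddings -> bidirectional GRU.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, latent_dim // 2,
                          batch_first=True, bidirectional=True)
        # Attention pooling: scores each time step; a simple stand-in
        # for the self-attention mechanism mentioned in the abstract.
        self.attn = nn.Linear(latent_dim, 1)
        # Image tower: pooled CNN features -> shared latent space.
        self.image_proj = nn.Linear(image_feat_dim, latent_dim)

    def encode_text(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices.
        hidden, _ = self.gru(self.word_embed(token_ids))   # (B, T, latent)
        weights = torch.softmax(self.attn(hidden), dim=1)  # (B, T, 1)
        sentence = (weights * hidden).sum(dim=1)           # weighted pooling
        return F.normalize(sentence, dim=-1)               # unit vectors

    def encode_image(self, image_feats):
        # image_feats: (batch, image_feat_dim) pooled CNN features.
        return F.normalize(self.image_proj(image_feats), dim=-1)

def rank_images(model, query_ids, image_feats, top_k=5):
    """Indices of the top_k pool images most similar to the text query."""
    with torch.no_grad():
        q = model.encode_text(query_ids)      # (1, latent_dim)
        v = model.encode_image(image_feats)   # (N, latent_dim)
        scores = (v @ q.T).squeeze(1)         # cosine similarity per image
        return scores.topk(top_k).indices

# Toy usage: random tensors stand in for a tokenized caption and a pool
# of 100 pre-computed image feature vectors.
model = JointEmbeddingRetriever()
query = torch.randint(0, 10000, (1, 6))
pool = torch.randn(100, 2048)
print(rank_images(model, query, pool))
```

Because both towers L2-normalize their outputs, the matrix product in rank_images is exactly the cosine similarity used to score relevance; a real system would first train the two towers jointly with a ranking or contrastive loss before using them for retrieval.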
Funding
The authors declare that they have not received any funding.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Applications of Cloud Computing, Data Analytics and Building Secure Networks” guest edited by Rajnish Sharma, Pao-Ann Hsiung and Sagar Juneja.
About this article
Cite this article
Rao, S.S., Ikram, S. & Ramesh, P. Deep Learning-Based Image Retrieval System with Clustering on Attention-Based Representations. SN COMPUT. SCI. 2, 179 (2021). https://doi.org/10.1007/s42979-021-00563-2