Abstract
In the modern era of digital photography, and with the advent of smartphones, millions of images are generated every day, capturing precious moments and events of our lives. As this digital collection grows, managing and accessing the images becomes a daunting task, and we lose track of them unless they are properly organized. What is needed is a tool that can fetch images from a single word or a description. In this paper, we build a solution that retrieves relevant images from a pool, given a textual description, by looking at the content of the images. The model is based on a deep neural network architecture that attends to the relevant parts of an image. The algorithm takes a sentence or a word as input and returns the top images relevant to that caption. We embed both the sentence and the image in a common higher-dimensional latent space, which lets us compare the two and score their similarity to decide relevance. We have conducted various experiments to improve the representations of the image and the caption in the latent space for better correlation, e.g., using bidirectional sequence models for stronger textual representations and various convolution-based baseline stacks for stronger image representations. We have also incorporated a self-attention mechanism to focus on only the relevant parts of the image and the sentence, thereby enhancing the correlation between the two spaces.
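To make the retrieval pipeline concrete, below is a minimal PyTorch sketch of the joint-embedding idea described above. It is an illustrative assumption rather than the paper's exact architecture: the class name JointEmbeddingRetriever, the layer dimensions, and the lightweight attention-pooling layer (standing in for the self-attention mechanism) are all hypothetical, and the image tower consumes pre-extracted CNN features (e.g., from a ResNet or Inception backbone) instead of raw pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingRetriever(nn.Module):
    """Two-tower sketch: a bidirectional GRU text encoder with attention
    pooling, and a linear projection over pre-extracted CNN image
    features, both mapped into one shared latent space."""

    def __init__(self, vocab_size=10000, embed_dim=300,
                 image_feat_dim=2048, latent_dim=512):
        super().__init__()
        # Text tower: word embeddings -> bidirectional GRU.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, latent_dim // 2,
                          batch_first=True, bidirectional=True)
        # Attention pooling: scores each time step; a simple stand-in
        # for the self-attention mechanism mentioned in the abstract.
        self.attn = nn.Linear(latent_dim, 1)
        # Image tower: pooled CNN features -> shared latent space.
        self.image_proj = nn.Linear(image_feat_dim, latent_dim)

    def encode_text(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices.
        hidden, _ = self.gru(self.word_embed(token_ids))   # (B, T, latent)
        weights = torch.softmax(self.attn(hidden), dim=1)  # (B, T, 1)
        sentence = (weights * hidden).sum(dim=1)           # weighted pooling
        return F.normalize(sentence, dim=-1)               # unit vectors

    def encode_image(self, image_feats):
        # image_feats: (batch, image_feat_dim) pooled CNN features.
        return F.normalize(self.image_proj(image_feats), dim=-1)

def rank_images(model, query_ids, image_feats, top_k=5):
    """Indices of the top_k pool images most similar to the text query."""
    with torch.no_grad():
        q = model.encode_text(query_ids)      # (1, latent_dim)
        v = model.encode_image(image_feats)   # (N, latent_dim)
        scores = (v @ q.T).squeeze(1)         # cosine similarity per image
        return scores.topk(top_k).indices

# Toy usage: random tensors stand in for a tokenized caption and a pool
# of 100 pre-computed image feature vectors.
model = JointEmbeddingRetriever()
query = torch.randint(0, 10000, (1, 6))
pool = torch.randn(100, 2048)
print(rank_images(model, query, pool))
```

Because both towers L2-normalize their outputs, the matrix product in rank_images is exactly the cosine similarity used to score relevance; a real system would first train the two towers jointly with a ranking or contrastive loss before using them for retrieval.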
Funding
The authors declare that they have not received any funding.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Applications of Cloud Computing, Data Analytics and Building Secure Networks” guest edited by Rajnish Sharma, Pao-Ann Hsiung and Sagar Juneja.
About this article
Cite this article
Rao, S.S., Ikram, S. & Ramesh, P. Deep Learning-Based Image Retrieval System with Clustering on Attention-Based Representations. SN COMPUT. SCI. 2, 179 (2021). https://doi.org/10.1007/s42979-021-00563-2