Abstract
Cross-modal image-text matching is vital for building relationships between vision and language. Its central challenge is bridging the heterogeneity between images and text. Existing fine-grained image-text matching methods have made great progress in exploring fine-grained correspondence; however, they rely solely on cross-attention, ignoring both the semantics of image regions and the dynamic nature of image-text matching. In this paper, we propose a novel Dynamic Semantic Generation and Similarity Reasoning (DSGSR) network for image-text matching. Specifically, we exploit intra-modal relations to enrich the regional features of an image. Then, to support dynamic cross-modal matching, we generate the query text or image representation dynamically, conditioned on the retrieved image or text representation. We also introduce a Graph Convolutional Network (GCN) to account for the effect of neighbor-node information on matching accuracy when measuring image-text similarity. Extensive experiments and analyses show that DSGSR surpasses state-of-the-art methods on the Flickr30K and MSCOCO datasets.
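To make the GCN-based similarity reasoning concrete, the sketch below shows one plausible graph-convolution pass in which each node holds a local image-text similarity vector and message passing lets neighbor nodes refine it. This is a minimal illustration under assumed shapes; the class and layer names (SimilarityGCN, w_edge, w_node) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGCN(nn.Module):
    """One graph-convolution step over local similarity nodes.

    Hypothetical sketch of GCN-style similarity reasoning: edge
    weights are learned affinities between similarity nodes, and
    each node is updated from its neighbors' messages.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_edge = nn.Linear(dim, dim, bias=False)  # edge-affinity projection
        self.w_node = nn.Linear(dim, dim)              # node-update projection

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (n, dim) local image-text similarity vectors
        affinity = nodes @ self.w_edge(nodes).t()      # (n, n) pairwise affinities
        adj = F.softmax(affinity, dim=-1)              # row-normalized adjacency
        return F.relu(self.w_node(adj @ nodes))        # propagate, then update

# Usage: refine 36 region-level similarity vectors of width 256.
gcn = SimilarityGCN(dim=256)
refined = gcn(torch.randn(36, 256))                    # (36, 256)
```

A full model would presumably stack a few such layers and pool the refined nodes into a global matching score; the pooling choice is left open here.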
This work was partially supported by the National Natural Science Foundation of China (62072088), the Ten Thousand Talent Program (ZX20200035), the Liaoning Distinguished Professor Program (XLYC1902057), and the CCF-Huawei Database Innovation Research Program (No. CCF-HuaweiDBIR001A).
Cite this paper
Li, X., Wang, B., Zhang, X., Yang, X. (2021). DSGSR: Dynamic Semantic Generation and Similarity Reasoning for Image-Text Matching. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds.) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol. 13069. Springer, Cham. https://doi.org/10.1007/978-3-030-93046-2_15
Print ISBN: 978-3-030-93045-5
Online ISBN: 978-3-030-93046-2