Abstract
Cross-modal image-text matching is vital for building relationships between vision and language. Its central challenge is bridging the heterogeneity between images and text. Existing fine-grained image-text matching methods have made great progress in exploring fine-grained correspondence; however, they rely solely on cross-attention, ignoring both the semantics of image regions and the dynamic nature of image-text matching. In this paper, we propose a novel Dynamic Semantic Generation and Similarity Reasoning (DSGSR) network for image-text matching. Specifically, we exploit intra-modal relations to enrich the regional features of an image. Then, to support dynamic cross-modal matching, we generate the query text or image representation dynamically, conditioned on the retrieved image or text representation. We also introduce a Graph Convolutional Network (GCN) to account for the effect of neighbor-node information on matching accuracy when measuring image-text similarity. Extensive experiments and analyses show that DSGSR surpasses state-of-the-art methods on the Flickr30K and MSCOCO datasets.
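To make the GCN-based similarity reasoning concrete, the sketch below shows one plausible graph-convolution pass in which each node holds a local image-text similarity vector and message passing lets neighbor nodes refine it. This is a minimal illustration under assumed shapes; the class and layer names (SimilarityGCN, w_edge, w_node) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGCN(nn.Module):
    """One graph-convolution step over local similarity nodes.

    Hypothetical sketch of GCN-style similarity reasoning: edge
    weights are learned affinities between similarity nodes, and
    each node is updated from its neighbors' messages.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_edge = nn.Linear(dim, dim, bias=False)  # edge-affinity projection
        self.w_node = nn.Linear(dim, dim)              # node-update projection

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (n, dim) local image-text similarity vectors
        affinity = nodes @ self.w_edge(nodes).t()      # (n, n) pairwise affinities
        adj = F.softmax(affinity, dim=-1)              # row-normalized adjacency
        return F.relu(self.w_node(adj @ nodes))        # propagate, then update

# Usage: refine 36 region-level similarity vectors of width 256.
gcn = SimilarityGCN(dim=256)
refined = gcn(torch.randn(36, 256))                    # (36, 256)
```

A full model would presumably stack a few such layers and pool the refined nodes into a global matching score; the pooling choice is left open here.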
This work was partially supported by the National Natural Science Foundation of China (62072088), the Ten Thousand Talent Program (ZX20200035), the Liaoning Distinguished Professor Program (XLYC1902057), and the CCF-Huawei Database Innovation Research Program (No. CCF-HuaweiDBIR001A).
Cite this paper
Li, X., Wang, B., Zhang, X., Yang, X. (2021). DSGSR: Dynamic Semantic Generation and Similarity Reasoning for Image-Text Matching. In: Fang, L., Chen, Y., Zhai, G., Wang, J., Wang, R., Dong, W. (eds.) Artificial Intelligence. CICAI 2021. Lecture Notes in Computer Science, vol. 13069. Springer, Cham. https://doi.org/10.1007/978-3-030-93046-2_15
Print ISBN: 978-3-030-93045-5
Online ISBN: 978-3-030-93046-2