Research Article
DOI: 10.1145/3512527.3531368

Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Published: 27 June 2022

Abstract

Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and sentences usually occurs at a finer granularity, i.e., the phrase level. In this paper, we introduce additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions among multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
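The two mechanisms in the abstract lend themselves to short sketches. First, a minimal sketch of how phrase-level labels could be constructed from a matched sentence. The paper builds text scene graphs and extracts entities and triples; here spaCy's dependency parse stands in for a scene-graph parser, so the parser choice, the function name phrase_level_labels, and the dependency-label heuristics are assumptions, not the authors' pipeline:

```python
# A minimal sketch, assuming spaCy (pip install spacy) and its small English
# model (python -m spacy download en_core_web_sm). These dependency-parse
# heuristics approximate a text scene graph parser; they are NOT the exact
# parser used in the paper.
import spacy

nlp = spacy.load("en_core_web_sm")

def phrase_level_labels(sentence: str):
    """Extract entities (noun phrases) and (subject, predicate, object)
    triples from one matched sentence."""
    doc = nlp(sentence)
    # Entities: noun chunks stand in for scene-graph object nodes.
    entities = [chunk.text for chunk in doc.noun_chunks]
    # Triples: verb-mediated relations recovered from the dependency tree.
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [t for t in token.lefts if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in token.rights if t.dep_ in ("dobj", "attr")]
        # Follow prepositional attachments, e.g. "riding ... on the beach".
        for prep in (t for t in token.rights if t.dep_ == "prep"):
            objects.extend(t for t in prep.children if t.dep_ == "pobj")
        for s in subjects:
            for o in objects:
                triples.append((s.text, token.lemma_, o.text))
    return entities, triples

print(phrase_level_labels("A young boy is riding a brown horse on the beach."))
# Possible output: (['A young boy', 'a brown horse', 'the beach'],
#                   [('boy', 'ride', 'horse'), ('boy', 'ride', 'beach')])
```

Second, a rough PyTorch sketch of a multi-scale matching objective: a global, sentence-level bidirectional hinge loss over in-batch negatives, plus a local, phrase-level term that penalizes mismatched phrases. The tensor shapes, margin, temperature, and phrase-scoring scheme are illustrative assumptions rather than the paper's exact formulation:

```python
# A rough sketch of multi-scale matching under the assumptions stated above;
# multi_scale_matching_loss and its arguments are hypothetical names.
import torch
import torch.nn.functional as F

def multi_scale_matching_loss(img_emb, sent_emb, phrase_emb, phrase_label,
                              margin=0.2, temperature=0.1):
    """img_emb:      (B, D)    pooled image features
    sent_emb:     (B, D)    sentence-level text features
    phrase_emb:   (B, P, D) phrase-level text features
    phrase_label: (B, P)    1 = matched phrase, 0 = mismatched phrase"""
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)

    # Global matching: bidirectional hinge loss; matched pairs lie on the
    # diagonal of the image-sentence cosine-similarity matrix.
    sim = img_emb @ sent_emb.t()                                  # (B, B)
    pos = sim.diag().view(-1, 1)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2t = (margin + sim - pos).clamp(min=0)[off_diag].mean()
    loss_t2i = (margin + sim.t() - pos).clamp(min=0)[off_diag].mean()

    # Local matching: score every phrase against its paired image and push
    # mismatched phrases (label 0) toward low similarity.
    phrase_sim = torch.einsum("bpd,bd->bp", phrase_emb, img_emb)  # (B, P)
    loss_local = F.binary_cross_entropy_with_logits(
        phrase_sim / temperature, phrase_label.float())

    return loss_i2t + loss_t2i + loss_local
```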

Supplementary Material

MP4 File (ICMR22-fp95.mp4)
ICMR2022 Presentation Video: Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval


Information

Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN:9781450392389
DOI:10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2022

Author Tags

  1. fine-grained supervision
  2. image-text retrieval
  3. phrase modeling

Qualifiers

  • Research-article

Funding Sources

  • Science and Technology Commission of Shanghai Municipality Grant
  • Zhejiang Lab
  • Natural Science Foundation of China

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 1

Reflects downloads up to 12 Nov 2024.

Cited By

  • (2025) Cross-modal independent matching network for image-text retrieval. Pattern Recognition, 159 (111096). DOI: 10.1016/j.patcog.2024.111096. Online publication date: Mar-2025.
  • (2024) SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 20:8 (1-28). DOI: 10.1145/3664816. Online publication date: 11-May-2024.
  • (2024) Joint Intra & Inter-Grained Reasoning: A New Look Into Semantic Consistency of Image-Text Retrieval. IEEE Transactions on Multimedia, 26 (4912-4925). DOI: 10.1109/TMM.2023.3327645. Online publication date: 2024.
  • (2024) Improving Image-Text Matching by Integrating Word Sense Disambiguation. IEEE Signal Processing Letters, 31 (2695-2699). DOI: 10.1109/LSP.2024.3466992. Online publication date: 2024.
  • (2024) Metasql: A Generate-Then-Rank Framework for Natural Language to SQL Translation. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (1765-1778). DOI: 10.1109/ICDE60146.2024.00143. Online publication date: 13-May-2024.
  • (2023) Reservoir Computing Transformer for Image-Text Retrieval. Proceedings of the 31st ACM International Conference on Multimedia (5605-5613). DOI: 10.1145/3581783.3611758. Online publication date: 26-Oct-2023.
  • (2023) The Style Transformer With Common Knowledge Optimization for Image-Text Retrieval. IEEE Signal Processing Letters, 30 (1197-1201). DOI: 10.1109/LSP.2023.3310870. Online publication date: 2023.
  • (2023) Learning Semantic Relationship among Instances for Image-Text Matching. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (15159-15168). DOI: 10.1109/CVPR52729.2023.01455. Online publication date: Jun-2023.
