

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Published: 01 January 2023

Abstract

Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics between image and text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and keeps positive ones. However, this fixed threshold is completely isolated from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework so as to maximally separate the relevant and irrelevant distributions and obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between the two distributions, encouraging the model to learn more discriminative features. The explicit relevance threshold is integrated directly into similarity matching, which kills two birds with one stone: (1) it excludes the disturbance of irrelevant fragment contents so that precisely the relevant shared semantics are aggregated, boosting matching accuracy; and (2) it avoids computing queries for irrelevant fragments, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%–4% (16.9%–35.3% over the SCAN baseline), while reducing retrieval time by 50%–73%.
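For intuition only, the sketch below (plain PyTorch, not the authors' released code) contrasts the fixed zero threshold applied via ReLU with a relevance boundary learned jointly with the features. The class name, the parameters (`tau`, `softness`, `temperature`), and the sigmoid gating are illustrative assumptions rather than the exact UARDA formulation; the hard cut-off at `tau` mentioned in the comments is likewise only a plausible way to skip irrelevant regions at retrieval time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relu_relevance(sim):
    """Fixed zero threshold: ReLU keeps positive word-region similarities and
    zeroes out negatives (the heuristic the abstract argues against)."""
    return F.normalize(F.relu(sim), p=2, dim=-1)


class LearnableRelevanceAttention(nn.Module):
    """Hypothetical sketch of a learned relevance boundary `tau`.

    A sigmoid gate around `tau` keeps the threshold trainable jointly with the
    features; at inference, a hard cut at `tau` could simply skip regions
    deemed irrelevant, which is one way to save retrieval time.
    """

    def __init__(self, init_tau: float = 0.0, softness: float = 0.1,
                 temperature: float = 9.0):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(init_tau))  # learnable relevance boundary
        self.softness = softness        # sharpness of the gate around tau
        self.temperature = temperature  # softmax temperature for the attention

    def forward(self, words, regions):
        # words: (B, Tw, D) word features; regions: (B, Tr, D) region features
        sim = torch.einsum("bwd,brd->bwr",
                           F.normalize(words, dim=-1),
                           F.normalize(regions, dim=-1))
        # Soft relevance gate: ~1 above the learned boundary, ~0 below it.
        gate = torch.sigmoid((sim - self.tau) / self.softness)
        attn = F.softmax(self.temperature * sim, dim=-1) * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        # Aggregate only the (softly) relevant regions as each word's shared semantics.
        return torch.einsum("bwr,brd->bwd", attn, regions)


if __name__ == "__main__":
    words, regions = torch.randn(2, 5, 256), torch.randn(2, 36, 256)
    out = LearnableRelevanceAttention()(words, regions)
    print(out.shape)  # torch.Size([2, 5, 256])
```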

References

[1]
L. Zhang et al., “Deep top-k ranking for image-sentence matching,” IEEE Trans. Multimedia, vol. 22, pp. 775–785, 2019.
[2]
L. Wang, Y. Li, J. Huang, and S. Lazebnik, “Learning two-branch neural networks for image-text matching tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 394–407, Feb. 2019.
[3]
J. Yu et al., “Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval,” IEEE Trans. Multimedia, vol. 22, no. 12, pp. 3196–3209, Dec. 2020.
[4]
C. Liu et al., “Focus your attention: A focal attention for multimodal learning,” IEEE Trans. Multimedia, to be published.
[5]
Y. Liu, X. Wang, Y. Yuan, and W. Zhu, “Cross-modal dual learning for sentence-to-video generation,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 1239–1247.
[6]
J. H. Tan, C. S. Chan, and J. H. Chuah, “COMIC: Toward a compact image captioning model with attention,” IEEE Trans. Multimedia, vol. 21, pp. 2686–2696, 2019.
[7]
J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, “Video storytelling: Textual summaries for events,” IEEE Trans. Multimedia, vol. 22, pp. 554–565, 2019.
[8]
M. Jiang, S. Chen, J. Yang, and Q. Zhao, “Fantastic answers and where to find them: Immersive question-directed visual attention,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 2980–2989.
[9]
J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang, “Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7181–7189.
[10]
Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew, “Learning a recurrent residual fusion network for multimodal matching,” in Proc. Int. Conf. Comput. Vis., 2017, pp. 4107–4116.
[11]
L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 2623–2631.
[12]
Y. Wu, S. Wang, and Q. Huang, “Learning semantic structure-preserved embeddings for cross-modal retrieval,” in Proc. Int. Conf. Comput. Vis., 2018, pp. 825–833.
[13]
L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5005–5013.
[14]
F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improving visual-semantic embeddings with hard negatives,” in Proc. Brit. Mach. Vis. Conf., 2018, pp. 1–13.
[15]
N. Sarafianos, X. Xu, and I. A. Kakadiaris, “Adversarial representation learning for text-to-image matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5814–5824.
[16]
A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Proc. Neural Inf. Process. Syst., 2014, pp. 1889–1897.
[17]
A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3128–3137.
[18]
Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Hierarchical multimodal LSTM for dense visual-semantic embedding,” in Proc. Int. Conf. Comput. Vis., 2017, pp. 1881–1889.
[19]
K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 201–216.
[20]
C. Liu et al., “Focus your attention: A bidirectional focal attention network for image-text matching,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 3–11.
[21]
S. Wang, R. Wang, Z. Yao, S. Shan, and X. Chen, “Cross-modal scene graph matching for relationship-aware image-text retrieval,” in Proc. Winter Conf. Appl. Comput. Vis., 2020, pp. 1508–1517.
[22]
Y. Wang et al., “Position focused attention network for image-text matching,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 3792–3798.
[23]
Y. Wu, S. Wang, G. Song, and Q. Huang, “Learning fragment self-attention embeddings for image-text matching,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 2088–2096.
[24]
T. Wang et al., “Matching images and text with multi-modal tensor fusion and re-ranking,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 12–20.
[25]
C. Liu et al., “Graph structured network for image-text matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10921–10930.
[26]
H. Chen et al., “IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12655–12663.
[27]
T. Chen and J. Luo, “Expressing objects just like words: Recurrent visual embedding for image-text matching,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 10583–10590.
[28]
Z. Hu, Y. Luo, J. Lin, Y. Yan, and J. Chen, “Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 789–795.
[29]
F. Huang, X. Zhang, Z. Zhao, and Z. Li, “Bi-directional spatial-semantic attention networks for image-text matching,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 2008–2020, Apr. 2019.
[30]
Z. Ji, H. Wang, J. Han, and Y. Pang, “Saliency-guided attention network for image-sentence matching,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 5754–5763.
[31]
Y. Huang, Q. Wu, C. Song, and L. Wang, “Learning semantic concepts and order for image and sentence matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6163–6171.
[32]
X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Trans. Multimedia, vol. 21, pp. 2117–2130, 2019.
[33]
P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6077–6086.
[34]
W. Zhang et al., “Frame augmented alternating attention network for video question answering,” IEEE Trans. Multimedia, vol. 22, pp. 1032–1041, 2019.
[35]
J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph R-CNN for scene graph generation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 670–685.
[36]
H. Nam, J.-W. Ha, and J. Kim, “Dual attention networks for multimodal reasoning and matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 299–307.
[37]
X. Yang, C. Deng, Z. Dang, K. Wei, and J. Yan, “SelfSAGCN: Self-supervised semantic alignment for graph convolution network,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16775–16784.
[38]
X. Li, Z. Xu, K. Wei, and C. Deng, “Generalized zero-shot learning via disentangled representation,” in Proc. AAAI Conf. Artif. Intell., 2021, vol. 35, no. 3, pp. 1966–1974.
[39]
K. Wei, M. Yang, H. Wang, C. Deng, and X. Liu, “Adversarial fine-grained composition learning for unseen attribute-object recognition,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3741–3749.
[40]
Y. Zhang and H. Lu, “Deep cross-modal projection learning for image-text matching,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 686–701.
[41]
D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004.
[42]
C. Deng, Z. Chen, X. Liu, X. Gao, and D. Tao, “Triplet-based deep hashing network for cross-modal retrieval,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 3893–3903, Aug. 2018.
[43]
T. Chen, J. Deng, and J. Luo, “Adaptive offline quintuplet loss for image-text matching,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 549–565.
[44]
J. Wei et al., “Universal weighting metric learning for cross-modal matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13005–13014.
[45]
J. Wei, Y. Yang, X. Xu, X. Zhu, and H. T. Shen, “Universal weighting metric learning for cross-modal retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[46]
K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Visual semantic reasoning for image-text matching,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 4654–4662.
[47]
X. Xu, K. Lin, Y. Yang, A. Hanjalic, and H. T. Shen, “Joint feature synthesis and embedding: Adversarial cross-modal retrieval revisited,” IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[48]
S. Yan, L. Yu, and Y. Xie, “Discrete-continuous action space policy gradient-based attention for image-text matching,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8096–8105.
[49]
L. Qu, M. Liu, J. Wu, Z. Gao, and L. Nie, “Dynamic modality interaction modeling for image-text retrieval,” in Proc. Conf. Res. Develop. Inf. Retrieval, 2021, pp. 1104–1113.
[50]
R. Krishna et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” Int. J. Comput. Vis., vol. 123, no. 1, pp. 32–73, 2017.
[51]
B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation,” 2014, arXiv:1411.7399.
[52]
Z. Wang et al., “CAMP: Cross-modal adaptive message passing for text-image retrieval,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 5764–5773.
[53]
B. A. Plummer et al., “Flickr30K entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 2641–2649.
[54]
T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[55]
G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 11336–11344.
[56]
Y. Song and M. Soleymani, “Polysemous visual-semantic embedding for cross-modal retrieval,” in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1979–1988.




Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2023

Qualifiers

  • Research-article


