
DOI: 10.1145/3474085.3475619 (MM '21 Conference Proceedings, research article)

Image Search with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization

Published: 17 October 2021

Abstract

Image retrieval with text feedback is an emerging research topic whose goal is to integrate inputs from multiple modalities into a single query. In this setting, a query consists of a reference image plus text feedback that describes the modifications between that image and the desired image. Existing work on this task focuses mainly on designing new fusion networks to compose the image and text, while little attention has been paid to the modality gap caused by the inconsistent feature distributions of different modalities, which strongly affects both feature fusion and similarity learning between queries and the desired image. We propose a Distribution-Aligned Text-based Image Retrieval (DATIR) model, consisting of attention mutual information maximization and hierarchical mutual information maximization, which bridges this gap by increasing the non-linear statistical dependence between representations of different modalities. Specifically, attention mutual information maximization narrows the gap between input modalities by maximizing the mutual information between the text representation and its semantically consistent counterpart, which a difference transformer captures from the reference image and the desired image. Hierarchical mutual information maximization aligns the feature distributions of the image modality and the fusion modality by estimating the mutual information between a single-layer representation in the fusion network and multi-level representations in the desired-image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate that DATIR bridges the modality gap between modalities and achieves state-of-the-art retrieval performance.
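Both objectives rely on estimating mutual information between high-dimensional representations, which in practice is done with a neural lower bound (e.g., MINE or InfoNCE) rather than in closed form. As an illustrative sketch only — this is not the authors' exact estimator, and the batched text/image features here are a hypothetical setup — a contrastive InfoNCE lower bound between a batch of text representations and their matching image representations can be computed as follows:

```python
import numpy as np

def info_nce_lower_bound(text_feats, image_feats, temperature=0.07):
    """InfoNCE lower bound on I(text; image) for a batch of aligned
    representations: row i of each matrix is a positive pair, and every
    other row in the batch serves as a negative."""
    # L2-normalise so the dot products below are cosine similarities.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (B, B) similarity matrix
    # Row-wise log-softmax; the positive pair sits on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # InfoNCE bound: I(text; image) >= log(B) + E[log-softmax of positives].
    return np.log(len(t)) + np.mean(np.diag(log_probs))
```

Maximizing such a bound with respect to the encoders pulls matched text/image pairs together relative to mismatched ones within the batch, which is the sense in which mutual-information maximization "aligns" the feature distributions of the two modalities.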

Supplementary Material

RAR File (mfp2476aux.rar)
In this supplementary material, we provide the parameter settings for the compared methods, along with qualitative results and analysis of our method on three standard benchmark datasets.




Published In

cover image ACM Conferences
MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. attention
  2. image retrieval
  3. mutual information maximization

Qualifiers

  • Research-article


Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 80
  • Downloads (last 6 weeks): 6

Reflects downloads up to 16 Nov 2024

Cited By

  • (2024) Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 1245-1254. DOI: 10.1145/3664647.3681649. Online publication date: 28-Oct-2024.
  • (2024) Multi-Grained Representation Aggregating Transformer with Gating Cycle for Change Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1-23. DOI: 10.1145/3660346. Online publication date: 12-Sep-2024.
  • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 240-250. DOI: 10.1145/3626772.3657831. Online publication date: 10-Jul-2024.
  • (2024) LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 80-90. DOI: 10.1145/3626772.3657740. Online publication date: 10-Jul-2024.
  • (2024) Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(5), 3665-3678. DOI: 10.1109/TPAMI.2023.3346434. Online publication date: May-2024.
  • (2024) Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback. IEEE Transactions on Multimedia 26, 9936-9948. DOI: 10.1109/TMM.2024.3417694. Online publication date: 2024.
  • (2024) Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. IEEE Transactions on Multimedia 26, 916-928. DOI: 10.1109/TMM.2023.3273466. Online publication date: 1-Jan-2024.
  • (2024) Geometric-Contextual Mutual Infomax Path Aggregation for Relation Reasoning on Knowledge Graph. IEEE Transactions on Knowledge and Data Engineering 36(7), 3076-3090. DOI: 10.1109/TKDE.2024.3360258. Online publication date: Jul-2024.
  • (2024) Multimodal Composition Example Mining for Composed Query Image Retrieval. IEEE Transactions on Image Processing 33, 1149-1161. DOI: 10.1109/TIP.2024.3359062. Online publication date: 1-Feb-2024.
  • (2024) Multi-Level Contrastive Learning For Hybrid Cross-Modal Retrieval. In ICASSP 2024 (IEEE International Conference on Acoustics, Speech and Signal Processing), 6390-6394. DOI: 10.1109/ICASSP48485.2024.10447444. Online publication date: 14-Apr-2024.
