research-article

Cross-Modal Retrieval between Event-Dense Text and Image

Authors:

Zhongwei Xie,

Lin Li,

Luo Zhong,

Jianquan Liu,

Ling LiuAuthors Info & Claims

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval

Pages 229 - 238

https://doi.org/10.1145/3512527.3531374

Published: 27 June 2022 Publication History

Get Access

Abstract

This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.

Supplementary Material

MP4 File (ICMR22-fp109.mp4)

This paper introduces the task of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. Our proposed approach incorporates event-oriented features to enhance the cross-modal alignment. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval.

Download
51.18 MB

References

[1]

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101--mining discriminative components with random forests. In European conference on computer vision. Springer, 446--461.

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval

Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Multimodal Encoders for Food-Oriented Cross-Modal Retrieval

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Get Access

Login options

Full Access

View options

PDF

eReader

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations