DOI: 10.1145/3512527.3531374
research-article

Cross-Modal Retrieval between Event-Dense Text and Image

Published: 27 June 2022

Abstract

This paper presents a novel approach to cross-modal retrieval between event-dense text and images, where the text describes numerous events. Modality alignment is known to be crucial for retrieval performance. However, because an image carries no event sequence information, fine-grained alignment of event-dense text with the image is challenging. Our approach incorporates event-oriented features to enhance cross-modal alignment, and we apply event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event with a Transformer and combine it with the identified key event elements to enhance the discriminative ability of the learned text embedding, which summarizes all the events. Next, we produce the image embedding by combining the event tag shared by the text and image with the visual embedding of the event-related image regions; this embedding describes the eventual consequence of all the events and facilitates event-based cross-modal alignment. Finally, we integrate the text embedding and the image embedding through loss optimization empowered with the event tag, which iteratively regulates the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our event-oriented modality alignment approach significantly outperforms the state-of-the-art approach, with a 23.3% improvement in top-1 recall for image-to-recipe retrieval on the Recipe1M 10k test set.
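For readers unfamiliar with the reported metric, the following is a minimal sketch of how top-k recall for image-to-text retrieval can be computed over paired embeddings. It is not the authors' evaluation code: the cosine similarity, the function names `cosine` and `recall_at_k`, and the toy two-dimensional vectors are all assumptions made for illustration, and the paper's exact protocol on Recipe1M is not described on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(image_embs, text_embs, k=1):
    """Fraction of image queries whose paired text (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(image_embs):
        scores = [cosine(query, t) for t in text_embs]
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(image_embs)

# Toy embeddings (hypothetical): image i is paired with text i.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

r1 = recall_at_k(images, texts, k=1)
r2 = recall_at_k(images, texts, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy setup only the second image retrieves its own text first, while the first image finds its match at rank two, so recall rises as k grows. A real evaluation would use the learned joint embeddings in place of the hand-written vectors.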

Supplementary Material

MP4 File (ICMR22-fp109.mp4)


Information & Contributors

Information

Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-modal retrieval
  2. event-dense
  3. modality alignment

Qualifiers

  • Research-article

Funding Sources

  • IBM faculty award
  • China Scholarship Council
  • the Key R&D project of Hubei Province
  • the USA National Science Foundation

Conference

ICMR '22
Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 52
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval. Foods 13:11 (1628). DOI: 10.3390/foods13111628. Online publication date: 23-May-2024
  • (2024) Boosting Healthiness Exposure in Category-constrained Meal Recommendation Using Nutritional Standards. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3643859. Online publication date: 5-Feb-2024
  • (2024) CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12 (33283-33295). DOI: 10.1109/ACCESS.2024.3370158. Online publication date: 2024
  • (2022) GSAIC: GeoScience Articles Illustration and Caption Dataset. Highlights in Science, Engineering and Technology 9 (289-297). DOI: 10.54097/hset.v9i.1858. Online publication date: 30-Sep-2022
