Interactive Re-ranking via Object Entropy-Guided Question Answering for Cross-Modal Image Retrieval

Authors: Rintaro Yanagi, Takahiro Ogawa, Miki Haseyama

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 3

Article No.: 68, Pages 1 - 17

https://doi.org/10.1145/3485042

Published: 04 March 2022

Abstract

Cross-modal image-retrieval methods retrieve desired images from a query text by learning relationships between texts and images. This approach is one of the most effective ways to make query preparation easy. Recent cross-modal image-retrieval methods are convenient and accurate when the query text uniquely identifies the desired image. In practice, however, users frequently input ambiguous query texts, which make it difficult to obtain the desired images. To overcome this difficulty, we propose a novel interactive cross-modal image-retrieval method based on question answering. The proposed method analyzes candidate images and asks users questions whose answers narrow down the retrieval candidates. Simply by answering these questions, users can reach their desired images even when starting from an ambiguous query text. Experimental results show the effectiveness of the proposed method.
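To illustrate the question-asking idea, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate image comes with a set of detected object labels, and that the system asks a yes/no question about the object whose presence splits the candidates most evenly (highest binary entropy). The function names, the toy data, and the simple "matching images first" re-ranking rule are all hypothetical.

```python
import math
from typing import Dict, List, Set


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a yes/no outcome with probability p of 'yes'."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def pick_question_object(candidates: Dict[str, Set[str]]) -> str:
    """Pick the object whose presence is most uncertain across candidates.

    Assumption: asking about the highest-entropy object is expected to
    remove the most candidates, whichever way the user answers.
    """
    n = len(candidates)
    counts: Dict[str, int] = {}
    for objects in candidates.values():
        for obj in objects:
            counts[obj] = counts.get(obj, 0) + 1
    # Sort labels first so that ties are broken deterministically.
    return max(sorted(counts), key=lambda o: binary_entropy(counts[o] / n))


def rerank(ranked: List[str], candidates: Dict[str, Set[str]],
           obj: str, answer_yes: bool) -> List[str]:
    """Move images consistent with the answer to the top, keeping order."""
    consistent = [i for i in ranked if (obj in candidates[i]) == answer_yes]
    others = [i for i in ranked if (obj in candidates[i]) != answer_yes]
    return consistent + others


# Toy candidates returned for an ambiguous query (hypothetical data).
candidates = {
    "img1": {"dog", "grass"},
    "img2": {"dog", "ball"},
    "img3": {"cat", "grass"},
    "img4": {"dog", "grass", "ball"},
}
initial_ranking = ["img1", "img2", "img3", "img4"]

question_object = pick_question_object(candidates)
new_ranking = rerank(initial_ranking, candidates, question_object, answer_yes=True)
print(f"Question: does the image contain a {question_object}?")
print(new_ranking)
```

In this toy data, "ball" appears in exactly half of the candidates, so it is the most informative object to ask about under the stated assumption. A "yes" answer moves the two images containing a ball to the top of the ranking.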

Index Terms

Interactive Re-ranking via Object Entropy-Guided Question Answering for Cross-Modal Image Retrieval
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction techniques
      1. Text input
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        1. Image search
  2. Information systems applications
    1. Multimedia information systems
      1. Multimedia databases

Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 18, Issue 3

August 2022

478 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3505208

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2022

Accepted: 01 September 2021

Revised: 01 July 2021

Received: 01 April 2021

Published in TOMM Volume 18, Issue 3

Qualifiers

Research-article
Refereed

Funding Sources

JSPS KAKENHI
