Interactive Re-ranking via Object Entropy-Guided Question Answering for Cross-Modal Image Retrieval

Published: 04 March 2022

Abstract

Cross-modal image-retrieval methods retrieve desired images from a query text by learning the relationships between texts and images. This approach is one of the most effective ways of simplifying query preparation. Recent cross-modal image-retrieval methods are convenient and accurate when users input a query text that uniquely identifies the desired image. In practice, however, users frequently input ambiguous query texts, which make it difficult to obtain the desired images. To overcome this difficulty, we propose a novel interactive cross-modal image-retrieval method based on question answering. The proposed method analyzes the candidate images and asks the user questions to obtain information that narrows down the retrieval candidates. Simply by answering the questions generated by the proposed method, users can reach their desired images even when starting from an ambiguous query text. Experimental results demonstrate the effectiveness of the proposed method.
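
The question-asking loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each candidate image already comes with a set of detected object labels, and it reads the "object entropy" of the title as the binary entropy of an object's presence across the current candidates, which is an assumption made for illustration only. The function names (presence_entropy, pick_question, apply_answer) and the sample data are hypothetical.

```python
import math
from typing import Dict, Optional, Set

# Hypothetical representation: image id -> set of object labels found in it.
Candidates = Dict[str, Set[str]]


def presence_entropy(obj: str, candidates: Candidates) -> float:
    """Binary entropy (in bits) of whether `obj` appears in a candidate image."""
    p = sum(obj in objs for objs in candidates.values()) / len(candidates)
    if p in (0.0, 1.0):
        return 0.0  # the answer is already known, so asking gains nothing
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def pick_question(candidates: Candidates) -> Optional[str]:
    """Return the object whose presence splits the candidates most evenly."""
    vocabulary = sorted(set().union(*candidates.values()))
    best = max(vocabulary,
               key=lambda o: presence_entropy(o, candidates),
               default=None)
    if best is None or presence_entropy(best, candidates) == 0.0:
        return None  # no remaining question can narrow the candidates
    return best


def apply_answer(candidates: Candidates, obj: str, answer_yes: bool) -> Candidates:
    """Keep only the candidates consistent with the user's yes/no answer."""
    return {img: objs for img, objs in candidates.items()
            if (obj in objs) == answer_yes}


# Illustrative data for an ambiguous query such as "a dog".
demo: Candidates = {
    "img_a": {"dog", "ball", "grass"},
    "img_b": {"dog", "frisbee", "grass"},
    "img_c": {"dog", "ball", "beach"},
    "img_d": {"dog", "sofa"},
}
question = pick_question(demo)                 # object to ask the user about
remaining = apply_answer(demo, question, True)  # user answers "yes"
```

In this sketch, an object that appears in every candidate (here "dog") is never asked about because its entropy is zero, while an object present in roughly half of the candidates is preferred because either answer removes about half of them.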


Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022
478 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3505208

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2022
Accepted: 01 September 2021
Revised: 01 July 2021
Received: 01 April 2021
Published in TOMM Volume 18, Issue 3

Author Tags

  1. Cross-modal image retrieval
  2. re-ranking
  3. question answering

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • JSPS KAKENHI

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 88
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 23 Nov 2024

Citations

Cited By

  • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-20). DOI: 10.1145/3648370. Online publication date: 26-Mar-2024
  • (2024) Multiple Pseudo-Siamese Network with Supervised Contrast Learning for Medical Multi-modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20:5 (1-23). DOI: 10.1145/3637441. Online publication date: 11-Jan-2024
  • (2024) Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts. International Journal of Multimedia Information Retrieval 13:1. DOI: 10.1007/s13735-024-00322-y. Online publication date: 29-Feb-2024
  • (2023) Boosting Diversity in Visual Search with Pareto Non-Dominated Re-Ranking. ACM Transactions on Multimedia Computing, Communications, and Applications 20:3 (1-23). DOI: 10.1145/3625296. Online publication date: 10-Nov-2023
  • (2023) Zero-shot Scene Graph Generation via Triplet Calibration and Reduction. ACM Transactions on Multimedia Computing, Communications, and Applications 20:1 (1-21). DOI: 10.1145/3604284. Online publication date: 8-Jun-2023
  • (2023) AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 19:6 (1-22). DOI: 10.1145/3584703. Online publication date: 20-Feb-2023
  • (2023) Recallable Question Answering-Based Re-Ranking Considering Semantic Region for Cross-Modal Retrieval. IEEE Open Journal of Signal Processing 4 (1-11). DOI: 10.1109/OJSP.2023.3238280. Online publication date: 2023
  • (2023) Cross-Modal Image Retrieval Considering Semantic Relationships With Many-to-Many Correspondence Loss. IEEE Access 11 (10675-10686). DOI: 10.1109/ACCESS.2023.3239858. Online publication date: 2023
