Research article • Free access • Just Accepted

QR-CLIP: Introducing Explicit Knowledge for Location and Time Reasoning

Online AM: 24 August 2024

Abstract

This paper focuses on reasoning about the location and time behind images. Given that pre-trained vision-language models (VLMs) exhibit excellent image and text understanding capabilities, most existing methods leverage them to match visual cues with location- and time-related descriptions. However, these methods cannot look beyond the actual content of an image and thus fail to produce satisfactory reasoning results, because such reasoning requires connecting visual details with rich external cues (e.g., relevant event contexts). To this end, we propose a novel reasoning method, QR-CLIP, aimed at enhancing the model's ability to reason about location and time through interaction with external explicit knowledge such as Wikipedia. Specifically, QR-CLIP consists of two modules: 1) the Quantity module abstracts the image into multiple distinct representations and uses them to search for and gather external knowledge from different perspectives that benefits the model's reasoning; 2) the Relevance module filters the visual features and the retrieved explicit knowledge and dynamically integrates them into a comprehensive reasoning result. Extensive experiments demonstrate the effectiveness and generalizability of QR-CLIP. On the WikiTiLo dataset, QR-CLIP boosts the accuracy of location (country) and time reasoning by 7.03% and 2.22%, respectively, over previous SOTA methods. On the more challenging TARA dataset, it improves the accuracy of location and time reasoning by 3.05% and 2.45%, respectively. The source code is available at https://github.com/Shi-Wm/QR-CLIP.
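To make the two-module design concrete, below is a minimal PyTorch sketch of how a Quantity/Relevance pipeline of this kind could look. It is an illustration based only on the abstract, not the authors' released code (see the repository linked above for that): the class names QuantityModule and RelevanceModule, the pre-encoded knowledge bank, the number of queries, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantityModule(nn.Module):
    # Hypothetical sketch: abstracts one image embedding into K distinct query
    # vectors, each retrieving a knowledge entry from a different perspective.
    def __init__(self, dim: int, num_queries: int = 4):
        super().__init__()
        self.projections = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_queries)])

    def forward(self, image_emb: torch.Tensor, knowledge_bank: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D); knowledge_bank: (N, D) pre-encoded external text (e.g., Wikipedia)
        bank = F.normalize(knowledge_bank, dim=-1)
        retrieved = []
        for proj in self.projections:
            query = F.normalize(proj(image_emb), dim=-1)   # (B, D)
            nearest = (query @ bank.T).argmax(dim=-1)      # (B,) index of best-matching entry
            retrieved.append(knowledge_bank[nearest])      # (B, D)
        return torch.stack(retrieved, dim=1)               # (B, K, D)


class RelevanceModule(nn.Module):
    # Hypothetical sketch: gates each retrieved knowledge vector by its
    # relevance to the image and fuses the mixture with the visual feature.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, image_emb: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        b, k, d = knowledge.shape
        img = image_emb.unsqueeze(1).expand(b, k, d)                  # (B, K, D)
        scores = self.gate(torch.cat([img, knowledge], dim=-1))      # (B, K, 1)
        weights = torch.softmax(scores.squeeze(-1), dim=-1)          # (B, K)
        return image_emb + (weights.unsqueeze(-1) * knowledge).sum(dim=1)  # (B, D)


# Usage with random stand-ins for CLIP image features and a knowledge bank:
B, D, N = 2, 512, 1000
image_emb = torch.randn(B, D)
knowledge_bank = torch.randn(N, D)
fused = RelevanceModule(D)(image_emb, QuantityModule(D)(image_emb, knowledge_bank))
print(fused.shape)  # torch.Size([2, 512]) — one reasoning feature per image
```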



Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
EISSN: 1551-6865
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 24 August 2024
Accepted: 15 August 2024
Revised: 29 June 2024
Received: 15 December 2023


Author Tags

  1. Multimodal Learning
  2. Visual Reasoning
  3. CLIP
  4. Distributed Cognition

Qualifiers

  • Research-article


