Computer Science > Computation and Language

arXiv:2309.17133 (cs)

[Submitted on 29 Sep 2023 (v1), last revised 28 Oct 2023 (this version, v2)]

Title: Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Authors: Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne

Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of ~61% on the OK-VQA dataset.
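
The contrast the abstract draws between one-dimensional and multi-dimensional embeddings can be illustrated with a small sketch. The abstract does not spell out the scoring function, so the code below assumes the common ColBERT-style "MaxSim" form of late interaction: each query token embedding is matched with its best-scoring document token embedding, and the maxima are summed. The function names and toy vectors are hypothetical and are not taken from the paper or its code release.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """DPR-style relevance: one embedding per query and one per document."""
    return dot(query_vec, doc_vec)


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """Assumed MaxSim late interaction: for every query token embedding,
    take the highest inner product against any document token embedding,
    then sum those maxima over the query tokens."""
    return sum(
        max(dot(q, d) for d in doc_tokens)
        for q in query_tokens
    )


# Toy example with two query token embeddings and three document token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 0.5], [0.5, 0.5]]

score = late_interaction_score(query_tokens, doc_tokens)
print(score)  # 1.5: best match 1.0 for the first query token, 0.5 for the second
```

In a multi-modal setting such as the one the abstract describes, the query token list would hold both question-token and image-derived embeddings, so each of them can find its own best-matching document token instead of being averaged into a single vector.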

Comments:	To appear at NeurIPS 2023. This is the camera-ready version. We fixed some numbers and added more experiments to address reviewers' comments.
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.17133 [cs.CL]
	(or arXiv:2309.17133v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.17133

Submission history

From: Weizhe Lin
[v1] Fri, 29 Sep 2023 10:54:10 UTC (2,365 KB)
[v2] Sat, 28 Oct 2023 16:03:35 UTC (2,367 KB)
