NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario
DOI:
https://doi.org/10.1609/aaai.v38i5.28253
Keywords:
CV: Vision for Robotics & Autonomous Driving, ML: Multimodal Learning, CV: Language and Vision, CV: Multi-modal Vision
Abstract
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. First, the raw visual data are multi-modal, comprising images and point clouds captured by cameras and LiDAR, respectively. Second, the data are multi-frame due to continuous, real-time acquisition. Third, outdoor scenes exhibit both a moving foreground and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and manually design question templates; the question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
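To illustrate the generation pipeline the abstract describes, here is a minimal sketch of template-based question-answer generation from detection annotations. This is not the authors' released code; all names (SceneObject, TEMPLATES, generate_qa) are hypothetical, and the templates are simplified stand-ins for the manually designed ones in the paper.

```python
# Hypothetical sketch: programmatically instantiating hand-designed question
# templates over scene annotations and deriving answers from those annotations.
from dataclasses import dataclass

@dataclass
class SceneObject:
    category: str   # e.g. "car", "pedestrian"
    status: str     # e.g. "moving", "parked"

# Hand-designed templates; <A> marks the category slot.
TEMPLATES = {
    "count": "How many <A>s are there in the scene?",
    "exist": "Is there any <A> in the scene?",
}

def generate_qa(objects, category):
    """Yield (question, answer) pairs for one category, with answers
    computed directly from the annotations."""
    n = sum(1 for o in objects if o.category == category)
    yield TEMPLATES["count"].replace("<A>", category), str(n)
    yield TEMPLATES["exist"].replace("<A>", category), "yes" if n > 0 else "no"

if __name__ == "__main__":
    scene = [SceneObject("car", "moving"),
             SceneObject("car", "parked"),
             SceneObject("pedestrian", "moving")]
    for q, a in generate_qa(scene, "car"):
        print(q, "->", a)
```

Because answers are computed from the same annotations that instantiate the templates, a generator of this shape can produce question-answer pairs at scale without manual labeling, which is what makes the 460K-pair benchmark size feasible.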
Published
2024-03-24
How to Cite
Qian, T., Chen, J., Zhuo, L., Jiao, Y., & Jiang, Y.-G. (2024). NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4542-4550. https://doi.org/10.1609/aaai.v38i5.28253
Section
AAAI Technical Track on Computer Vision IV