NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Authors

  • Tianwen Qian Academy for Engineering and Technology, Fudan University
  • Jingjing Chen Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University
  • Linhai Zhuo Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University
  • Yang Jiao Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University
  • Yu-Gang Jiang Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University

DOI:

https://doi.org/10.1609/aaai.v38i5.28253

Keywords:

CV: Vision for Robotics & Autonomous Driving, ML: Multimodal Learning, CV: Language and Vision, CV: Multi-modal Vision

Abstract

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents additional challenges. First, the raw visual data are multi-modal, comprising images and point clouds captured by cameras and LiDAR, respectively. Second, the data are multi-frame due to continuous, real-time acquisition. Third, the outdoor scenes contain both moving foreground objects and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and manually design question templates; question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
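To make the data-construction pipeline described in the abstract concrete, the sketch below illustrates how a question-answer pair could be generated programmatically from a scene graph built on 3D detection annotations. It is a minimal illustration, not the authors' released code: the `SceneObject` fields, the counting template, and the example scene are all hypothetical stand-ins for the actual NuScenes-QA schema and templates.

```python
# Minimal sketch of template-based QA generation from a scene graph.
# Categories, attributes, and the template wording are illustrative only,
# not the exact NuScenes-QA templates or annotation schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneObject:
    category: str  # e.g. "car", "pedestrian" (hypothetical label set)
    status: str    # e.g. "moving", "parked", "standing"

def generate_count_qa(scene: List[SceneObject],
                      category: str, status: str) -> Tuple[str, str]:
    """Fill a counting template and derive the answer by querying the scene."""
    question = f"How many {status} {category}s are there?"
    answer = sum(1 for obj in scene
                 if obj.category == category and obj.status == status)
    return question, str(answer)

if __name__ == "__main__":
    # A toy scene standing in for the objects annotated in one keyframe.
    scene = [
        SceneObject("car", "moving"),
        SceneObject("car", "parked"),
        SceneObject("pedestrian", "standing"),
    ]
    q, a = generate_count_qa(scene, "car", "parked")
    print(q, "->", a)  # How many parked cars are there? -> 1
```

In this spirit, each manually written template is instantiated with object categories, attributes, and spatial relations drawn from the scene graph, and the answer is computed by executing the corresponding query over the graph, which is what allows the benchmark to scale to 460K question-answer pairs.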

Published

2024-03-24

How to Cite

Qian, T., Chen, J., Zhuo, L., Jiao, Y., & Jiang, Y.-G. (2024). NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4542-4550. https://doi.org/10.1609/aaai.v38i5.28253

Issue

Vol. 38 No. 5 (2024)

Section

AAAI Technical Track on Computer Vision IV