Visual question answering (VQA) is a multi-modal task spanning natural language processing (NLP) and computer vision (CV). It requires models to understand visual and textual information simultaneously in order to predict the correct answer to a textual question about an input image, and it has been widely applied in intelligent transportation systems, smart cities, and other fields. Current advanced VQA approaches model dense interactions between image regions and question words through co-attention mechanisms to achieve better accuracy. However, modeling interactions between every image region and every question word forces the model to compute irrelevant information, which distracts its attention. To solve this problem, we propose a novel model called Multi-modal Explicit Sparse Attention Networks (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. We argue that this top-k selection method reduces the interference caused by irrelevant information and ultimately helps the model achieve better performance. Experimental results on the benchmark dataset VQA v2 demonstrate the effectiveness of our model. Our best single model delivers 70.71% and 71.08% overall accuracy on the test-dev and test-std sets, respectively. In addition, attention visualizations show that our model obtains better attended features than other advanced models. Our work proves that models with sparse attention mechanisms can achieve competitive results on VQA datasets. We hope it can promote the development of VQA models and the application of VQA-related artificial intelligence (AI) technology in various fields.
Keywords: attention mechanism; computer vision; natural language processing; sparse attention; visual question answering.
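As a minimal sketch of the top-k selection idea described in the abstract, the snippet below masks all but the k largest attention scores per query before the softmax, so attention weight is concentrated on the most relevant positions. The function name, tensor shapes, and the value of k are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(query, key, value, k=8):
    """Scaled dot-product attention keeping only the top-k scores per query.

    Sketch only: scores outside the top-k positions are set to -inf before
    the softmax, so they receive zero attention weight.
    """
    d = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d ** 0.5  # (..., n_q, n_k)

    k = min(k, scores.size(-1))
    # k-th largest score for each query row, broadcast to the score matrix.
    threshold = torch.topk(scores, k, dim=-1).values[..., -1:].expand_as(scores)
    # Mask scores below the threshold so they get zero weight after softmax.
    masked = scores.masked_fill(scores < threshold, float("-inf"))

    weights = F.softmax(masked, dim=-1)
    return torch.matmul(weights, value)

# Illustrative usage: 36 image-region features attended by 14 question-word queries.
q = torch.randn(1, 14, 512)
kv = torch.randn(1, 36, 512)
out = topk_sparse_attention(q, kv, kv, k=8)
print(out.shape)  # torch.Size([1, 14, 512])
```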