Feature Enhancement in Attention for Visual Question Answering

Yuetan Lin, Zhangyang Pang, Donghui Wang, Yueting Zhuang

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 4216-4222. https://doi.org/10.24963/ijcai.2018/586

The attention mechanism has been an indispensable part of Visual Question Answering (VQA) models, owing to its ability to select salient image regions and/or question words. However, in almost all VQA models the attention mechanism takes as input image visual features and question textual features, which stem from different sources and between which there is an essential semantic gap. To improve the accuracy of region-question correlation in attention, we focus on the region representation and propose the idea of feature enhancement, which comprises three aspects: (1) we leverage region semantic representations, which are more consistent with the question representation; (2) we enrich the region representation with features from multiple hierarchies; and (3) we refine the semantic representation to carry richer information. With these three incremental feature enhancement mechanisms, we improve the region representation and achieve better attentive effect and VQA performance. We conduct extensive experiments on the largest VQA v2.0 benchmark dataset, achieve competitive results without additional training data, and demonstrate the effectiveness of our proposed feature-enhanced attention through visualizations.
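To illustrate the kind of attention mechanism the abstract refers to, here is a minimal sketch of single-glimpse region attention for VQA: region features and a question embedding are projected into a shared space, fused, and scored, and the softmax-normalized scores weight the regions. The projection matrices and the fusion form (additive, with tanh) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attend(region_feats, question_vec, W_r, W_q, w):
    """Score each region against the question and return the attended feature.

    region_feats: (K, dv) visual features for K image regions
    question_vec: (dq,)   question embedding
    W_r, W_q, w:  illustrative learned parameters (here random placeholders)
    """
    # Project both modalities into a shared d-dim space and fuse additively.
    proj = np.tanh(region_feats @ W_r + question_vec @ W_q)   # (K, d)
    scores = proj @ w                                         # (K,) one score per region
    # Softmax over regions (stabilized by subtracting the max score).
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Attention-weighted sum of the original region features.
    return alpha @ region_feats                               # (dv,)

# Toy usage with random placeholder parameters (dimensions are illustrative).
rng = np.random.default_rng(0)
K, dv, dq, d = 36, 2048, 512, 256
ctx = attend(rng.standard_normal((K, dv)),
             rng.standard_normal(dq),
             rng.standard_normal((dv, d)),
             rng.standard_normal((dq, d)),
             rng.standard_normal(d))
print(ctx.shape)  # attended visual feature, shape (2048,)
```

The paper's feature enhancement targets the `region_feats` input of such a mechanism, replacing or augmenting purely visual features with semantic and multi-hierarchy representations so that the fusion with the question embedding crosses a smaller semantic gap.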
Keywords:
Natural Language Processing: Question Answering
Robotics: Robotics and Vision
Computer Vision: Language and Vision
Knowledge Representation and Reasoning: Knowledge Representation and Reasoning for Game Playing
Machine Learning Applications: Other Applications
Computer Vision: Computer Vision