Authors: Tung Le 1; Khoa Pho 1; Thong Bui 2,3; Huy Tien Nguyen 2,3 and Minh Le Nguyen 1
Affiliations:
1 School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
2 Faculty of Information Technology, University of Science, Ho Chi Minh city, Vietnam
3 Vietnam National University, Ho Chi Minh city, Vietnam
Keyword(s):
Visual Question Classification, Object-less Image, Vision-language Model, Vision Transformer, VizWiz-VQA.
Abstract:
Despite the long-standing presence of question types in Visual Question Answering datasets, Visual Question Classification has not received enough public interest in research. Unlike general text classification, a visual question requires an understanding of visual and textual features simultaneously. Beyond the novelty of Visual Question Classification itself, the most important and practical goal we concentrate on is to address the weakness of Object Detection on object-less images. We therefore propose an Object-less Visual Question Classification model, OL-LXMERT, which generates virtual objects to replace the dependence on Object Detection in previous Vision-Language systems. Our architecture is effective and powerful enough to digest local and global features of images in understanding the relationship between multiple modalities. Through experiments on our modified VizWiz-VQC 2020 dataset of blind people, our Object-less LXMERT achieves promising results in this brand-new multi-modal task. Furthermore, detailed ablation studies show the strength and potential of our model in comparison to competitive approaches.