Abstract
Visual Question Answering (VQA) is an active and popular research direction, and mitigating language priors has become a central concern in the field in recent years. As VQA techniques have matured, researchers have observed that answer generation often relies too heavily on language priors while paying too little attention to visual content. Previous approaches either alleviate language priors by processing only the question, or improve visual grounding by concentrating only on locating the correct image region. To better overcome the language prior problem in VQA, we propose a method that further strengthens visual content so that it has a greater influence on the predicted answer. Our method consists of three parts: a base model branch, a question-only model branch, and a visual model branch. Extensive experiments on three datasets, VQA-CP v1, VQA-CP v2, and VQA v2, demonstrate the effectiveness of our method and show accuracy improvements across different base models. Our code is available on GitHub (https://github.com/shonnon-zxs/AddingVisualModule).
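To make the three-branch design concrete, the sketch below shows one plausible way such a model could be wired up in PyTorch: a full vision-plus-question base branch, a question-only branch that exposes the language prior, and a visual-only branch that strengthens the image evidence. This is a minimal illustration only; the class names, feature dimensions, and the RUBi-style mask fusion are assumptions on our part, not the authors' exact implementation, which is in the linked repository.

```python
# Minimal sketch of a three-branch debiasing model (illustrative assumptions,
# not the authors' exact code).
import torch
import torch.nn as nn


class ThreeBranchVQA(nn.Module):
    def __init__(self, base_model: nn.Module, q_dim: int, v_dim: int, n_answers: int):
        super().__init__()
        self.base = base_model                      # full VQA branch: f(image, question)
        self.q_only = nn.Sequential(                # question-only branch: captures the language prior
            nn.Linear(q_dim, 1024), nn.ReLU(), nn.Linear(1024, n_answers))
        self.v_only = nn.Sequential(                # visual-only branch: strengthens image evidence
            nn.Linear(v_dim, 1024), nn.ReLU(), nn.Linear(1024, n_answers))

    def forward(self, v_feat, q_feat):
        # v_feat: (batch, num_regions, v_dim) region features; q_feat: (batch, q_dim) question encoding
        base_logits = self.base(v_feat, q_feat)     # joint image+question prediction
        q_logits = self.q_only(q_feat)              # prediction from the question alone
        v_logits = self.v_only(v_feat.mean(dim=1))  # prediction from pooled image features alone
        # Assumed RUBi-style fusion: the question-only mask down-weights answers the
        # language prior alone can guess, while the visual mask re-emphasises answers
        # supported by the image.
        fused = base_logits * torch.sigmoid(q_logits) * torch.sigmoid(v_logits)
        return fused, base_logits, q_logits, v_logits
```

In ensemble-based debiasing methods of this kind, the unimodal branches are typically supervised only during training to regularize the base branch and are discarded at test time; we do not claim this is exactly the authors' training scheme.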
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61906044, in part by the China Postdoctoral Science Foundation under Grant 2020M681984, and in part by the Key Projects of Natural Science Research in Anhui Colleges and Universities under Grants KJ2019A0532, KJ2019A0536, and KJ2020ZD48.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhao, J., Zhang, X., Wang, X. et al. Overcoming language priors in VQA via adding visual module. Neural Comput & Applic 34, 9015–9023 (2022). https://doi.org/10.1007/s00521-022-06923-0