Article

VQA-LOL: Visual Question Answering Under the Lens of Logic

Published: 23 August 2020
DOI: 10.1007/978-3-030-58589-1_23

Abstract

Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers to the component questions and to the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a step towards robustness by embedding logical connectives in visual understanding.
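
To make the consistency requirement behind the Fréchet-Compatibility Loss concrete, the Python sketch below checks whether a model's "yes" probability for a composed question lies within the classical Fréchet bounds implied by its answers to the two component questions. This is a minimal sketch under stated assumptions: the helper names and the penalty form are illustrative, not the paper's exact formulation.

    # A minimal sketch, assuming yes/no questions whose "yes" probabilities
    # lie in [0, 1]. Function names are hypothetical, not from the paper.

    def frechet_bounds_and(p_a, p_b):
        # Frechet bounds on P(A and B) given the marginals P(A) and P(B).
        return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)

    def frechet_bounds_or(p_a, p_b):
        # Frechet bounds on P(A or B) given the marginals P(A) and P(B).
        return max(p_a, p_b), min(1.0, p_a + p_b)

    def compatibility_penalty(p_composed, lo, hi):
        # Zero when the composed "yes" probability respects the bounds;
        # otherwise, the distance to the nearest violated bound.
        return max(lo - p_composed, 0.0) + max(p_composed - hi, 0.0)

    # Example: the model answers "yes" to Q1 with probability 0.9 and to Q2
    # with probability 0.8, so its "yes" probability for "Q1 and Q2" must
    # fall in [0.7, 0.8]; answering 0.95 incurs a penalty of 0.15.
    lo, hi = frechet_bounds_and(0.9, 0.8)
    print(compatibility_penalty(0.95, lo, hi))

For negation the check is simpler: the "yes" probability for "not Q" should equal one minus the probability for Q, so no interval is needed.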

Index Terms

  1. VQA-LOL: Visual Question Answering Under the Lens of Logic
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI
August 2020, 831 pages
ISBN: 978-3-030-58588-4
DOI: 10.1007/978-3-030-58589-1

        Publisher

Springer-Verlag, Berlin, Heidelberg


        Author Tags

        1. Visual question answering
        2. Logical robustness



        Cited By

• Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56(10), 1–42 (2024). https://doi.org/10.1145/3656580
• REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models. In: Computer Vision – ECCV 2024, pp. 339–357 (2024). https://doi.org/10.1007/978-3-031-73404-5_20
• Getting it Right: Improving Spatial Consistency in Text-to-Image Models. In: Computer Vision – ECCV 2024, pp. 204–222 (2024). https://doi.org/10.1007/978-3-031-72670-5_12
• Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions. In: Companion Publication of the 25th International Conference on Multimodal Interaction, pp. 101–104 (2023). https://doi.org/10.1145/3610661.3617602
• MultiViz: Towards User-Centric Visualizations and Interpretations of Multimodal Models. In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–21 (2023). https://doi.org/10.1145/3544549.3585604
• Can Machines and Humans Use Negation When Describing Images? In: Human and Artificial Rationalities, pp. 39–47 (2023). https://doi.org/10.1007/978-3-031-55245-8_3
• ICDAR 2023 Competition on Visual Question Answering on Business Document Images. In: Document Analysis and Recognition – ICDAR 2023, pp. 454–470 (2023). https://doi.org/10.1007/978-3-031-41679-8_26
• Inner Knowledge-based Img2Doc Scheme for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 18(3), 1–21 (2022). https://doi.org/10.1145/3489142
• Rethinking Data Augmentation for Robust Visual Question Answering. In: Computer Vision – ECCV 2022, pp. 95–112 (2022). https://doi.org/10.1007/978-3-031-20059-5_6
• Fine-Grained Visual Entailment. In: Computer Vision – ECCV 2022, pp. 398–416 (2022). https://doi.org/10.1007/978-3-031-20059-5_23
