Article

VQA-LOL: Visual Question Answering Under the Lens of Logic

Published: 23 August 2020
DOI: 10.1007/978-3-030-58589-1_23

Abstract

Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers to the component questions and to the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a step towards robustness by embedding logical connectives in visual understanding.
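
To make the consistency requirement behind the Fréchet-Compatibility Loss concrete, the Python sketch below checks whether a model's "yes" probability for a composed question lies within the classical Fréchet bounds implied by its answers to the two component questions. This is a minimal sketch under stated assumptions: the helper names and the penalty form are illustrative, not the paper's exact formulation.

    # A minimal sketch, assuming yes/no questions whose "yes" probabilities
    # lie in [0, 1]. Function names are hypothetical, not from the paper.

    def frechet_bounds_and(p_a, p_b):
        # Frechet bounds on P(A and B) given the marginals P(A) and P(B).
        return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)

    def frechet_bounds_or(p_a, p_b):
        # Frechet bounds on P(A or B) given the marginals P(A) and P(B).
        return max(p_a, p_b), min(1.0, p_a + p_b)

    def compatibility_penalty(p_composed, lo, hi):
        # Zero when the composed "yes" probability respects the bounds;
        # otherwise, the distance to the nearest violated bound.
        return max(lo - p_composed, 0.0) + max(p_composed - hi, 0.0)

    # Example: the model answers "yes" to Q1 with probability 0.9 and to Q2
    # with probability 0.8, so its "yes" probability for "Q1 and Q2" must
    # fall in [0.7, 0.8]; answering 0.95 incurs a penalty of 0.15.
    lo, hi = frechet_bounds_and(0.9, 0.8)
    print(compatibility_penalty(0.95, lo, hi))

For negation the check is simpler: the "yes" probability for "not Q" should equal one minus the probability for Q, so no interval is needed.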

Index Terms

  1. VQA-LOL: Visual Question Answering Under the Lens of Logic
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI
August 2020, 831 pages
ISBN: 978-3-030-58588-4
DOI: 10.1007/978-3-030-58589-1

        Publisher

Springer-Verlag, Berlin, Heidelberg


        Author Tags

        1. Visual question answering
        2. Logical robustness



        Cited By

• Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56(10), 1–42 (2024). https://doi.org/10.1145/3656580
• REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models. In: Computer Vision – ECCV 2024, pp. 339–357 (2024). https://doi.org/10.1007/978-3-031-73404-5_20
• Getting it Right: Improving Spatial Consistency in Text-to-Image Models. In: Computer Vision – ECCV 2024, pp. 204–222 (2024). https://doi.org/10.1007/978-3-031-72670-5_12
• Tutorial on Multimodal Machine Learning: Principles, Challenges, and Open Questions. In: Companion Publication of the 25th International Conference on Multimodal Interaction, pp. 101–104 (2023). https://doi.org/10.1145/3610661.3617602
• MultiViz: Towards User-Centric Visualizations and Interpretations of Multimodal Models. In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–21 (2023). https://doi.org/10.1145/3544549.3585604
• Can Machines and Humans Use Negation When Describing Images? In: Human and Artificial Rationalities, pp. 39–47 (2023). https://doi.org/10.1007/978-3-031-55245-8_3
• ICDAR 2023 Competition on Visual Question Answering on Business Document Images. In: Document Analysis and Recognition – ICDAR 2023, pp. 454–470 (2023). https://doi.org/10.1007/978-3-031-41679-8_26
• Inner Knowledge-based Img2Doc Scheme for Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 18(3), 1–21 (2022). https://doi.org/10.1145/3489142
• Rethinking Data Augmentation for Robust Visual Question Answering. In: Computer Vision – ECCV 2022, pp. 95–112 (2022). https://doi.org/10.1007/978-3-031-20059-5_6
• Fine-Grained Visual Entailment. In: Computer Vision – ECCV 2022, pp. 398–416 (2022). https://doi.org/10.1007/978-3-031-20059-5_23
