Abstract
To boost research into cognition-level visual understanding, i.e., making an accurate inference based on a thorough understanding of visual details, visual commonsense reasoning (VCR) has been proposed. Compared with traditional visual question answering, which requires models only to select correct answers, VCR requires models to select not only the correct answers but also the correct rationales. Recent research into human cognition has indicated that brain function, or cognition, can be considered a global and dynamic integration of local neuron connectivity, which is helpful in solving specific cognition tasks. Inspired by this idea, we propose a directional connective network that achieves VCR by dynamically reorganizing visual neuron connectivity contextualized by the meaning of questions and answers, and by leveraging directional information to enhance reasoning ability. Specifically, we first develop a GraphVLAD module that captures visual neuron connectivity to fully model correlations of visual content. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the output of the contextualized connectivity, we propose directional connectivity, which includes a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
Abstract (Chinese)
To promote research into cognition-level understanding of visual content, i.e., making accurate inferences based on a thorough understanding of visual details, the concept of visual commonsense reasoning has been proposed. Compared with traditional visual question answering, which requires a model only to answer questions correctly, visual commonsense reasoning requires the model not only to answer questions correctly but also to provide the corresponding explanations. Recent research on human cognition has indicated that brain cognition can be regarded as a global and dynamic integration of local neuron connectivity, which helps solve specific cognition tasks. Inspired by this, this paper proposes a directional connective network. By contextualizing visual neurons with the semantics of questions and answers to dynamically reorganize neuron connectivity, and by using directional information to enhance reasoning ability, the proposed method can effectively perform visual commonsense reasoning. Specifically, a GraphVLAD module is first developed to capture visual neuron connectivity that fully expresses the correlations of visual content. Then, a contextualization model is proposed to fuse visual and textual representations. Finally, based on the output of the contextualized connectivity, directional connectivity, which includes a ReasonVLAD module, is designed to infer answers and the corresponding explanations. Experimental results and visualization analysis demonstrate the effectiveness of the proposed method.
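The abstract outlines a three-stage pipeline (GraphVLAD aggregation of visual neurons, contextualization with the question and answer text, and directional reasoning via ReasonVLAD) but gives no implementation detail. The following is a minimal PyTorch sketch of such a pipeline under our own assumptions: every module name, dimension, and fusion choice below (VLADAggregation, GraphPropagation, DirectionalReasoner, the similarity-based adjacency, the concatenation fusion) is a hypothetical illustration inspired by NetVLAD-style aggregation and graph message passing, not the authors' actual architecture.

```python
# Hypothetical sketch of a GraphVLAD -> contextualization -> reasoning pipeline.
# Shapes, module names, and fusion choices are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLADAggregation(nn.Module):
    """NetVLAD-style soft assignment of local region features to K "visual neurons"."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, feats):                      # feats: (B, N, D) region features
        a = F.softmax(self.assign(feats), dim=-1)  # (B, N, K) soft assignments
        residual = feats.unsqueeze(2) - self.centroids             # (B, N, K, D)
        vlad = (a.unsqueeze(-1) * residual).sum(dim=1)              # (B, K, D)
        return F.normalize(vlad, dim=-1)           # one descriptor per visual neuron


class GraphPropagation(nn.Module):
    """Fully connected graph over the K visual neurons; edges from feature similarity."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                      # nodes: (B, K, D)
        adj = F.softmax(nodes @ nodes.transpose(1, 2), dim=-1)      # (B, K, K)
        return nodes + F.relu(self.proj(adj @ nodes))               # message passing


class DirectionalReasoner(nn.Module):
    """Contextualizes visual neurons with the question+answer encoding and scores a choice."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, nodes, text):                # nodes: (B, K, D), text: (B, D)
        ctx = torch.cat([nodes, text.unsqueeze(1).expand_as(nodes)], dim=-1)
        ctx = torch.tanh(self.fuse(ctx))           # contextualized connectivity
        return self.score(ctx.mean(dim=1)).squeeze(-1)              # one logit per choice


if __name__ == "__main__":
    B, N, K, D = 2, 36, 8, 256                     # batch, regions, neurons, feature dim
    vlad, graph, reasoner = VLADAggregation(D, K), GraphPropagation(D), DirectionalReasoner(D)
    regions = torch.randn(B, N, D)                 # stand-in for detector region features
    qa_text = torch.randn(B, D)                    # stand-in for a sentence embedding
    logits = reasoner(graph(vlad(regions)), qa_text)
    print(logits.shape)                            # torch.Size([2])
```

In a full VCR model, the text encoding would come from a language model such as BERT, and one such logit would be produced per candidate answer (and per candidate rationale) and trained with cross-entropy over the four choices; random tensors stand in for those inputs here.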
Author information
Contributions
Yahong HAN designed the research. Aming WU conducted the experiments and drafted the manuscript. Linchao ZHU helped organize the manuscript. Yi YANG revised the paper.
Ethics declarations
Yahong HAN, Aming WU, Linchao ZHU, and Yi YANG declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (Nos. 61876130 and 61932009)
About this article
Cite this article
Han, Y., Wu, A., Zhu, L. et al. Visual commonsense reasoning with directional visual connections. Front Inform Technol Electron Eng 22, 625–637 (2021). https://doi.org/10.1631/FITEE.2000722
DOI: https://doi.org/10.1631/FITEE.2000722
Key words
- Visual commonsense reasoning
- Directional connective network
- Visual neuron connectivity
- Contextualized connectivity
- Directional connectivity