

Inner Knowledge-based Img2Doc Scheme for Visual Question Answering

Published: 04 March 2022

Abstract

Visual Question Answering (VQA) is a research topic of significant interest at the intersection of computer vision and natural language understanding. Recent research indicates that attributes and knowledge can effectively improve performance on both image captioning and VQA. In this article, an inner knowledge-based Img2Doc algorithm for VQA is presented. Inner knowledge is characterized as the relationships among attributes within a visual image. In addition to an attribute network for inner knowledge-based image representation, the VQA scheme employs a question-guided Doc2Vec method for question answering. The attribute network generates inner knowledge-based features for visual images, while the novel question-guided Doc2Vec method converts natural language text into vector features. The extracted text vectors are then combined with the visual image features and fed into a classifier that produces the answer. In this way, the VQA problem is reduced to textual question answering. Experimental results demonstrate that the proposed method achieves superior performance on multiple benchmark datasets.
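To make the described pipeline concrete, the following is a minimal Python sketch of the fusion step, not the authors' implementation: question text is embedded with a Doc2Vec model (here via gensim), concatenated with attribute features from the image side (stubbed with random vectors in place of a real attribute network), and passed to a classifier over answer indices. The toy corpus, feature dimensions, and logistic-regression classifier are all illustrative assumptions.

```python
# Minimal sketch of the abstract's fusion pipeline; NOT the authors'
# implementation. Names, dimensions, and the toy corpus are assumptions;
# the attribute network is stubbed with random features.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# 1. Question-guided text embedding: train a toy Doc2Vec model on questions.
questions = ["what color is the dog", "how many chairs are there"]
corpus = [TaggedDocument(q.split(), [i]) for i, q in enumerate(questions)]
doc2vec = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)

def question_vector(question: str) -> np.ndarray:
    """Infer a fixed-length vector for a (possibly unseen) question."""
    return doc2vec.infer_vector(question.split())

# 2. Inner knowledge-based image features: stand-in for attribute-network output.
attr_feats = np.random.rand(len(questions), 128)

# 3. Fuse the two modalities by concatenation and classify over answer indices.
X = np.stack([np.concatenate([f, question_vector(q)])
              for f, q in zip(attr_feats, questions)])
y = np.array([0, 1])  # indices into a hypothetical answer vocabulary

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))  # predicted answer indices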




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022
478 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3505208

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 March 2022
Accepted: 01 September 2021
Received: 01 May 2021
Published in TOMM Volume 18, Issue 3


Author Tags

  1. VQA
  2. dense image captioning
  3. Doc2Vec
  4. inner knowledge-based
  5. attribute network

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Jiangsu for Distinguished Young Scientists
  • Postdoctoral Research Plan of Jiangsu Province
  • Postdoctoral Science Foundation of China
  • Nanjing University of Posts and Telecommunications Program

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 68
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 26 Nov 2024

Cited By

  • (2024) Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3660346. Online publication date: 22-Apr-2024.
  • (2024) Effective Video Summarization by Extracting Parameter-Free Motion Attention. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–20. DOI: 10.1145/3654670. Online publication date: 16-May-2024.
  • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology 15, 3, 1–25. DOI: 10.1145/3645099. Online publication date: 15-Apr-2024.
  • (2024) Distributed Learning Mechanisms for Anomaly Detection in Privacy-Aware Energy Grid Management Systems. ACM Transactions on Sensor Networks. DOI: 10.1145/3640341. Online publication date: 17-Jan-2024.
  • (2024) BNoteHelper: A Note-based Outline Generation Tool for Structured Learning on Video-sharing Platforms. ACM Transactions on the Web 18, 2, 1–30. DOI: 10.1145/3638775. Online publication date: 12-Mar-2024.
  • (2024) Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1–21. DOI: 10.1145/3633781. Online publication date: 11-Jan-2024.
  • (2024) Real-time Cyber-Physical Security Solution Leveraging an Integrated Learning-Based Approach. ACM Transactions on Sensor Networks 20, 2, 1–22. DOI: 10.1145/3582009. Online publication date: 9-Jan-2024.
  • (2024) RepSGG: Novel Representations of Entities and Relationships for Scene Graph Generation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 12, 8018–8035. DOI: 10.1109/TPAMI.2024.3402143. Online publication date: Dec-2024.
  • (2024) Video question answering via traffic knowledge database and question classification. Multimedia Systems 30, 1. DOI: 10.1007/s00530-023-01240-5. Online publication date: 16-Jan-2024.
  • (2023) An Efficient Algorithm for Resource Allocation in Mobile Edge Computing Based on Convex Optimization and Karush–Kuhn–Tucker Method. Complexity 2023. DOI: 10.1155/2023/9604454. Online publication date: 1-Jan-2023.
