DOI: 10.5555/3016387.3016405
Article

Learning to answer questions from image using convolutional neural network

Published: 12 February 2016

Abstract

In this paper, we propose to employ the convolutional neural network (CNN) for the image question answering (QA) task. Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer to learn their joint representation for classification in the space of candidate answer words. We demonstrate the efficacy of the proposed model on DAQUAR and COCO-QA, two benchmark datasets for image QA, where it significantly outperforms the state of the art.
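
As a rough illustration of the architecture the abstract describes, below is a minimal, hypothetical PyTorch sketch of the three components: an image CNN (assumed here to operate on precomputed features from a pretrained image network), a sentence CNN over the question's word embeddings, and a multimodal convolution layer that fuses the two representations for classification over single-word answers. All class names, dimensions, and fusion details are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ImageQACNN(nn.Module):
    # Hypothetical sketch: layer choices and sizes are illustrative only.
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 img_feat_dim=4096, hidden_dim=400):
        super().__init__()
        # "Image CNN": assume image features are precomputed by a pretrained
        # network; here we only project them to the shared hidden size.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # "Sentence CNN": 1-D convolution over the question's word embeddings,
        # followed by max-pooling over time to get a fixed-length question vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        # "Multimodal convolution": convolve across the stacked image and
        # question vectors to model their joint interaction.
        self.mm_conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2)
        # Classify the joint representation over the candidate answer words.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_ids):
        # img_feats: (batch, img_feat_dim); question_ids: (batch, seq_len)
        img_vec = torch.relu(self.img_proj(img_feats))                   # (B, H)
        words = self.embed(question_ids).transpose(1, 2)                 # (B, E, T)
        sent_vec = torch.relu(self.sent_conv(words)).max(dim=2).values   # (B, H)
        joint = torch.stack([img_vec, sent_vec], dim=2)                  # (B, H, 2)
        fused = torch.relu(self.mm_conv(joint)).squeeze(2)               # (B, H)
        return self.classifier(fused)                                    # (B, num_answers)

A forward pass takes a batch of precomputed image feature vectors and padded question token ids and returns logits over the candidate answer vocabulary; the multimodal convolution here simply convolves across the stacked image and question vectors, a simplified stand-in for the joint-representation layer described in the abstract.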

    Information

    Published In

    AAAI'16: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
    February 2016
    4406 pages

    Sponsors

    • Association for the Advancement of Artificial Intelligence

    Publisher

    AAAI Press

    Publication History

    Published: 12 February 2016

    Qualifiers

    • Article

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 16 Nov 2024

    Citations

    Cited By

    • (2023) Object-Assisted Question Featurization and Multi-CNN Image Feature Fusion for Visual Question Answering. International Journal of Intelligent Information Technologies 19(1), 1-19. DOI: 10.4018/IJIIT.318671. Online publication date: 3-Mar-2023.
    • (2023) Aggregated Co-attention based Visual Question Answering. Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing, 1-10. DOI: 10.1145/3627631.3627659. Online publication date: 15-Dec-2023.
    • (2021) How Far Have We Progressed in Identifying Self-admitted Technical Debts? A Comprehensive Empirical Study. ACM Transactions on Software Engineering and Methodology 30(4), 1-56. DOI: 10.1145/3447247. Online publication date: 23-Jul-2021.
    • (2020) Fine-Grained Similarity Measurement between Educational Videos and Exercises. Proceedings of the 28th ACM International Conference on Multimedia, 331-339. DOI: 10.1145/3394171.3413783. Online publication date: 12-Oct-2020.
    • (2019) TAB-VCR. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 15615-15628. DOI: 10.5555/3454287.3455686. Online publication date: 8-Dec-2019.
    • (2019) Success prediction on crowdfunding with multimodal deep learning. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2158-2164. DOI: 10.5555/3367243.3367339. Online publication date: 10-Aug-2019.
    • (2019) Visual Question Answering with Dynamic Parameter Prediction using Functional Hashing. Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, 330-335. DOI: 10.1145/3366194.3366252. Online publication date: 20-Sep-2019.
    • (2019) Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1421-1430. DOI: 10.1145/3357384.3358000. Online publication date: 3-Nov-2019.
    • (2019) Neural Network-based Detection of Self-Admitted Technical Debt. ACM Transactions on Software Engineering and Methodology 28(3), 1-45. DOI: 10.1145/3324916. Online publication date: 29-Jul-2019.
    • (2019) Stacked Self-Attention Networks for Visual Question Answering. Proceedings of the 2019 on International Conference on Multimedia Retrieval, 207-211. DOI: 10.1145/3323873.3325044. Online publication date: 5-Jun-2019.