DOI: 10.5555/3016387.3016405
Article

Learning to answer questions from image using convolutional neural network

Published: 12 February 2016

Abstract

In this paper, we propose to employ the convolutional neural network (CNN) for the image question answering (QA) task. Our proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations but also their inter-modal interactions to produce the answer. More specifically, our model consists of three CNNs: one image CNN to encode the image content, one sentence CNN to compose the words of the question, and one multimodal convolution layer to learn their joint representation for classification in the space of candidate answer words. We demonstrate the efficacy of the proposed model on DAQUAR and COCO-QA, two benchmark datasets for image QA, where it significantly outperforms the state of the art.
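
As a rough illustration of the architecture the abstract describes, below is a minimal, hypothetical PyTorch sketch of the three components: an image CNN (assumed here to operate on precomputed features from a pretrained image network), a sentence CNN over the question's word embeddings, and a multimodal convolution layer that fuses the two representations for classification over single-word answers. All class names, dimensions, and fusion details are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ImageQACNN(nn.Module):
    # Hypothetical sketch: layer choices and sizes are illustrative only.
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 img_feat_dim=4096, hidden_dim=400):
        super().__init__()
        # "Image CNN": assume image features are precomputed by a pretrained
        # network; here we only project them to the shared hidden size.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # "Sentence CNN": 1-D convolution over the question's word embeddings,
        # followed by max-pooling over time to get a fixed-length question vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        # "Multimodal convolution": convolve across the stacked image and
        # question vectors to model their joint interaction.
        self.mm_conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2)
        # Classify the joint representation over the candidate answer words.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_ids):
        # img_feats: (batch, img_feat_dim); question_ids: (batch, seq_len)
        img_vec = torch.relu(self.img_proj(img_feats))                   # (B, H)
        words = self.embed(question_ids).transpose(1, 2)                 # (B, E, T)
        sent_vec = torch.relu(self.sent_conv(words)).max(dim=2).values   # (B, H)
        joint = torch.stack([img_vec, sent_vec], dim=2)                  # (B, H, 2)
        fused = torch.relu(self.mm_conv(joint)).squeeze(2)               # (B, H)
        return self.classifier(fused)                                    # (B, num_answers)

A forward pass takes a batch of precomputed image feature vectors and padded question token ids and returns logits over the candidate answer vocabulary; the multimodal convolution here simply convolves across the stacked image and question vectors, a simplified stand-in for the joint-representation layer described in the abstract.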

    Information

    Published In

    AAAI'16: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
    February 2016
    4406 pages

    Sponsors

    • Association for the Advancement of Artificial Intelligence

    Publisher

    AAAI Press

    Publication History

    Published: 12 February 2016

    Qualifiers

    • Article

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 16 Nov 2024

    Citations

    Cited By

    • (2023) Object-Assisted Question Featurization and Multi-CNN Image Feature Fusion for Visual Question Answering. International Journal of Intelligent Information Technologies 19(1), 1-19. DOI: 10.4018/IJIIT.318671. Online publication date: 3-Mar-2023.
    • (2023) Aggregated Co-attention based Visual Question Answering. Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing, 1-10. DOI: 10.1145/3627631.3627659. Online publication date: 15-Dec-2023.
    • (2021) How Far Have We Progressed in Identifying Self-admitted Technical Debts? A Comprehensive Empirical Study. ACM Transactions on Software Engineering and Methodology 30(4), 1-56. DOI: 10.1145/3447247. Online publication date: 23-Jul-2021.
    • (2020) Fine-Grained Similarity Measurement between Educational Videos and Exercises. Proceedings of the 28th ACM International Conference on Multimedia, 331-339. DOI: 10.1145/3394171.3413783. Online publication date: 12-Oct-2020.
    • (2019) TAB-VCR. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 15615-15628. DOI: 10.5555/3454287.3455686. Online publication date: 8-Dec-2019.
    • (2019) Success prediction on crowdfunding with multimodal deep learning. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2158-2164. DOI: 10.5555/3367243.3367339. Online publication date: 10-Aug-2019.
    • (2019) Visual Question Answering with Dynamic Parameter Prediction using Functional Hashing. Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, 330-335. DOI: 10.1145/3366194.3366252. Online publication date: 20-Sep-2019.
    • (2019) Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 1421-1430. DOI: 10.1145/3357384.3358000. Online publication date: 3-Nov-2019.
    • (2019) Neural Network-based Detection of Self-Admitted Technical Debt. ACM Transactions on Software Engineering and Methodology 28(3), 1-45. DOI: 10.1145/3324916. Online publication date: 29-Jul-2019.
    • (2019) Stacked Self-Attention Networks for Visual Question Answering. Proceedings of the 2019 on International Conference on Multimedia Retrieval, 207-211. DOI: 10.1145/3323873.3325044. Online publication date: 5-Jun-2019.