DOI: 10.1145/3123266.3123335

More Than An Answer: Neural Pivot Network for Visual Question Answering

Published: 19 October 2017

Abstract

Most existing work on visual question answering (VQA) is dedicated to improving the accuracy of answer prediction, while leaving the explanation behind the answers unexplored. We argue that exploiting explanations of question answering not only makes VQA interpretable, but also quantitatively improves prediction performance. In this paper, we propose a novel network architecture, termed the Neural Pivot Network (NPN), for performing VQA and generating explanations simultaneously in a multi-task learning framework. NPN is trained on both image-caption and image-question-answer pairs. In principle, CNN-based deep visual features are extracted and fed to both the VQA channel and a captioning module, the latter of which serves as a pivot bridging the source image module to the target QA predictor. This design allows large-scale image-captioning training sets, e.g., MS-COCO Caption and Visual Genome Caption, together with cutting-edge image captioning models, to be brought to bear on VQA learning. Quantitatively, the proposed NPN performs significantly better than alternative and state-of-the-art schemes trained only on VQA datasets. Moreover, by examining the captions produced as a by-product, in-depth explanations can be provided along with the answers.
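To make the pivot design concrete, below is a minimal Python sketch of the multi-task idea: a shared image encoder feeds both a VQA answer classifier and a caption decoder, and the two losses are optimized jointly. All module sizes, the elementwise question-image fusion, the GRU decoders, and the loss weight alpha are illustrative assumptions for exposition, not the configuration reported in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralPivotSketch(nn.Module):
    """Shared visual features drive two heads: answering and captioning."""
    def __init__(self, vocab_size=10000, num_answers=3000, feat_dim=2048, hid=512):
        super().__init__()
        # Stand-in for a pretrained CNN: a projection of precomputed image features.
        self.img_proj = nn.Linear(feat_dim, hid)
        # VQA channel: encode the question, fuse with the image, classify the answer.
        self.word_embed = nn.Embedding(vocab_size, hid)
        self.q_rnn = nn.GRU(hid, hid, batch_first=True)
        self.answer_head = nn.Linear(hid, num_answers)
        # Captioning module (the "pivot"): decode a caption from the image features.
        self.cap_rnn = nn.GRU(hid, hid, batch_first=True)
        self.cap_head = nn.Linear(hid, vocab_size)

    def forward(self, img_feat, question, caption_in):
        v = torch.tanh(self.img_proj(img_feat))            # (B, hid)
        _, q = self.q_rnn(self.word_embed(question))       # (1, B, hid)
        fused = v * q.squeeze(0)                           # elementwise fusion (assumed)
        answer_logits = self.answer_head(fused)            # (B, num_answers)
        # Condition the caption decoder on the image via its initial hidden state.
        cap_h, _ = self.cap_rnn(self.word_embed(caption_in),
                                v.unsqueeze(0).contiguous())
        caption_logits = self.cap_head(cap_h)              # (B, T, vocab_size)
        return answer_logits, caption_logits

def joint_loss(answer_logits, answers, caption_logits, captions, alpha=0.5):
    # Multi-task objective: answer classification plus caption language modeling,
    # mixed with an assumed weight alpha.
    vqa = F.cross_entropy(answer_logits, answers)
    cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                          captions.reshape(-1))
    return vqa + alpha * cap

# Example forward pass on random inputs (batch of 2, hypothetical shapes).
model = NeuralPivotSketch()
ans, cap = model(torch.randn(2, 2048),
                 torch.randint(0, 10000, (2, 8)),
                 torch.randint(0, 10000, (2, 12)))

One plausible training regimen, consistent with the abstract's description, mixes image-caption pairs (which update only the captioning path) with image-question-answer triples (which update the VQA path); the shared encoder benefits from both, which is how large captioning corpora such as MS-COCO Caption could aid VQA learning.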





Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2017


Author Tags

  1. multi-task learning
  2. neural pivot network
  3. visual question answering

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China
  • National Key R&D Program

Conference

MM '17
Sponsor:
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 2,010 of 7,772 submissions, 26%



Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 2
Reflects downloads up to 16 Nov 2024


Cited By

  • (2024) Few-Shot Multimodal Explanation for Visual Question Answering. Proceedings of the 32nd ACM International Conference on Multimedia, 1875-1884. DOI: 10.1145/3664647.3681597. Online publication date: 28-Oct-2024.
  • (2023) Multimodal Visual Question Answering Model Enhanced with Image Emotional Information. 2023 5th International Conference on Natural Language Processing (ICNLP), 268-273. DOI: 10.1109/ICNLP58431.2023.00056. Online publication date: Mar-2023.
  • (2021) Meta Module Network for Compositional Visual Reasoning. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 655-664. DOI: 10.1109/WACV48630.2021.00070. Online publication date: Jan-2021.
  • (2020) Dynamic Context-guided Capsule Network for Multimodal Machine Translation. Proceedings of the 28th ACM International Conference on Multimedia, 1320-1329. DOI: 10.1145/3394171.3413715. Online publication date: 12-Oct-2020.
  • (2019) Perceptual Visual Reasoning with Knowledge Propagation. Proceedings of the 27th ACM International Conference on Multimedia, 530-538. DOI: 10.1145/3343031.3350922. Online publication date: 15-Oct-2019.
  • (2018) Examine before You Answer. Proceedings of the 26th ACM International Conference on Multimedia, 1742-1750. DOI: 10.1145/3240508.3240687. Online publication date: 15-Oct-2018.
  • (2018) Collaborative Annotation of Semantic Objects in Images with Multi-granularity Supervisions. Proceedings of the 26th ACM International Conference on Multimedia, 474-482. DOI: 10.1145/3240508.3240540. Online publication date: 15-Oct-2018.
