DOI: 10.1145/3123266.3123335

More Than An Answer: Neural Pivot Network for Visual Question Answering

Published: 19 October 2017

Abstract

Most existing work on visual question answering (VQA) is dedicated to improving the accuracy of answer prediction, while leaving the explanation behind the answers unexplored. We argue that exploiting explanations of question answering not only makes VQA interpretable, but also quantitatively improves prediction performance. In this paper, we propose a novel network architecture, termed the Neural Pivot Network (NPN), for performing VQA and generating explanations simultaneously in a multi-task learning framework. NPN is trained on both image-caption and image-question-answer pairs. In principle, CNN-based deep visual features are extracted and fed to both the VQA channel and a captioning module, the latter of which serves as a pivot bridging the source image module to the target QA predictor. This design allows large-scale image-captioning training sets, e.g., MS-COCO Caption and Visual Genome Caption, together with cutting-edge image captioning models, to be brought to bear on VQA learning. Quantitatively, the proposed NPN performs significantly better than alternative and state-of-the-art schemes trained only on VQA datasets. Moreover, by examining the captions produced as a by-product, in-depth explanations can be provided along with the answers.
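To make the pivot design concrete, below is a minimal Python sketch of the multi-task idea: a shared image encoder feeds both a VQA answer classifier and a caption decoder, and the two losses are optimized jointly. All module sizes, the elementwise question-image fusion, the GRU decoders, and the loss weight alpha are illustrative assumptions for exposition, not the configuration reported in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralPivotSketch(nn.Module):
    """Shared visual features drive two heads: answering and captioning."""
    def __init__(self, vocab_size=10000, num_answers=3000, feat_dim=2048, hid=512):
        super().__init__()
        # Stand-in for a pretrained CNN: a projection of precomputed image features.
        self.img_proj = nn.Linear(feat_dim, hid)
        # VQA channel: encode the question, fuse with the image, classify the answer.
        self.word_embed = nn.Embedding(vocab_size, hid)
        self.q_rnn = nn.GRU(hid, hid, batch_first=True)
        self.answer_head = nn.Linear(hid, num_answers)
        # Captioning module (the "pivot"): decode a caption from the image features.
        self.cap_rnn = nn.GRU(hid, hid, batch_first=True)
        self.cap_head = nn.Linear(hid, vocab_size)

    def forward(self, img_feat, question, caption_in):
        v = torch.tanh(self.img_proj(img_feat))            # (B, hid)
        _, q = self.q_rnn(self.word_embed(question))       # (1, B, hid)
        fused = v * q.squeeze(0)                           # elementwise fusion (assumed)
        answer_logits = self.answer_head(fused)            # (B, num_answers)
        # Condition the caption decoder on the image via its initial hidden state.
        cap_h, _ = self.cap_rnn(self.word_embed(caption_in),
                                v.unsqueeze(0).contiguous())
        caption_logits = self.cap_head(cap_h)              # (B, T, vocab_size)
        return answer_logits, caption_logits

def joint_loss(answer_logits, answers, caption_logits, captions, alpha=0.5):
    # Multi-task objective: answer classification plus caption language modeling,
    # mixed with an assumed weight alpha.
    vqa = F.cross_entropy(answer_logits, answers)
    cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                          captions.reshape(-1))
    return vqa + alpha * cap

# Example forward pass on random inputs (batch of 2, hypothetical shapes).
model = NeuralPivotSketch()
ans, cap = model(torch.randn(2, 2048),
                 torch.randint(0, 10000, (2, 8)),
                 torch.randint(0, 10000, (2, 12)))

One plausible training regimen, consistent with the abstract's description, mixes image-caption pairs (which update only the captioning path) with image-question-answer triples (which update the VQA path); the shared encoder benefits from both, which is how large captioning corpora such as MS-COCO Caption could aid VQA learning.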





Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2017


Author Tags

  1. multi-task learning
  2. neural pivot network
  3. visual question answering

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China
  • National Key R&D Program

Conference

MM '17
Sponsor:
MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 2,010 of 7,772 submissions, 26%



Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 2
Reflects downloads up to 16 Nov 2024


Cited By

  • (2024) Few-Shot Multimodal Explanation for Visual Question Answering. Proceedings of the 32nd ACM International Conference on Multimedia, 1875-1884. DOI: 10.1145/3664647.3681597. Online publication date: 28-Oct-2024.
  • (2023) Multimodal Visual Question Answering Model Enhanced with Image Emotional Information. 2023 5th International Conference on Natural Language Processing (ICNLP), 268-273. DOI: 10.1109/ICNLP58431.2023.00056. Online publication date: Mar-2023.
  • (2021) Meta Module Network for Compositional Visual Reasoning. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 655-664. DOI: 10.1109/WACV48630.2021.00070. Online publication date: Jan-2021.
  • (2020) Dynamic Context-guided Capsule Network for Multimodal Machine Translation. Proceedings of the 28th ACM International Conference on Multimedia, 1320-1329. DOI: 10.1145/3394171.3413715. Online publication date: 12-Oct-2020.
  • (2019) Perceptual Visual Reasoning with Knowledge Propagation. Proceedings of the 27th ACM International Conference on Multimedia, 530-538. DOI: 10.1145/3343031.3350922. Online publication date: 15-Oct-2019.
  • (2018) Examine before You Answer. Proceedings of the 26th ACM International Conference on Multimedia, 1742-1750. DOI: 10.1145/3240508.3240687. Online publication date: 15-Oct-2018.
  • (2018) Collaborative Annotation of Semantic Objects in Images with Multi-granularity Supervisions. Proceedings of the 26th ACM International Conference on Multimedia, 474-482. DOI: 10.1145/3240508.3240540. Online publication date: 15-Oct-2018.
