DOI: 10.1145/3240508.3240632

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Published: 15 October 2018

Abstract

Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" during training: the ground-truth subsequence is exposed at every prediction step, which introduces a bias at test time when only the predicted subsequence is seen. However, existing RL-based image captioning methods focus only on the language policy and not the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill this gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as the context, and then decides whether that context is helpful for generating the current word given the current visual attention. Compared with traditional visual attention, which fixes on only a single image region at every step, CAVP can attend to complex visual compositions over time. The whole image captioning model, i.e., CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method with respect to any caption evaluation metric. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO offline split and online server, using various metrics and sensible visualizations of qualitative visual context. The code is available at https://github.com/daqingliu/CAVP
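
For intuition, below is a minimal, hypothetical PyTorch-style sketch of the single step described above: the visual policy attends over region features, separately attends over the visual context accumulated from previous steps, and gates how much of that context to combine with the current attention. The module names, dimensions, and gating form are illustrative assumptions, not the authors' released implementation; the actual model and its actor-critic training are in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualPolicy(nn.Module):
    """Hypothetical sketch of one context-aware visual policy step."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # Attention over the image regions available at the current step.
        self.region_att = nn.Linear(feat_dim + hidden_dim, 1)
        # Attention over the memory of previously attended visual features.
        self.context_att = nn.Linear(feat_dim + hidden_dim, 1)
        # Gate deciding whether the accumulated context helps the current word.
        self.gate = nn.Linear(feat_dim * 2 + hidden_dim, 1)

    def forward(self, regions, hidden, context_memory):
        # regions: (N, feat_dim) region features; hidden: (hidden_dim,) language-policy state.
        h = hidden.unsqueeze(0).expand(regions.size(0), -1)
        alpha = F.softmax(self.region_att(torch.cat([regions, h], dim=-1)), dim=0)
        current = (alpha * regions).sum(dim=0)      # single-region attention, as in prior work

        if context_memory:                          # previous attentions form the visual context
            ctx = torch.stack(context_memory)
            hc = hidden.unsqueeze(0).expand(ctx.size(0), -1)
            beta = F.softmax(self.context_att(torch.cat([ctx, hc], dim=-1)), dim=0)
            past = (beta * ctx).sum(dim=0)
        else:
            past = torch.zeros_like(current)

        g = torch.sigmoid(self.gate(torch.cat([current, past, hidden], dim=-1)))
        visual = current + g * past                 # compositional visual feature for this word
        context_memory.append(current.detach())
        return visual, context_memory

# Example usage with made-up sizes (36 regions, 2048-d features, 512-d state).
policy = ContextAwareVisualPolicy(feat_dim=2048, hidden_dim=512)
visual, memory = policy(torch.randn(36, 2048), torch.zeros(512), [])

In the full model, the visual feature returned at each step would feed the language policy (e.g., an LSTM producing a word distribution), and the whole pipeline would be trained with a policy-gradient objective whose reward is a caption metric such as CIDEr, with a critic or baseline reducing variance.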





    Published In

    MM '18: Proceedings of the 26th ACM International Conference on Multimedia
    October 2018
    2167 pages
    ISBN:9781450356657
    DOI:10.1145/3240508
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2018


    Author Tags

    1. image captioning
    2. policy network
    3. reinforcement learning
    4. visual context

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China

    Conference

    MM '18
    Sponsor:
    MM '18: ACM Multimedia Conference
    October 22 - 26, 2018
    Seoul, Republic of Korea

    Acceptance Rates

    MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
    Overall Acceptance Rate: 995 of 4,171 submissions, 24%



    Cited By

    • (2024) Describe Images in a Boring Way: Towards Cross-Modal Sarcasm Generation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5689-5698. DOI: 10.1109/WACV57701.2024.00560. Online publication date: 3-Jan-2024.
    • (2024) Learning Hierarchical Modular Networks for Video Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1049-1064. DOI: 10.1109/TPAMI.2023.3327677. Online publication date: Mar-2024.
    • (2024) Vision-and-Language Navigation via Latent Semantic Alignment Learning. IEEE Transactions on Multimedia 26, 8406-8418. DOI: 10.1109/TMM.2024.3358112. Online publication date: 2024.
    • (2024) CrowdCaption++: Collective-Guided Crowd Scenes Captioning. IEEE Transactions on Multimedia 26, 4974-4986. DOI: 10.1109/TMM.2023.3328189. Online publication date: 2024.
    • (2024) End-To-End Spatially-Constrained Multi-Perspective Fine-Grained Image Captioning. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3360-3364. DOI: 10.1109/ICASSP48485.2024.10445846. Online publication date: 14-Apr-2024.
    • (2024) Context-aware Multi-level Question Embedding Fusion for visual question answering. Information Fusion 102, 102000. DOI: 10.1016/j.inffus.2023.102000. Online publication date: Mar-2024.
    • (2024) Sentinel mechanism for visual semantic graph-based image captioning. Computers and Electrical Engineering 119, 109626. DOI: 10.1016/j.compeleceng.2024.109626. Online publication date: Nov-2024.
    • (2024) Video captioning – a survey. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18886-6. Online publication date: 9-Apr-2024.
    • (2024) A Fine-Grained Image Description Generation Method Based on Joint Objectives. Computer Supported Cooperative Work and Social Computing, 32-46. DOI: 10.1007/978-981-99-9637-7_3. Online publication date: 5-Jan-2024.
    • (2024) Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. Web and Big Data, 363-376. DOI: 10.1007/978-981-97-2390-4_25. Online publication date: 28-Apr-2024.
