DOI: 10.1145/3240508.3240632

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Published: 15 October 2018

Abstract

Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" during training: the ground-truth subsequence is exposed at every prediction step, which introduces a bias at test time when only the predicted subsequence is seen. However, existing RL-based image captioning methods focus only on the language policy and not the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill this gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as the context, and then decides whether that context is helpful for generating the current word given the current visual attention. Compared with traditional visual attention, which fixes on only a single image region at every step, CAVP can attend to complex visual compositions over time. The whole image captioning model, i.e., CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method with respect to any caption evaluation metric. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO offline split and online server, using various metrics and sensible visualizations of qualitative visual context. The code is available at https://github.com/daqingliu/CAVP
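
For intuition, below is a minimal, hypothetical PyTorch-style sketch of the single step described above: the visual policy attends over region features, separately attends over the visual context accumulated from previous steps, and gates how much of that context to combine with the current attention. The module names, dimensions, and gating form are illustrative assumptions, not the authors' released implementation; the actual model and its actor-critic training are in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualPolicy(nn.Module):
    """Hypothetical sketch of one context-aware visual policy step."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # Attention over the image regions available at the current step.
        self.region_att = nn.Linear(feat_dim + hidden_dim, 1)
        # Attention over the memory of previously attended visual features.
        self.context_att = nn.Linear(feat_dim + hidden_dim, 1)
        # Gate deciding whether the accumulated context helps the current word.
        self.gate = nn.Linear(feat_dim * 2 + hidden_dim, 1)

    def forward(self, regions, hidden, context_memory):
        # regions: (N, feat_dim) region features; hidden: (hidden_dim,) language-policy state.
        h = hidden.unsqueeze(0).expand(regions.size(0), -1)
        alpha = F.softmax(self.region_att(torch.cat([regions, h], dim=-1)), dim=0)
        current = (alpha * regions).sum(dim=0)      # single-region attention, as in prior work

        if context_memory:                          # previous attentions form the visual context
            ctx = torch.stack(context_memory)
            hc = hidden.unsqueeze(0).expand(ctx.size(0), -1)
            beta = F.softmax(self.context_att(torch.cat([ctx, hc], dim=-1)), dim=0)
            past = (beta * ctx).sum(dim=0)
        else:
            past = torch.zeros_like(current)

        g = torch.sigmoid(self.gate(torch.cat([current, past, hidden], dim=-1)))
        visual = current + g * past                 # compositional visual feature for this word
        context_memory.append(current.detach())
        return visual, context_memory

# Example usage with made-up sizes (36 regions, 2048-d features, 512-d state).
policy = ContextAwareVisualPolicy(feat_dim=2048, hidden_dim=512)
visual, memory = policy(torch.randn(36, 2048), torch.zeros(512), [])

In the full model, the visual feature returned at each step would feed the language policy (e.g., an LSTM producing a word distribution), and the whole pipeline would be trained with a policy-gradient objective whose reward is a caption metric such as CIDEr, with a critic or baseline reducing variance.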





    Published In

    MM '18: Proceedings of the 26th ACM International Conference on Multimedia
    October 2018
    2167 pages
    ISBN:9781450356657
    DOI:10.1145/3240508
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 October 2018


    Author Tags

    1. image captioning
    2. policy network
    3. reinforcement learning
    4. visual context

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China

    Conference

    MM '18
    Sponsor:
    MM '18: ACM Multimedia Conference
    October 22 - 26, 2018
    Seoul, Republic of Korea

    Acceptance Rates

    MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
    Overall Acceptance Rate: 995 of 4,171 submissions, 24%



    Cited By

    • (2024) Describe Images in a Boring Way: Towards Cross-Modal Sarcasm Generation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5689-5698. DOI: 10.1109/WACV57701.2024.00560. Online publication date: 3-Jan-2024.
    • (2024) Learning Hierarchical Modular Networks for Video Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1049-1064. DOI: 10.1109/TPAMI.2023.3327677. Online publication date: Mar-2024.
    • (2024) Vision-and-Language Navigation via Latent Semantic Alignment Learning. IEEE Transactions on Multimedia 26, 8406-8418. DOI: 10.1109/TMM.2024.3358112. Online publication date: 2024.
    • (2024) CrowdCaption++: Collective-Guided Crowd Scenes Captioning. IEEE Transactions on Multimedia 26, 4974-4986. DOI: 10.1109/TMM.2023.3328189. Online publication date: 2024.
    • (2024) End-To-End Spatially-Constrained Multi-Perspective Fine-Grained Image Captioning. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3360-3364. DOI: 10.1109/ICASSP48485.2024.10445846. Online publication date: 14-Apr-2024.
    • (2024) Context-aware Multi-level Question Embedding Fusion for visual question answering. Information Fusion 102, 102000. DOI: 10.1016/j.inffus.2023.102000. Online publication date: Mar-2024.
    • (2024) Sentinel mechanism for visual semantic graph-based image captioning. Computers and Electrical Engineering 119, 109626. DOI: 10.1016/j.compeleceng.2024.109626. Online publication date: Nov-2024.
    • (2024) Video captioning – a survey. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18886-6. Online publication date: 9-Apr-2024.
    • (2024) A Fine-Grained Image Description Generation Method Based on Joint Objectives. Computer Supported Cooperative Work and Social Computing, 32-46. DOI: 10.1007/978-981-99-9637-7_3. Online publication date: 5-Jan-2024.
    • (2024) Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. Web and Big Data, 363-376. DOI: 10.1007/978-981-97-2390-4_25. Online publication date: 28-Apr-2024.
