DOI: 10.5555/3304415.3304532

Multi-level policy and reward reinforcement learning for image captioning

Published: 13 July 2018

Abstract

Image captioning is one of the most challenging hallmarks of AI, owing to the complexity of visual and natural language understanding. As it is essentially a sequential prediction task, recent advances in image captioning use Reinforcement Learning (RL) to better explore the dynamics of word-by-word generation. However, existing RL-based image captioning methods mainly rely on a single policy network and reward function, which do not fit well with the multi-level (word and sentence) and multimodal (vision and language) nature of the task. To this end, we propose a novel multi-level policy and reward RL framework for image captioning. It contains two modules: 1) a Multi-Level Policy Network that adaptively fuses the word-level policy and the sentence-level policy for word generation; and 2) a Multi-Level Reward Function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Further, we propose a guidance term to bridge the policy and the reward for RL optimization. Extensive experiments and analysis on MSCOCO and Flickr30k show that the proposed framework achieves competitive performance across different evaluation metrics.
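The abstract describes two mechanisms: a learned, per-step fusion of a word-level and a sentence-level policy, and a reward that blends a vision-language signal with a language-language signal. The sketch below is a minimal, hypothetical illustration of those two ideas only; it is not the authors' implementation, and all module names, layer sizes, the gating form, and the blending weight are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelPolicy(nn.Module):
    """Hypothetical sketch: a word-level and a sentence-level policy whose
    next-word distributions are fused by a learned, per-step gate."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_rnn = nn.LSTMCell(embed_dim, hidden_dim)   # local, word-level context
        self.sent_rnn = nn.LSTMCell(hidden_dim, hidden_dim)  # global, sentence-level context
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.sent_head = nn.Linear(hidden_dim, vocab_size)
        self.gate = nn.Linear(2 * hidden_dim, 1)              # adaptive fusion weight

    def step(self, prev_word, word_state, sent_state):
        x = self.embed(prev_word)                             # (B, embed_dim)
        hw, cw = self.word_rnn(x, word_state)
        hs, cs = self.sent_rnn(hw, sent_state)
        beta = torch.sigmoid(self.gate(torch.cat([hw, hs], dim=-1)))
        # Adaptive mixture of the two policies' next-word distributions.
        probs = beta * F.softmax(self.word_head(hw), dim=-1) \
              + (1 - beta) * F.softmax(self.sent_head(hs), dim=-1)
        return probs, (hw, cw), (hs, cs)

def multi_level_reward(lang_reward, vis_lang_reward, alpha=0.5):
    """Blend a language-language reward (e.g. CIDEr against reference captions)
    with a vision-language reward (e.g. image-caption embedding similarity).
    The weight alpha is a placeholder, not a value taken from the paper."""
    return alpha * lang_reward + (1 - alpha) * vis_lang_reward

# Tiny usage example with zero-initialized states.
policy = MultiLevelPolicy(vocab_size=10000)
B, H = 2, 512
zeros = lambda: torch.zeros(B, H)
probs, wstate, sstate = policy.step(torch.tensor([1, 2]),
                                    (zeros(), zeros()), (zeros(), zeros()))
print(probs.shape)  # torch.Size([2, 10000])
```

In RL training one would sample captions from these fused probabilities, score each rollout with the blended reward, and apply a policy-gradient update; the paper's guidance term bridging policy and reward is not sketched here.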


Cited By

  • Social-sensed Image Aesthetics Assessment. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3s):1-19, 2020. https://doi.org/10.1145/3414843



Published In

IJCAI'18: Proceedings of the 27th International Joint Conference on Artificial Intelligence
July 2018
5885 pages
ISBN:9780999241127

Sponsors

  • Adobe
  • IBM Research
  • Ericsson
  • Microsoft
  • AI Journal

Publisher

AAAI Press
