Context-Aware Visual Policy Network for Fine-Grained Image Captioning

Published: 01 February 2022

Abstract

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained, and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer, and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be formulated as sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy but not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., “man riding horse”) and visual comparisons (e.g., “small(er) cat”). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly treats the previous visual attentions as context and decides whether that context should be used for the current word or sentence generation given the current visual attention. Compared with the traditional visual attention mechanism, which fixates on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, consisting of CAVP and its subsequent language policy network, can be efficiently optimized end-to-end with an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets under various metrics, together with sensible visualizations of the qualitative visual context.
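The two ingredients named in the abstract, attention over a pool that includes previously attended visual contexts, and sentence-level actor-critic training, can be sketched roughly as below. This is a minimal PyTorch sketch under assumed module names, feature dimensions, and a generic additive-attention formulation; it illustrates the general idea only and is not the authors' released implementation, and the sentence-level reward (e.g., CIDEr) is left abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualPolicy(nn.Module):
    """Attends over the current region features together with previously
    attended visual contexts, so the attended composition can span time."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_state = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, history, state):
        # regions: (B, N, feat_dim)  detected region features of the image
        # history: (B, T, feat_dim)  visual contexts attended at earlier steps
        # state:   (B, hidden_dim)   current language-model hidden state
        candidates = torch.cat([regions, history], dim=1)             # (B, N+T, feat_dim)
        scores = self.score(torch.tanh(
            self.proj_feat(candidates) + self.proj_state(state).unsqueeze(1)))
        weights = F.softmax(scores.squeeze(-1), dim=-1)               # (B, N+T)
        context = (weights.unsqueeze(-1) * candidates).sum(dim=1)     # (B, feat_dim)
        return context, weights

def actor_critic_loss(log_probs, values, reward):
    # log_probs: (B, L) log-probabilities of the sampled caption tokens
    # values:    (B, L) critic estimates of the expected sentence reward
    # reward:    (B,)   sentence-level reward, e.g. CIDEr of the sampled caption
    reward = reward.unsqueeze(1).expand_as(values)
    advantage = reward - values.detach()
    policy_loss = -(advantage * log_probs).mean()
    critic_loss = F.mse_loss(values, reward)
    return policy_loss + critic_loss

# Illustrative usage with random tensors.
if __name__ == "__main__":
    policy = ContextAwareVisualPolicy()
    regions = torch.randn(2, 36, 2048)   # 36 detected regions per image
    history = torch.randn(2, 3, 2048)    # 3 previously attended contexts
    state = torch.randn(2, 512)
    context, weights = policy(regions, history, state)
    print(context.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 39])

Because the earlier contexts compete with the raw regions inside a single softmax, the policy can either revisit a past composition or fixate on a fresh region at each step, which is the behavior the abstract attributes to CAVP.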

Publisher: IEEE Computer Society, United States

