Context-Aware Visual Policy Network for Fine-Grained Image Captioning

Published: 01 February 2022

Abstract

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained, and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer, and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be formulated as sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy but not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., “man riding horse”) and visual comparisons (e.g., “small(er) cat”). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly treats the previous visual attentions as context and decides whether that context should be used for the current word or sentence generation given the current visual attention. Compared with the traditional visual attention mechanism, which fixates on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, consisting of CAVP and its subsequent language policy network, can be efficiently optimized end-to-end with an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets under various metrics, together with sensible visualizations of the qualitative visual context.
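The two ingredients named in the abstract, attention over a pool that includes previously attended visual contexts, and sentence-level actor-critic training, can be sketched roughly as below. This is a minimal PyTorch sketch under assumed module names, feature dimensions, and a generic additive-attention formulation; it illustrates the general idea only and is not the authors' released implementation, and the sentence-level reward (e.g., CIDEr) is left abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareVisualPolicy(nn.Module):
    """Attends over the current region features together with previously
    attended visual contexts, so the attended composition can span time."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_state = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, history, state):
        # regions: (B, N, feat_dim)  detected region features of the image
        # history: (B, T, feat_dim)  visual contexts attended at earlier steps
        # state:   (B, hidden_dim)   current language-model hidden state
        candidates = torch.cat([regions, history], dim=1)             # (B, N+T, feat_dim)
        scores = self.score(torch.tanh(
            self.proj_feat(candidates) + self.proj_state(state).unsqueeze(1)))
        weights = F.softmax(scores.squeeze(-1), dim=-1)               # (B, N+T)
        context = (weights.unsqueeze(-1) * candidates).sum(dim=1)     # (B, feat_dim)
        return context, weights

def actor_critic_loss(log_probs, values, reward):
    # log_probs: (B, L) log-probabilities of the sampled caption tokens
    # values:    (B, L) critic estimates of the expected sentence reward
    # reward:    (B,)   sentence-level reward, e.g. CIDEr of the sampled caption
    reward = reward.unsqueeze(1).expand_as(values)
    advantage = reward - values.detach()
    policy_loss = -(advantage * log_probs).mean()
    critic_loss = F.mse_loss(values, reward)
    return policy_loss + critic_loss

# Illustrative usage with random tensors.
if __name__ == "__main__":
    policy = ContextAwareVisualPolicy()
    regions = torch.randn(2, 36, 2048)   # 36 detected regions per image
    history = torch.randn(2, 3, 2048)    # 3 previously attended contexts
    state = torch.randn(2, 512)
    context, weights = policy(regions, history, state)
    print(context.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 39])

Because the earlier contexts compete with the raw regions inside a single softmax, the policy can either revisit a past composition or fixate on a fresh region at each step, which is the behavior the abstract attributes to CAVP.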

Publisher: IEEE Computer Society, United States

