DOI: 10.5555/3045118.3045336

Show, attend and tell: neural image caption generation with visual attention

Published: 06 July 2015

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
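The "deterministic" training regime refers to soft attention: at each decoding step the model scores every location of the CNN annotation grid against the previous decoder state, softmax-normalizes the scores, and uses the expected annotation vector as the context for the word being generated, so the whole step stays differentiable. Below is a minimal NumPy sketch of that single step; the function name, weight shapes, and additive scoring form (`W_a`, `W_h`, `v`) are illustrative assumptions, not the authors' released code. The stochastic ("hard") variant instead samples one location from the attention weights and trains by maximizing a variational lower bound, which this sketch omits.

```python
import numpy as np

def soft_attention(a, h, W_a, W_h, v):
    """One soft-attention step: context as an expectation over locations.

    a   : (L, D) annotation vectors, one per image location (from a CNN)
    h   : (H,)   previous decoder (LSTM) hidden state
    W_a : (D, K), W_h : (H, K), v : (K,) -- learned projections (illustrative)
    """
    # Alignment scores e_i = v . tanh(a_i W_a + h W_h), one per location
    scores = np.tanh(a @ W_a + h @ W_h) @ v        # shape (L,)
    # Softmax over locations -> attention weights alpha_i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Expected annotation vector; differentiable end to end, so the model
    # can be trained with standard backpropagation
    z = alpha @ a                                   # shape (D,)
    return z, alpha

# Toy usage: 196 locations (a 14x14 feature map) of 512-d annotations
rng = np.random.default_rng(0)
a = rng.standard_normal((196, 512))
h = rng.standard_normal(1024)
W_a, W_h, v = (rng.standard_normal(s) for s in [(512, 128), (1024, 128), (128,)])
z, alpha = soft_attention(a, h, W_a, W_h, v)
print(z.shape, alpha.sum())   # (512,) ~1.0
```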




    Published In

    ICML'15: Proceedings of the 32nd International Conference on Machine Learning - Volume 37
    July 2015
    2558 pages

    Publisher

    JMLR.org


    Qualifiers

    • Article


    Cited By

    • (2024) Exploring GRU-based approaches with attention mechanisms for accurate phishing URL detection. Intelligent Decision Technologies 18:2, 1029-1052. https://doi.org/10.3233/IDT-240026. Online publication date: 1-Jan-2024.
    • (2024) ImageExplorer Deployment: Understanding Text-Based and Touch-Based Image Exploration in the Wild. Proceedings of the 21st International Web for All Conference, 59-69. https://doi.org/10.1145/3677846.3677861. Online publication date: 13-May-2024.
    • (2024) TinyCheXReport: Compressed deep neural network for Chest X-ray report generation. ACM Transactions on Asian and Low-Resource Language Information Processing 23:9, 1-17. https://doi.org/10.1145/3676166. Online publication date: 3-Jul-2024.
    • (2024) Unbiased Feature Learning with Causal Intervention for Visible-Infrared Person Re-Identification. ACM Transactions on Multimedia Computing, Communications, and Applications 20:10, 1-20. https://doi.org/10.1145/3674737. Online publication date: 27-Jun-2024.
    • (2024) DQG: Database Question Generation for Exact Text-based Image Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 7424-7433. https://doi.org/10.1145/3664647.3681469. Online publication date: 28-Oct-2024.
    • (2024) Neural Methods for Data-to-text Generation. ACM Transactions on Intelligent Systems and Technology 15:5, 1-46. https://doi.org/10.1145/3660639. Online publication date: 8-May-2024.
    • (2024) Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning. ACM Computing Surveys 56:10, 1-40. https://doi.org/10.1145/3657283. Online publication date: 14-May-2024.
    • (2024) Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys 56:10, 1-42. https://doi.org/10.1145/3656580. Online publication date: 22-Jun-2024.
    • (2024) Dynamic Soft Labeling for Visual Semantic Embedding. Proceedings of the 2024 International Conference on Multimedia Retrieval, 220-228. https://doi.org/10.1145/3652583.3658068. Online publication date: 30-May-2024.
    • (2024) Personalised Multi-modal Interactive Recommendation with Hierarchical State Representations. ACM Transactions on Recommender Systems 2:3, 1-25. https://doi.org/10.1145/3651169. Online publication date: 4-Mar-2024.
