
Video Captioning With Attention-Based LSTM and Semantic Consistency

Published: 01 September 2017

Abstract

Recent progress in using long short-term memory (LSTM) networks for image captioning has motivated the exploration of their application to video captioning. By treating a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows for selecting salient features. Furthermore, existing approaches usually model the translation error but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to translate videos into natural sentences. This framework integrates an attention mechanism with the LSTM to capture salient structures of the video, and explores the correlation between multimodal representations (i.e., words and visual content) to generate sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t-1$ to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video's visual content. Experiments on benchmark datasets demonstrate that our method, using a single feature, achieves competitive or even better results than state-of-the-art baselines for video captioning in both BLEU and METEOR.
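To make the decoding step concrete, the following is a minimal sketch, in PyTorch, of the attention-based decoding described above: at each step the per-frame 2-D CNN features are scored against the previous LSTM hidden state, combined into a dynamic weighted sum, and fed to the LSTM together with the word embedding from time $t-1$. The class name, layer dimensions, and framework choice are illustrative assumptions rather than the authors' implementation, and the multimodal-embedding consistency term is omitted.

```python
# Minimal, illustrative sketch (not the authors' released code) of one attention-based
# decoding step: per-frame 2-D CNN features are scored against the previous LSTM hidden
# state, combined into a dynamic weighted sum, and consumed by the LSTM together with
# the embedding of the word generated at time t-1. Class name, layer sizes, and the use
# of PyTorch are assumptions; the joint-embedding consistency objective is not shown.
import torch
import torch.nn as nn


class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)    # project frame features
        self.proj_hid = nn.Linear(hidden_dim, hidden_dim)   # project previous hidden state
        self.score = nn.Linear(hidden_dim, 1)               # scalar relevance per frame
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, prev_word, h, c):
        # frame_feats: (batch, n_frames, feat_dim); prev_word: (batch,) token ids
        e = self.score(torch.tanh(self.proj_feat(frame_feats)
                                  + self.proj_hid(h).unsqueeze(1)))  # (batch, n_frames, 1)
        alpha = torch.softmax(e, dim=1)                    # attention weights over frames
        context = (alpha * frame_feats).sum(dim=1)         # dynamic weighted sum of features
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))                        # one LSTM decoding step
        return self.out(h), h, c                           # word logits and updated state
```

Repeating this step while feeding back the predicted word id would generate a caption; in the framework described above, a multimodal-embedding loss would be added alongside the cross-entropy over these word logits.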


    Published In

IEEE Transactions on Multimedia, Volume 19, Issue 9
    Sept. 2017
    164 pages

    Publisher

    IEEE Press

    Cited By

• (2024) An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9739–9748. DOI: 10.1145/3664647.3680751. Online publication date: 28-Oct-2024.
• (2024) A Parallel Transformer Framework for Video Moment Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, pp. 460–468. DOI: 10.1145/3652583.3658096. Online publication date: 30-May-2024.
• (2024) Utilizing a Dense Video Captioning Technique for Generating Image Descriptions of Comics for People with Visual Impairments. Proceedings of the 29th International Conference on Intelligent User Interfaces, pp. 750–760. DOI: 10.1145/3640543.3645154. Online publication date: 18-Mar-2024.
• (2024) Short Video Ordering via Position Decoding and Successor Prediction. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2167–2176. DOI: 10.1145/3626772.3657795. Online publication date: 10-Jul-2024.
• (2024) GPT-Based Knowledge Guiding Network for Commonsense Video Captioning. IEEE Transactions on Multimedia, vol. 26, pp. 5147–5158. DOI: 10.1109/TMM.2023.3330070. Online publication date: 1-Jan-2024.
• (2024) Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia, vol. 26, pp. 2367–2379. DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024.
• (2024) Semantic Distance Adversarial Learning for Text-to-Image Synthesis. IEEE Transactions on Multimedia, vol. 26, pp. 1255–1266. DOI: 10.1109/TMM.2023.3278992. Online publication date: 1-Jan-2024.
• (2024) Rich Action-Semantic Consistent Knowledge for Early Action Prediction. IEEE Transactions on Image Processing, vol. 33, pp. 479–492. DOI: 10.1109/TIP.2023.3345737. Online publication date: 1-Jan-2024.
• (2024) A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1322–1338. DOI: 10.1109/TCSVT.2023.3296889. Online publication date: 1-Mar-2024.
• (2024) HACAN: A hierarchical answer-aware and context-aware network for question generation. Frontiers of Computer Science, vol. 18, no. 5. DOI: 10.1007/s11704-023-2246-2. Online publication date: 1-Oct-2024.
