
Video Captioning With Attention-Based LSTM and Semantic Consistency

Published: 01 September 2017

Abstract

Recent progress in using long short-term memory (LSTM) networks for image captioning has motivated the exploration of their application to video captioning. By treating a video as a sequence of features, an LSTM model is trained on video-sentence pairs and learns to associate a video with a sentence. However, most existing methods compress an entire video shot or frame into a static representation, without an attention mechanism that allows for selecting salient features. Furthermore, existing approaches usually model the translation error but ignore the correlations between sentence semantics and visual content. To tackle these issues, we propose a novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, to translate videos into natural sentences. This framework integrates an attention mechanism with the LSTM to capture salient structures of the video, and explores the correlation between multimodal representations (i.e., words and visual content) to generate sentences with rich semantic content. Specifically, we first propose an attention mechanism that uses a dynamic weighted sum of local two-dimensional convolutional neural network representations. Then, an LSTM decoder takes these visual features at time $t$ and the word-embedding feature at time $t-1$ to generate important words. Finally, we use multimodal embedding to map the visual and sentence features into a joint space to guarantee the semantic consistency of the sentence description and the video's visual content. Experiments on benchmark datasets demonstrate that our method, using a single feature, achieves competitive or even better results than state-of-the-art baselines for video captioning in both BLEU and METEOR.
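To make the decoding step concrete, the following is a minimal sketch, in PyTorch, of the attention-based decoding described above: at each step the per-frame 2-D CNN features are scored against the previous LSTM hidden state, combined into a dynamic weighted sum, and fed to the LSTM together with the word embedding from time $t-1$. The class name, layer dimensions, and framework choice are illustrative assumptions rather than the authors' implementation, and the multimodal-embedding consistency term is omitted.

```python
# Minimal, illustrative sketch (not the authors' released code) of one attention-based
# decoding step: per-frame 2-D CNN features are scored against the previous LSTM hidden
# state, combined into a dynamic weighted sum, and consumed by the LSTM together with
# the embedding of the word generated at time t-1. Class name, layer sizes, and the use
# of PyTorch are assumptions; the joint-embedding consistency objective is not shown.
import torch
import torch.nn as nn


class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)    # project frame features
        self.proj_hid = nn.Linear(hidden_dim, hidden_dim)   # project previous hidden state
        self.score = nn.Linear(hidden_dim, 1)               # scalar relevance per frame
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, prev_word, h, c):
        # frame_feats: (batch, n_frames, feat_dim); prev_word: (batch,) token ids
        e = self.score(torch.tanh(self.proj_feat(frame_feats)
                                  + self.proj_hid(h).unsqueeze(1)))  # (batch, n_frames, 1)
        alpha = torch.softmax(e, dim=1)                    # attention weights over frames
        context = (alpha * frame_feats).sum(dim=1)         # dynamic weighted sum of features
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))                        # one LSTM decoding step
        return self.out(h), h, c                           # word logits and updated state
```

Repeating this step while feeding back the predicted word id would generate a caption; in the framework described above, a multimodal-embedding loss would be added alongside the cross-entropy over these word logits.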


    Published In

IEEE Transactions on Multimedia, Volume 19, Issue 9
    Sept. 2017
    164 pages

    Publisher

    IEEE Press

    Cited By

• (2024) An Inverse Partial Optimal Transport Framework for Music-guided Trailer Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9739–9748. DOI: 10.1145/3664647.3680751. Online publication date: 28-Oct-2024.
• (2024) A Parallel Transformer Framework for Video Moment Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, pp. 460–468. DOI: 10.1145/3652583.3658096. Online publication date: 30-May-2024.
• (2024) Utilizing a Dense Video Captioning Technique for Generating Image Descriptions of Comics for People with Visual Impairments. Proceedings of the 29th International Conference on Intelligent User Interfaces, pp. 750–760. DOI: 10.1145/3640543.3645154. Online publication date: 18-Mar-2024.
• (2024) Short Video Ordering via Position Decoding and Successor Prediction. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2167–2176. DOI: 10.1145/3626772.3657795. Online publication date: 10-Jul-2024.
• (2024) GPT-Based Knowledge Guiding Network for Commonsense Video Captioning. IEEE Transactions on Multimedia, vol. 26, pp. 5147–5158. DOI: 10.1109/TMM.2023.3330070. Online publication date: 1-Jan-2024.
• (2024) Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia, vol. 26, pp. 2367–2379. DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024.
• (2024) Semantic Distance Adversarial Learning for Text-to-Image Synthesis. IEEE Transactions on Multimedia, vol. 26, pp. 1255–1266. DOI: 10.1109/TMM.2023.3278992. Online publication date: 1-Jan-2024.
• (2024) Rich Action-Semantic Consistent Knowledge for Early Action Prediction. IEEE Transactions on Image Processing, vol. 33, pp. 479–492. DOI: 10.1109/TIP.2023.3345737. Online publication date: 1-Jan-2024.
• (2024) A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes. IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1322–1338. DOI: 10.1109/TCSVT.2023.3296889. Online publication date: 1-Mar-2024.
• (2024) HACAN: A hierarchical answer-aware and context-aware network for question generation. Frontiers of Computer Science, vol. 18, no. 5. DOI: 10.1007/s11704-023-2246-2. Online publication date: 1-Oct-2024.
