Residual attention-based LSTM for video captioning

Published: 01 March 2019

Abstract

Recently, great success has been achieved in video captioning by frameworks built on hierarchical LSTMs, such as stacked LSTM networks. However, once deeper LSTM layers start converging, a degradation problem is exposed: as the number of LSTM layers increases, accuracy saturates and then degrades rapidly, much as in standard deep convolutional networks such as VGG. In this paper, we propose a novel attention-based framework, Residual Attention-based LSTM (Res-ATT), which not only takes advantage of the existing attention mechanism but also considers the sentence's internal information, which is usually lost during transmission. Our key novelty is showing how to integrate residual mapping into a hierarchical LSTM network to alleviate the degradation problem. More specifically, our hierarchical architecture builds on two LSTM layers, and residual mapping is introduced to avoid losing information about previously generated words (i.e., both content information and relationship information). Experimental results on the mainstream MSVD and MSR-VTT datasets show that our framework outperforms state-of-the-art approaches. Furthermore, our automatically generated sentences provide more detailed information that precisely describes a video.
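
The abstract describes a two-layer hierarchical LSTM decoder in which a residual mapping carries the first layer's output past the second layer. The following is a minimal PyTorch sketch of that general idea, not the authors' exact implementation: the module name, layer sizes, and the simple additive attention used here are illustrative assumptions.

```python
# Illustrative sketch (assumed wiring, not the paper's exact Res-ATT model):
# two stacked LSTM cells with temporal attention over frame features and a
# residual connection that adds the first layer's output to the second's.
import torch
import torch.nn as nn


class ResidualAttentionDecoder(nn.Module):
    def __init__(self, feat_dim=1536, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)               # first (language) layer
        self.lstm2 = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)   # second (visual) layer
        self.att_w = nn.Linear(hidden_dim + feat_dim, 1)              # simple additive attention score
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, word, state1, state2):
        # feats: (batch, n_frames, feat_dim) frame-level CNN features
        # word:  (batch,) indices of the previously generated word
        x = self.embed(word)
        h1, c1 = self.lstm1(x, state1)

        # Temporal attention: score each frame feature against the first layer's hidden state.
        h1_exp = h1.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.att_w(torch.cat([h1_exp, feats], dim=2)).squeeze(2)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha.unsqueeze(2) * feats).sum(dim=1)

        h2, c2 = self.lstm2(torch.cat([h1, context], dim=1), state2)

        # Residual mapping: add the first layer's output back to the second layer's
        # output so information about previously generated words is not lost.
        h_res = h1 + h2
        logits = self.out(h_res)
        return logits, (h1, c1), (h2, c2)
```

The key design point the sketch illustrates is the final addition `h1 + h2`: the language-level hidden state bypasses the attention-driven second layer, so deepening the decoder does not discard what the first layer has already encoded about the partially generated sentence.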



Information & Contributors

Information

Published In

cover image World Wide Web
World Wide Web  Volume 22, Issue 2
March 2019
491 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 March 2019

Author Tags

  1. Attention mechanism
  2. LSTM
  3. Residual thought
  4. Video captioning

