Residual attention-based LSTM for video captioning

Published: 01 March 2019

Abstract

Recently, great success has been achieved in video captioning by frameworks built on hierarchical LSTMs, such as stacked LSTM networks. However, once deeper LSTM layers start converging, a degradation problem is exposed: as the number of LSTM layers increases, accuracy saturates and then degrades rapidly, much as in standard deep convolutional networks such as VGG. In this paper, we propose a novel attention-based framework, Residual Attention-based LSTM (Res-ATT), which not only takes advantage of the existing attention mechanism but also considers the sentence's internal information, which is usually lost during transmission. Our key novelty is showing how to integrate residual mapping into a hierarchical LSTM network to alleviate the degradation problem. More specifically, our hierarchical architecture builds on two LSTM layers, and residual mapping is introduced to avoid losing information about previously generated words (i.e., both content information and relationship information). Experimental results on the mainstream MSVD and MSR-VTT datasets show that our framework outperforms state-of-the-art approaches. Furthermore, our automatically generated sentences provide more detailed information that precisely describes a video.
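
The abstract describes a two-layer hierarchical LSTM decoder in which a residual mapping carries the first layer's output past the second layer. The following is a minimal PyTorch sketch of that general idea, not the authors' exact implementation: the module name, layer sizes, and the simple additive attention used here are illustrative assumptions.

```python
# Illustrative sketch (assumed wiring, not the paper's exact Res-ATT model):
# two stacked LSTM cells with temporal attention over frame features and a
# residual connection that adds the first layer's output to the second's.
import torch
import torch.nn as nn


class ResidualAttentionDecoder(nn.Module):
    def __init__(self, feat_dim=1536, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)               # first (language) layer
        self.lstm2 = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)   # second (visual) layer
        self.att_w = nn.Linear(hidden_dim + feat_dim, 1)              # simple additive attention score
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, word, state1, state2):
        # feats: (batch, n_frames, feat_dim) frame-level CNN features
        # word:  (batch,) indices of the previously generated word
        x = self.embed(word)
        h1, c1 = self.lstm1(x, state1)

        # Temporal attention: score each frame feature against the first layer's hidden state.
        h1_exp = h1.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.att_w(torch.cat([h1_exp, feats], dim=2)).squeeze(2)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha.unsqueeze(2) * feats).sum(dim=1)

        h2, c2 = self.lstm2(torch.cat([h1, context], dim=1), state2)

        # Residual mapping: add the first layer's output back to the second layer's
        # output so information about previously generated words is not lost.
        h_res = h1 + h2
        logits = self.out(h_res)
        return logits, (h1, c1), (h2, c2)
```

The key design point the sketch illustrates is the final addition `h1 + h2`: the language-level hidden state bypasses the attention-driven second layer, so deepening the decoder does not discard what the first layer has already encoded about the partially generated sentence.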



Information & Contributors

Information

Published In

cover image World Wide Web
World Wide Web  Volume 22, Issue 2
March 2019
491 pages

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 March 2019

Author Tags

  1. Attention mechanism
  2. LSTM
  3. Residual thought
  4. Video captioning

