
Exploiting long-term temporal dynamics for video captioning

Published: 01 March 2019

Abstract

Automatically describing videos in natural language is a fundamental challenge for computer vision and natural language processing. Recent progress on this problem follows two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in the video. Temporal attention-based models have made considerable progress by weighting the importance of each video frame. However, for a long video, especially one composed of a set of sub-events, the importance of each sub-shot, rather than each frame, should be discovered and leveraged. In this paper, we propose a novel approach, the temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) incorporates both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM generates the sequence of words that describes the video. Experimental results on two public video captioning benchmarks show that TS-LSTM outperforms state-of-the-art methods.
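The approach summarized above reduces to an encode-pool-decode pipeline: segment the frame-feature sequence into sub-shots, encode each sub-shot with a pooling LSTM, then decode a caption with a stacked LSTM. The sketch below is a minimal, hypothetical PyTorch rendering of that pipeline, assuming precomputed CNN frame features and fixed-length sub-shots; the names TPLSTM, TSLSTMCaptioner, and sub_shot_len are illustrative choices, not the authors' implementation.

# Minimal sketch of the TS-LSTM idea from the abstract (assumptions:
# precomputed CNN frame features, fixed-length sub-shots, mean pooling).
import torch
import torch.nn as nn

class TPLSTM(nn.Module):
    """Temporal pooling LSTM: encodes one sub-shot with an LSTM and
    mean-pools its hidden states into a single sub-shot descriptor."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):          # frames: (B, T, feat_dim)
        h, _ = self.lstm(frames)        # (B, T, hidden_dim)
        return h.mean(dim=1)            # (B, hidden_dim)

class TSLSTMCaptioner(nn.Module):
    """Splits frame features into fixed-length sub-shots, encodes each
    with TP-LSTM, and decodes a caption with a two-layer (stacked) LSTM."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 sub_shot_len=16):
        super().__init__()
        self.sub_shot_len = sub_shot_len
        self.encoder = TPLSTM(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                               batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, feat_dim); captions: (B, L) word indices
        B, T, D = frames.shape
        n = T // self.sub_shot_len      # number of whole sub-shots
        shots = frames[:, :n * self.sub_shot_len].reshape(
            B * n, self.sub_shot_len, D)
        shot_feats = self.encoder(shots).reshape(B, n, -1)
        video_feat = shot_feats.mean(dim=1, keepdim=True)   # (B, 1, H)
        # Condition the decoder by prepending the video descriptor
        # to the embedded caption tokens.
        dec_in = torch.cat([video_feat, self.embed(captions)], dim=1)
        h, _ = self.decoder(dec_in)
        return self.out(h[:, 1:])       # (B, L, vocab_size) logits

For instance, TSLSTMCaptioner()(torch.randn(2, 48, 2048), torch.randint(0, 10000, (2, 12))) yields word logits of shape (2, 12, 10000). The paper's actual sub-shot segmentation, pooling operator, and decoder conditioning may differ; this only illustrates the overall flow.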




Published In

World Wide Web, Volume 22, Issue 2
March 2019
491 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Long-term temporal dynamics
  2. RNNs
  3. Video captioning



Cited By

  • (2023) Deep sequential collaborative cognition of vision and language based model for video description. Multimedia Tools and Applications 82(23), 36207–36230. DOI: 10.1007/s11042-023-14887-z
  • (2022) Dynamic Scene Graph Generation via Temporal Prior Inference. Proceedings of the 30th ACM International Conference on Multimedia, 5793–5801. DOI: 10.1145/3503161.3548324
  • (2022) Spatiotemporal contrastive modeling for video moment retrieval. World Wide Web 26(4), 1525–1544. DOI: 10.1007/s11280-022-01105-3
  • (2022) Relation-aware aggregation network with auxiliary guidance for text-based person search. World Wide Web 25(4), 1565–1582. DOI: 10.1007/s11280-021-00953-9
  • (2021) Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods. SN Computer Science 2(2). DOI: 10.1007/s42979-021-00487-x
  • (2020) One-shot Scene Graph Generation. Proceedings of the 28th ACM International Conference on Multimedia, 3090–3098. DOI: 10.1145/3394171.3414025
  • (2019) IARNN-Based Semantic-Containing Double-Level Embedding Bi-LSTM for Question-and-Answer Matching. Computational Intelligence and Neuroscience 2019. DOI: 10.1155/2019/6074840
  • (2019) Group sparse based locality-sensitive dictionary learning for video semantic analysis. Multimedia Tools and Applications 78(6), 6721–6744. DOI: 10.1007/s11042-018-6417-3
