
Exploiting long-term temporal dynamics for video captioning

Published: 01 March 2019

Abstract

Automatically describing videos in natural language is a fundamental challenge for computer vision and natural language processing. Recent progress on this problem follows two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in the video. Temporal attention-based models have made considerable progress by weighting the importance of each video frame. However, for a long video, especially one composed of a set of sub-events, the importance of each sub-shot, rather than each frame, should be discovered and leveraged. In this paper, we propose a novel approach, the temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) incorporates both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM generates the sequence of words that describes the video. Experimental results on two public video captioning benchmarks show that TS-LSTM outperforms state-of-the-art methods.
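The approach summarized above reduces to an encode-pool-decode pipeline: segment the frame-feature sequence into sub-shots, encode each sub-shot with a pooling LSTM, then decode a caption with a stacked LSTM. The sketch below is a minimal, hypothetical PyTorch rendering of that pipeline, assuming precomputed CNN frame features and fixed-length sub-shots; the names TPLSTM, TSLSTMCaptioner, and sub_shot_len are illustrative choices, not the authors' implementation.

# Minimal sketch of the TS-LSTM idea from the abstract (assumptions:
# precomputed CNN frame features, fixed-length sub-shots, mean pooling).
import torch
import torch.nn as nn

class TPLSTM(nn.Module):
    """Temporal pooling LSTM: encodes one sub-shot with an LSTM and
    mean-pools its hidden states into a single sub-shot descriptor."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames):          # frames: (B, T, feat_dim)
        h, _ = self.lstm(frames)        # (B, T, hidden_dim)
        return h.mean(dim=1)            # (B, hidden_dim)

class TSLSTMCaptioner(nn.Module):
    """Splits frame features into fixed-length sub-shots, encodes each
    with TP-LSTM, and decodes a caption with a two-layer (stacked) LSTM."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000,
                 sub_shot_len=16):
        super().__init__()
        self.sub_shot_len = sub_shot_len
        self.encoder = TPLSTM(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                               batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, feat_dim); captions: (B, L) word indices
        B, T, D = frames.shape
        n = T // self.sub_shot_len      # number of whole sub-shots
        shots = frames[:, :n * self.sub_shot_len].reshape(
            B * n, self.sub_shot_len, D)
        shot_feats = self.encoder(shots).reshape(B, n, -1)
        video_feat = shot_feats.mean(dim=1, keepdim=True)   # (B, 1, H)
        # Condition the decoder by prepending the video descriptor
        # to the embedded caption tokens.
        dec_in = torch.cat([video_feat, self.embed(captions)], dim=1)
        h, _ = self.decoder(dec_in)
        return self.out(h[:, 1:])       # (B, L, vocab_size) logits

For instance, TSLSTMCaptioner()(torch.randn(2, 48, 2048), torch.randint(0, 10000, (2, 12))) yields word logits of shape (2, 12, 10000). The paper's actual sub-shot segmentation, pooling operator, and decoder conditioning may differ; this only illustrates the overall flow.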




Published In

World Wide Web, Volume 22, Issue 2
March 2019
491 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Long-term temporal dynamics
  2. RNNs
  3. Video captioning



Cited By

  • (2023) Deep sequential collaborative cognition of vision and language based model for video description. Multimedia Tools and Applications 82(23), 36207–36230. DOI: 10.1007/s11042-023-14887-z
  • (2022) Dynamic Scene Graph Generation via Temporal Prior Inference. Proceedings of the 30th ACM International Conference on Multimedia, 5793–5801. DOI: 10.1145/3503161.3548324
  • (2022) Spatiotemporal contrastive modeling for video moment retrieval. World Wide Web 26(4), 1525–1544. DOI: 10.1007/s11280-022-01105-3
  • (2022) Relation-aware aggregation network with auxiliary guidance for text-based person search. World Wide Web 25(4), 1565–1582. DOI: 10.1007/s11280-021-00953-9
  • (2021) Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods. SN Computer Science 2(2). DOI: 10.1007/s42979-021-00487-x
  • (2020) One-shot Scene Graph Generation. Proceedings of the 28th ACM International Conference on Multimedia, 3090–3098. DOI: 10.1145/3394171.3414025
  • (2019) IARNN-Based Semantic-Containing Double-Level Embedding Bi-LSTM for Question-and-Answer Matching. Computational Intelligence and Neuroscience 2019. DOI: 10.1155/2019/6074840
  • (2019) Group sparse based locality-sensitive dictionary learning for video semantic analysis. Multimedia Tools and Applications 78(6), 6721–6744. DOI: 10.1007/s11042-018-6417-3
