Hierarchical & multimodal video captioning

Published: 01 October 2017

Abstract

Recently, video captioning has achieved significant progress through advances in Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Given a video, a deep learning approach is applied to encode the visual information and generate the corresponding caption. However, this direct visual-to-textual translation ignores rich intermediate descriptions such as objects, scenes, and actions. In this paper, we propose to discover and integrate rich and primeval external knowledge (i.e., frame-based image captions) to benefit the video captioning task. We propose a Hierarchical & Multimodal Video Caption (HMVC) model that jointly learns the dynamics within both the visual and textual modalities and infers a sentence of arbitrary length from an input video with an arbitrary number of frames. Specifically, we argue that the latent semantic discovery module transfers external knowledge to generate complex and helpful complementary cues. We comprehensively evaluate the HMVC model on the Microsoft Video Description Corpus (MSVD), the MPII Movie Description Dataset (MPII-MD), and the dataset for the 2016 MSR Video to Text challenge (MSR-VTT), and attain competitive performance. In addition, we evaluate the generalization properties of the proposed model by fine-tuning and evaluating it on different datasets. To the best of our knowledge, this is the first time such an analysis has been applied to the video captioning task.
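
To make the described setup concrete, below is a minimal, hypothetical PyTorch sketch of a multimodal encoder-decoder of the kind the abstract outlines: per-frame visual features and per-frame caption (textual) features are encoded separately, fused, and used to initialize a word-by-word caption decoder. This is not the authors' HMVC implementation; the module names, dimensions, concatenation-based fusion, and the use of frame-caption embeddings as the external textual cue are all illustrative assumptions based only on the abstract.

```python
# Minimal sketch (NOT the HMVC model): a two-stream encoder with late fusion
# feeding an LSTM caption decoder. All sizes and design choices are assumptions.
import torch
import torch.nn as nn


class MultimodalVideoCaptioner(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=300, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encode the sequence of per-frame CNN features (visual modality).
        self.visual_enc = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        # Encode per-frame caption embeddings (textual modality, i.e. the
        # "external knowledge" cues mentioned in the abstract).
        self.text_enc = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        # Fuse both modalities into a single video representation.
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        # Decoder generates the output caption one word at a time.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, frame_caption_feats, caption_tokens):
        # frame_feats:         (B, T, visual_dim) -- any number of frames T
        # frame_caption_feats: (B, T, text_dim)   -- per-frame caption embeddings
        # caption_tokens:      (B, L)             -- target words (teacher forcing)
        _, (h_v, _) = self.visual_enc(frame_feats)        # h_v: (1, B, H)
        _, (h_t, _) = self.text_enc(frame_caption_feats)  # h_t: (1, B, H)
        fused = torch.tanh(self.fuse(torch.cat([h_v[-1], h_t[-1]], dim=-1)))  # (B, H)

        # Initialize the decoder with the fused multimodal state.
        h0 = fused.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_in = self.embed(caption_tokens)               # (B, L, H)
        dec_out, _ = self.decoder(dec_in, (h0, c0))
        return self.out(dec_out)                          # (B, L, vocab_size) word logits


# Tiny smoke test with random tensors: 16 frames in, a 12-word caption out.
model = MultimodalVideoCaptioner()
logits = model(torch.randn(2, 16, 2048),
               torch.randn(2, 16, 300),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

At inference time, such a decoder would be unrolled greedily or with beam search from a start token instead of being teacher-forced, which is what allows sentences of arbitrary length.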




Published In

Computer Vision and Image Understanding, Volume 163, Issue C
October 2017, 89 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 October 2017

Author Tags

  1. Deep learning
  2. Multi-modal fusion
  3. Semantic discovery
  4. Video to text

Qualifiers

  • Research-article


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 21 Dec 2024

Citations

Cited By

  • (2023) Machine Generation of Audio Description for Blind and Visually Impaired People. ACM Transactions on Accessible Computing, 16(2), 1-28. https://doi.org/10.1145/3590955. Online publication date: 24-Jun-2023.
  • (2023) A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29(6), 3781-3804. https://doi.org/10.1007/s00530-023-01175-x. Online publication date: 1-Dec-2023.
  • (2023) Multimodal-enhanced hierarchical attention network for video captioning. Multimedia Systems, 29(5), 2469-2482. https://doi.org/10.1007/s00530-023-01130-w. Online publication date: 15-Jul-2023.
  • (2023) VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems, 29(5), 2575-2589. https://doi.org/10.1007/s00530-023-01124-8. Online publication date: 13-Jun-2023.
  • (2022) V2T: video to text framework using a novel automatic shot boundary detection algorithm. Multimedia Tools and Applications, 81(13), 17989-18009. https://doi.org/10.1007/s11042-022-12343-y. Online publication date: 1-May-2022.
  • (2022) Visualized Analysis of the Emerging Trends of Automated Audio Description Technology. Machine Learning for Cyber Security, 99-108. https://doi.org/10.1007/978-3-031-20096-0_8. Online publication date: 2-Dec-2022.
  • (2021) What we see in a photograph: content selection for image captioning. The Visual Computer: International Journal of Computer Graphics, 37(6), 1309-1326. https://doi.org/10.1007/s00371-020-01867-9. Online publication date: 1-Jun-2021.
  • (2020) Video Storytelling: Textual Summaries for Events. IEEE Transactions on Multimedia, 22(2), 554-565. https://doi.org/10.1109/TMM.2019.2930041. Online publication date: 24-Jan-2020.
  • (2020) Object-aware semantics of attention for image captioning. Multimedia Tools and Applications, 79(3-4), 2013-2030. https://doi.org/10.1007/s11042-019-08209-5. Online publication date: 1-Jan-2020.
  • (2020) A review on the long short-term memory model. Artificial Intelligence Review, 53(8), 5929-5955. https://doi.org/10.1007/s10462-020-09838-1. Online publication date: 1-Dec-2020.
