
Video Storytelling: Textual Summaries for Events

Published: 01 February 2020

Abstract

Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focused on generating a single-sentence description for visual content, more recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address these challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from the past and the future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable this study. We compare our method with multiple state-of-the-art baselines and show that it achieves better performance in terms of both quantitative measures and a user study.
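
The abstract describes two technical components: a residual bidirectional RNN that embeds video clips in a joint space from which sentences are retrieved, and a Narrator clip-selection agent trained with policy gradients on a textual-metric reward. The sketch below illustrates these two ideas in PyTorch under stated assumptions; all module names, dimensions, and hyperparameters are hypothetical placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBiGRUEncoder(nn.Module):
    """Encodes per-clip features with past and future context; a residual
    connection adds a projection of the input back onto the recurrent output."""

    def __init__(self, feat_dim=2048, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.skip = nn.Linear(feat_dim, 2 * hidden_dim)    # match the BiGRU output size
        self.to_embed = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, clip_feats):                         # (B, T, feat_dim)
        ctx, _ = self.bigru(clip_feats)                    # (B, T, 2 * hidden_dim)
        ctx = ctx + self.skip(clip_feats)                  # residual skip over the RNN
        return F.normalize(self.to_embed(ctx), dim=-1)     # unit-norm joint embedding


def reinforce_step(clip_logits, reward, baseline=0.0):
    """One REINFORCE update for a clip-selection policy: sample a select/skip
    decision per clip and weight the log-probability by (reward - baseline),
    where `reward` is the textual-metric score of the resulting story."""
    dist = torch.distributions.Bernoulli(logits=clip_logits)
    actions = dist.sample()                                # 0/1 selection per clip
    log_prob = dist.log_prob(actions).sum(dim=-1)          # (B,)
    loss = -(reward - baseline) * log_prob                 # policy-gradient surrogate
    return loss.mean(), actions


if __name__ == "__main__":
    # Toy usage: 2 videos, 10 clips each, 2048-d clip features.
    encoder = ResidualBiGRUEncoder()
    emb = encoder(torch.randn(2, 10, 2048))                # compare to sentence embeddings by cosine similarity
    clip_logits = torch.randn(2, 10, requires_grad=True)   # stand-in for a Narrator network's outputs
    loss, selected = reinforce_step(clip_logits, reward=0.42)
    loss.backward()                                        # gradients flow to the selection logits
```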


        Information & Contributors

        Information

        Published In

        cover image IEEE Transactions on Multimedia
        IEEE Transactions on Multimedia  Volume 22, Issue 2
        Feb. 2020
        281 pages

        Publisher

        IEEE Press

        Publication History

        Published: 01 February 2020

        Qualifiers

        • Research-article

        Cited By

• (2024) Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration. IEEE Transactions on Multimedia, vol. 26, pp. 2456-2466. https://doi.org/10.1109/TMM.2023.3296944. Online publication date: 1-Jan-2024.
• (2023) Video Annotation & Descriptions using Machine Learning & Deep learning: Critical Survey of methods. Proceedings of the 2023 Fifteenth International Conference on Contemporary Computing, pp. 722-735. https://doi.org/10.1145/3607947.3608091. Online publication date: 3-Aug-2023.
• (2023) Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering. IEEE Transactions on Multimedia, vol. 26, pp. 6131-6141. https://doi.org/10.1109/TMM.2023.3345172. Online publication date: 20-Dec-2023.
• (2023) Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations. IEEE Transactions on Multimedia, vol. 25, pp. 7772-7785. https://doi.org/10.1109/TMM.2022.3227416. Online publication date: 1-Jan-2023.
• (2023) Multimodal-Based and Aesthetic-Guided Narrative Video Summarization. IEEE Transactions on Multimedia, vol. 25, pp. 4894-4908. https://doi.org/10.1109/TMM.2022.3183394. Online publication date: 1-Jan-2023.
• (2023) Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching. IEEE Transactions on Multimedia, vol. 25, pp. 1320-1332. https://doi.org/10.1109/TMM.2022.3141603. Online publication date: 1-Jan-2023.
• (2022) Narrative Dataset: Towards Goal-Driven Narrative Generation. Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos, pp. 7-12. https://doi.org/10.1145/3552463.3557021. Online publication date: 10-Oct-2022.
• (2022) Compute to Tell the Tale: Goal-Driven Narrative Generation. Proceedings of the 30th ACM International Conference on Multimedia, pp. 6875-6882. https://doi.org/10.1145/3503161.3549202. Online publication date: 10-Oct-2022.
• (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, vol. 31, pp. 5559-5569. https://doi.org/10.1109/TIP.2022.3195643. Online publication date: 1-Jan-2022.
• (2022) An attention based dual learning approach for video captioning. Applied Soft Computing, vol. 117, part C. https://doi.org/10.1016/j.asoc.2021.108332. Online publication date: 12-May-2022.
