
Video Storytelling: Textual Summaries for Events

Published: 01 February 2020

Abstract

Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focused on generating a single-sentence description for visual content, more recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address these challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from the past and the future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable this study. We compare our method with multiple state-of-the-art baselines and show that it achieves better performance in terms of both quantitative measures and a user study.
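
The abstract describes two technical components: a residual bidirectional RNN that embeds video clips in a joint space from which sentences are retrieved, and a Narrator clip-selection agent trained with policy gradients on a textual-metric reward. The sketch below illustrates these two ideas in PyTorch under stated assumptions; all module names, dimensions, and hyperparameters are hypothetical placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBiGRUEncoder(nn.Module):
    """Encodes per-clip features with past and future context; a residual
    connection adds a projection of the input back onto the recurrent output."""

    def __init__(self, feat_dim=2048, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.skip = nn.Linear(feat_dim, 2 * hidden_dim)    # match the BiGRU output size
        self.to_embed = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, clip_feats):                         # (B, T, feat_dim)
        ctx, _ = self.bigru(clip_feats)                    # (B, T, 2 * hidden_dim)
        ctx = ctx + self.skip(clip_feats)                  # residual skip over the RNN
        return F.normalize(self.to_embed(ctx), dim=-1)     # unit-norm joint embedding


def reinforce_step(clip_logits, reward, baseline=0.0):
    """One REINFORCE update for a clip-selection policy: sample a select/skip
    decision per clip and weight the log-probability by (reward - baseline),
    where `reward` is the textual-metric score of the resulting story."""
    dist = torch.distributions.Bernoulli(logits=clip_logits)
    actions = dist.sample()                                # 0/1 selection per clip
    log_prob = dist.log_prob(actions).sum(dim=-1)          # (B,)
    loss = -(reward - baseline) * log_prob                 # policy-gradient surrogate
    return loss.mean(), actions


if __name__ == "__main__":
    # Toy usage: 2 videos, 10 clips each, 2048-d clip features.
    encoder = ResidualBiGRUEncoder()
    emb = encoder(torch.randn(2, 10, 2048))                # compare to sentence embeddings by cosine similarity
    clip_logits = torch.randn(2, 10, requires_grad=True)   # stand-in for a Narrator network's outputs
    loss, selected = reinforce_step(clip_logits, reward=0.42)
    loss.backward()                                        # gradients flow to the selection logits
```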


        Information & Contributors

        Information

        Published In

        cover image IEEE Transactions on Multimedia
        IEEE Transactions on Multimedia  Volume 22, Issue 2
        Feb. 2020
        281 pages

        Publisher

        IEEE Press

        Publication History

        Published: 01 February 2020

        Qualifiers

        • Research-article

        Cited By

• (2024) Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration. IEEE Transactions on Multimedia, vol. 26, pp. 2456-2466. https://doi.org/10.1109/TMM.2023.3296944. Online publication date: 1-Jan-2024.
• (2023) Video Annotation & Descriptions using Machine Learning & Deep learning: Critical Survey of methods. Proceedings of the 2023 Fifteenth International Conference on Contemporary Computing, pp. 722-735. https://doi.org/10.1145/3607947.3608091. Online publication date: 3-Aug-2023.
• (2023) Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering. IEEE Transactions on Multimedia, vol. 26, pp. 6131-6141. https://doi.org/10.1109/TMM.2023.3345172. Online publication date: 20-Dec-2023.
• (2023) Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations. IEEE Transactions on Multimedia, vol. 25, pp. 7772-7785. https://doi.org/10.1109/TMM.2022.3227416. Online publication date: 1-Jan-2023.
• (2023) Multimodal-Based and Aesthetic-Guided Narrative Video Summarization. IEEE Transactions on Multimedia, vol. 25, pp. 4894-4908. https://doi.org/10.1109/TMM.2022.3183394. Online publication date: 1-Jan-2023.
• (2023) Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching. IEEE Transactions on Multimedia, vol. 25, pp. 1320-1332. https://doi.org/10.1109/TMM.2022.3141603. Online publication date: 1-Jan-2023.
• (2022) Narrative Dataset: Towards Goal-Driven Narrative Generation. Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos, pp. 7-12. https://doi.org/10.1145/3552463.3557021. Online publication date: 10-Oct-2022.
• (2022) Compute to Tell the Tale: Goal-Driven Narrative Generation. Proceedings of the 30th ACM International Conference on Multimedia, pp. 6875-6882. https://doi.org/10.1145/3503161.3549202. Online publication date: 10-Oct-2022.
• (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, vol. 31, pp. 5559-5569. https://doi.org/10.1109/TIP.2022.3195643. Online publication date: 1-Jan-2022.
• (2022) An attention based dual learning approach for video captioning. Applied Soft Computing, vol. 117, part C. https://doi.org/10.1016/j.asoc.2021.108332. Online publication date: 12-May-2022.
