Dilated temporal relational adversarial network for generic video summarization

Yujia Zhang ORCID: orcid.org/0000-0002-2335-7657^1,2,
Michael Kampffmeyer³,
Xiaodan Liang⁴,
Dingwen Zhang⁵,
Min Tan^1,2 &
…
Eric P. Xing⁴

746 Accesses
Explore all metrics

Abstract

The large amount of videos popping up every day, make it more and more critical that key information within videos can be extracted and understood in a very short time. Video summarization, the task of finding the smallest subset of frames, which still conveys the whole story of a given video, is thus of great significance to improve efficiency of video understanding. We propose a novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to achieve frame-level video summarization. Given a video, it selects the set of key frames, which contain the most meaningful and compact information. Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with three-player loss in an adversarial manner. A new dilated temporal relation (DTR) unit is introduced to enhance temporal representation capturing. The generator uses this unit to effectively exploit global multi-scale temporal context to select key frames and to complement the commonly used Bi-LSTM. To ensure that summaries capture enough key video representation from a global perspective rather than a trivial randomly shorten sequence, we present a discriminator that learns to enforce both the information completeness and compactness of summaries via a three-player loss. The loss includes the generated summary loss, the random summary loss, and the real summary (ground-truth) loss, which play important roles for better regularizing the learned model to obtain useful summaries. Comprehensive experiments on three public datasets show the effectiveness of the proposed approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Encoder-Decoder Architectures based Video Summarization using Key-Shot Selection Model

Article 16 September 2023

A comprehensive study of automatic video summarization techniques

Article 13 March 2023

A multi-stage deep adversarial network for video summarization with knowledge distillation

Article 24 January 2022

References

Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/, software available from tensorflow.org
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial nets. In: Proceedings of international conference on machine learning
Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:170605587
Chen M, Ding G, Zhao S, Chen H, Liu Q, Han J (2017) Reference based LSTM for image captioning. In: Proceedings of the 31st AAAI conference on artificial intelligence, pp 3981–3987
Chen H, Ding G, Lin Z, Zhao S, Han J (2018) Show, observe and tell: attribute-driven attention model for image captioning. In: Proceedings of the 27th international joint conference on artificial intelligence, pp 606–612
Chu WS, Song Y, Jaimes A (2015) Video co-summarization: video summarization by visual co-occurrence. In: Proceedings of the IEEE conference on computer vision and pattern recognition
De Avila SEF, Lopes APB, da Luz A Jr, de Albuquerque Araújo A (2011) Vsumm: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn Lett 32(1):56–68
Article Google Scholar
Feichtenhofer C, Pinz A, Wildes R (2016) Spatiotemporal residual networks for video action recognition. In: Advances in neural information processing systems, pp 3468–3476
Fu TJ, Tai SH, Chen HT (2019) Attentive and adversarial learning for video summarization. In: Proceedings of the IEEE winter conference on applications of computer vision, pp 1579–1587
Ghosh A, Kulharia V, Mukerjee A, Namboodiri V, Bansal M (2016) Contextual rnn-gans for abstract reasoning diagram generation. arXiv:160909444
Gong B, Chao WL, Grauman K, Sha F (2014) Diverse sequential subset selection for supervised video summarization. In: Advances in neural information processing systems, pp 2069–2077
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18 (5–6):602–610
Article Google Scholar
Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing, pp 6645–6649
Gygli M, Grabner H, Riemenschneider H, Van Gool L (2014) Creating summaries from user videos. In: Proceedings of European conference on computer vision, pp 505–520
Chapter Google Scholar
Gygli M, Grabner H, Van Gool L (2015) Video summarization by learning submodular mixtures of objectives. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3090–3098
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Article Google Scholar
Huang S, Li X, Zhang Z, Wu F, Han J (2019) User-ranking video summarization with multi-stage spatio–temporal representation. IEEE Trans Image Process 28(6):2654–2664
Article MathSciNet Google Scholar
Ji Z, Xiong K, Pang Y, Li X (2017) Video summarization with attention-based encoder-decoder networks. arXiv:170809545
Ji Z, Xiong K, Pang Y, Li X (2019) Video summarization with attention-based encoder-decoder networks. IEEE Transactions on Circuits and Systems for Video Technology
Kim G, Sigal L, Xing EP (2014) Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 4225–4232
Kulesza A, Taskar B, et al. (2012) Determinantal point processes for machine learning. Found Trends®; Mach Learn 5(2–3):123–286
Article Google Scholar
Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2016) Temporal convolutional networks for action segmentation and detection. arXiv:161105267
Li Y, Wang L, Yang T, Gong B (2018) How local is the local diversity? reinforcing sequential determinantal point processes with dynamic ground sets for supervised video summarizations. In: Proceedings of the European conference on computer vision, pp 156–174
Liang X, Lee L, Dai W, Xing EP (2017) Dual motion GAN for future-flow embedded video prediction
Mahasseni B, Lam M, Todorovic S (2017) Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Mao X, Li Q, Xie H, Lau RY, Wang Z, Smolley SP (2017) Least squares generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2813–2821
Mathieu MF, Zhao JJ, Zhao J, Ramesh A, Sprechmann P, LeCun Y (2016) Disentangling factors of variation in deep representation using adversarial training. In: Advances in neural information processing systems, pp 5040–5048
Meng J, Wang H, Yuan J, Tan YP (2016) From keyframes to key objects: Video summarization by representative object proposal selection. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 1039–1048
Panda R, Roy-Chowdhury AK (2017) Collaborative summarization of topic-related videos. In: Proc IEEE Conf Comput Vis Pattern Recogn, vol 2, p 5
Plummer BA, Brown M, Lazebnik S (2017) Enhancing video summarization via vision-language embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Potapov D, Douze M, Harchaoui Z, Schmid C (2014) Category-specific video summarization. In: Proceedings of European conference on computer vision, pp 540–555
Chapter Google Scholar
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:151106434
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. arXiv:160505396
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Article MathSciNet Google Scholar
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, pp 2234–2242
Sharghi A, Gong B, Shah M (2016) Query-focused extractive video summarization. In: Proceedings of European conference on computer vision
Shrivastava A, Pfister T, Tuzel O, Susskind J, Wang W, Webb R (2017) Learning from simulated and unsupervised images through adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2107–2116
Smeaton AF, Over P, Kraaij W (2006) Evaluation campaigns and TRECVid. In: Proceedings of the 8th ACM international workshop on multimedia information retrieval, pp 321–330
Song Y, Vallmitjana J, Stent A, Jaimes A (2015) TVSum: summarizing web videos using titles. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5179–5187
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4489–4497
Xu J, Mukherjee L, Li Y, Warner J, Rehg JM, Singh V (2015) Gaze-enabled egocentric video summarization via constrained submodular maximization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2235–2244
Yao T, Mei T, Rui Y (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 982–990
Zatsushi K, Luc VG, Yoshitaka U, Tatsuya H (2018) Viewpoint-aware video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Zhang K, Chao WL, Sha F, Grauman K (2016) Summary transfer: exemplar-based subset selection for video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1059–1067
Zhang K, Chao WL, Sha F, Grauman K (2016) Video summarization with long short-term memory. In: Proceedings of European conference on computer vision, pp 766–782
Chapter Google Scholar
Zhang S, Zhu Y, Roy-Chowdhury AK (2016) Context-aware surveillance video summarization. IEEE Trans Image Process 25(11):5469–5478
Article MathSciNet Google Scholar
Zhang Y, Kampffmeyer M, Zhao X, Tan M (2019) Dtr-gan: Dilated temporal relational adversarial network for video summarization. In: Proceedings of the ACM turing celebration conference-China. ACM, p 89
Zhao B, Xing EP (2014) Quasi real-time summarization for consumer videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2513–2520
Zhao S, Zhao X, Ding G, Keutzer K (2018) EmotionGAN: unsupervised domain adaptation for learning discrete probability distributions of image emotions. In: ACM Multimedia conference on multimedia conference, pp 1319–1327
Zhou K, Qiao Y, Xiang T (2018) Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Proceedings of AAAI conference on artificial intelligence, pp 7582–7589
Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:170310593

Download references

Acknowledgements

We would like to thank Xiaohui Zeng for her valuable discussions. This project is supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This work is also partially funded by the National Natural Science Foundation of China (Grant No. 61673378 and 61333016), and Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.

Author information

Authors and Affiliations

Institute of Automation, Chinese Academy of Sciences, 100190, Beijing, China
Yujia Zhang & Min Tan
University of Chinese Academy of Sciences, 100049, Beijing, China
Yujia Zhang & Min Tan
Machine Learning Group, UiT The Arctic University of Norway, 9019, Tromsø, Norway
Michael Kampffmeyer
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Xiaodan Liang & Eric P. Xing
Xidian University, 710071, Xi’an, China
Dingwen Zhang

Authors

Yujia Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Michael Kampffmeyer
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodan Liang
View author publications
You can also search for this author in PubMed Google Scholar
Dingwen Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Min Tan
View author publications
You can also search for this author in PubMed Google Scholar
Eric P. Xing
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yujia Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work done while Yujia Zhang was at CMU

Electronic supplementary material

Below is the link to the electronic supplementary material.

(MOV 108 MB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, Y., Kampffmeyer, M., Liang, X. et al. Dilated temporal relational adversarial network for generic video summarization. Multimed Tools Appl 78, 35237–35261 (2019). https://doi.org/10.1007/s11042-019-08175-y

Download citation

Received: 06 November 2018
Revised: 07 June 2019
Accepted: 02 September 2019
Published: 12 October 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s11042-019-08175-y

Dilated temporal relational adversarial network for generic video summarization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Encoder-Decoder Architectures based Video Summarization using Key-Shot Selection Model

A comprehensive study of automatic video summarization techniques

A multi-stage deep adversarial network for video summarization with knowledge distillation

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Dilated temporal relational adversarial network for generic video summarization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Encoder-Decoder Architectures based Video Summarization using Key-Shot Selection Model

A comprehensive study of automatic video summarization techniques

A multi-stage deep adversarial network for video summarization with knowledge distillation

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation