Abstract
Although the field of video generation has made progress in network performance and computational efficiency, there is still considerable room for improvement in the number of predicted frames and their clarity. In this paper, a deep learning model is proposed to predict future video frames. The model can predict video streams with complex pixel distributions up to 32 frames ahead. Our framework consists of two modules: a fusion-image prediction generator and an image-to-video translator. The fusion-image prediction generator is realized by a U-Net built from 3D convolutions, and the image-to-video translator is a conditional generative adversarial network built from 2D convolutions. Given a set of fusion images and labels, the fusion-image prediction generator learns to fit the pixel distribution of the label images from the fusion images. The image-to-video translator then translates the generator's output into future video frames. In addition, this paper proposes an accompanying convolution model and a corresponding algorithm for improving image sharpness. Experimental results demonstrate the effectiveness of the framework.
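Using the symbols defined under Abbreviations (z noise, x condition, y label, G generator, D discriminator, V objective function), the translator's adversarial objective is presumably the standard conditional-GAN minimax game of Mirza and Osindero; a hedged reconstruction, not necessarily the paper's exact formulation:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x, y}\!\left[\log D(x, y)\right] +
  \mathbb{E}_{x, z}\!\left[\log\bigl(1 - D\bigl(x, G(x, z)\bigr)\bigr)\right]
```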
Abbreviations
- z : Gaussian noise
- x : Conditioning input (in this paper, the edge detection map)
- X : Set of edge detection maps
- y : Labels (in this paper, the real samples)
- s : Semantic segmentation map
- S : Set of semantic segmentation maps
- m : Input fusion image
- M : Set of input fusion images
- \( \hat{m} \) : Predicted fusion image
- \( \hat{M} \) : Set of predicted fusion images
- G : Generator function
- D : Discriminator function
- F(X, Y) : Structural similarity function
- ϕ : Variance
- μ : Mean
- w : Weight matrix
- h : Height
- c : Number of channels
- C : Constants
- V : Objective function
- L : Loss function
- n : Batch size
- N : Number of loss-function samples
- η : Learning rate
- \( {\left\Vert \bullet \right\Vert}_2^2 \) : Square of the L2 norm
- Tr[·] : Trace of a matrix
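The glossary entries F(X, Y), ϕ (variance), μ (mean), and C (constants) suggest the structural similarity (SSIM) index is used for image-quality evaluation. A minimal pure-Python sketch under that assumption, following the standard SSIM definition; the patch values and constant defaults below are illustrative, not taken from the paper:

```python
def ssim(x, y, c1=6.5025, c2=58.5225):
    """SSIM between two equal-length flattened grayscale patches.

    Combines the means (mu), variances, and covariance of the two
    patches; c1 and c2 are small constants for numerical stability.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

By construction the index equals 1 for identical patches and decreases as the patches' means, variances, or covariance diverge.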
Funding
This work is partially supported by the National Natural Science Foundation of China (61461053) and the Yunnan University of China Postgraduate Science Foundation under Grant (2020306).
Cite this article
Jing, B., Ding, H., Yang, Z. et al. Video prediction: a step-by-step improvement of a video synthesis network. Appl Intell 52, 3640–3652 (2022). https://doi.org/10.1007/s10489-021-02500-5