

Trapezoid-structured LSTM with segregated gates and bridge joints for video frame inpainting

  • Original article
  • Published in The Visual Computer

Abstract

This work considers the video frame inpainting problem, in which several preceding and succeeding frames are given and the goal is to predict the middle frames. The state-of-the-art solution applies bidirectional long short-term memory (LSTM) networks, which suffer from a spatial-temporal mismatch problem. In this paper, we propose a trapezoid-structured LSTM architecture, called T-LSTM-sbm, for video frame inpainting with three designs: (i) segregated spatial-temporal gates, (ii) bridge joints, and (iii) multi-kernel LSTM. To prevent the spatial-temporal mismatch problem, the trapezoid structure reduces the number of LSTM nodes by two at each layer as features are passed through the stacked LSTM layers, which makes the model converge to the inpainted results more effectively. The segregated spatial and temporal gates learn better spatial and temporal features by gating each information path individually. To relieve the information loss that occurs as the trapezoidal layers converge, bridge joints between layers preserve useful information. The multiple kernels in the LSTM enable the extraction of multi-scale information flows. T-LSTM-sbm is shown to outperform state-of-the-art solutions in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on three common datasets: KTH Action, HMDB-51, and UCF-101.
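
To make the segregated-gate and multi-kernel ideas concrete, the sketch below shows one plausible form of such a ConvLSTM cell in Python/PyTorch: convolutions with several kernel sizes are fused by summation, and the spatial (current-frame) and temporal (recurrent) paths keep their own gates and candidates. The class name, the gate wiring, and all hyperparameters are illustrative assumptions, not the paper's actual T-LSTM-sbm implementation; the trapezoid stacking (each layer having two fewer LSTM nodes than the one below) and the bridge joints between layers are not shown here.

import torch
import torch.nn as nn

class MultiKernelConvLSTMCell(nn.Module):
    # Hypothetical ConvLSTM cell combining multi-kernel convolutions with
    # segregated spatial/temporal gating. Illustrative assumption only.
    def __init__(self, in_ch, hid_ch, kernel_sizes=(3, 5)):
        super().__init__()
        self.hid_ch = hid_ch
        # Multi-kernel convolutions over the current frame (spatial path)
        # and over the previous hidden state (temporal path); outputs are
        # summed across kernel sizes to fuse multi-scale information.
        self.x_convs = nn.ModuleList(
            nn.Conv2d(in_ch, 4 * hid_ch, k, padding=k // 2) for k in kernel_sizes)
        self.h_convs = nn.ModuleList(
            nn.Conv2d(hid_ch, 4 * hid_ch, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x, state):
        h, c = state
        zx = sum(conv(x) for conv in self.x_convs)  # spatial features/gates
        zh = sum(conv(h) for conv in self.h_convs)  # temporal features/gates
        i_s, f_s, o_s, g_s = torch.split(zx, self.hid_ch, dim=1)
        i_t, f_t, o_t, g_t = torch.split(zh, self.hid_ch, dim=1)
        # Segregated gating: each path has its own input gate and candidate;
        # the forget and output gates combine contributions from both paths.
        c = (torch.sigmoid(f_s + f_t) * c
             + torch.sigmoid(i_s) * torch.tanh(g_s)
             + torch.sigmoid(i_t) * torch.tanh(g_t))
        h = torch.sigmoid(o_s + o_t) * torch.tanh(c)
        return h, c

# Toy usage: one cell step on a single 64x64 RGB frame.
cell = MultiKernelConvLSTMCell(in_ch=3, hid_ch=16)
x = torch.randn(1, 3, 64, 64)
h = torch.zeros(1, 16, 64, 64)
c = torch.zeros(1, 16, 64, 64)
h, c = cell(x, (h, c))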


Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

Y.-C. Tseng’s research is co-sponsored by the Industrial Technology Research Institute (ITRI), the National Science and Technology Council, and the Ministry of Science and Technology (MOST). This work is also financially supported by the “Center for Open Intelligent Connectivity” under the “Higher Education Sprout Project” of National Yang Ming Chiao Tung University (NYCU) and the Ministry of Education (MOE), Taiwan.

Author information

Corresponding author

Correspondence to Ting-Hui Chiang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chiang, TH., Lin, YT., Lin, J.CH. et al. Trapezoid-structured LSTM with segregated gates and bridge joints for video frame inpainting. Vis Comput 40, 1069–1082 (2024). https://doi.org/10.1007/s00371-023-02832-y
