Abstract
This work considers the video frame inpainting problem, in which several preceding and following frames are given and the goal is to predict the middle frames. The state-of-the-art solution applies bidirectional long short-term memory (LSTM) networks, which suffer from a spatial-temporal mismatch problem. In this paper, we propose a trapezoid-structured LSTM architecture, called T-LSTM-sbm, for video frame inpainting with three designs: (i) segregated spatial-temporal gates, (ii) bridge joints, and (iii) multi-kernel LSTM. To prevent the spatial-temporal mismatch problem, the trapezoid structure reduces the number of LSTM nodes by two after each layer as features pass through the multi-layered LSTM nodes, which makes the model converge to the inpainted results more effectively. The segregated spatial and temporal gates learn better spatial and temporal features through individual gates. To relieve the information loss incurred as the trapezoidal layers converge, bridge joints between layers preserve useful information. The multiple kernels in the LSTM enable the extraction of multi-scale information flows. T-LSTM-sbm is shown to outperform the state-of-the-art solutions in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on three common datasets: KTH Action, HMDB-51, and UCF-101.
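To make the trapezoid idea concrete, the following is a minimal sketch, assuming PyTorch, of only the layer-wise node-shrinking rule described in the abstract: each layer scans its nodes with a convolutional LSTM cell and then drops the two boundary nodes, so N context nodes converge toward the middle (inpainted) positions. The `ConvLSTMCell` definition, the averaging-free merge rule, and all sizes are illustrative assumptions, not the authors' implementation; the segregated gates, bridge joints, and multi-kernel designs of T-LSTM-sbm are omitted here.

```python
# Illustrative sketch only -- NOT the authors' T-LSTM-sbm implementation.
# Module names, sizes, and the boundary-dropping rule are assumptions.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A plain convolutional LSTM cell (simplified, single kernel)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=pad)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class TrapezoidLSTM(nn.Module):
    """Each layer scans its nodes with one ConvLSTM cell, then sheds the
    two boundary nodes, so N nodes shrink to N - 2 per layer until only
    the middle (inpainted) positions remain."""
    def __init__(self, channels, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ConvLSTMCell(channels) for _ in range(num_layers))

    def forward(self, nodes):            # nodes: list of (B, C, H, W) features
        for cell in self.layers:
            h = torch.zeros_like(nodes[0])
            c = torch.zeros_like(nodes[0])
            outs = []
            for x in nodes:              # scan left-to-right over the nodes
                h, c = cell(x, h, c)
                outs.append(h)
            nodes = outs[1:-1]           # trapezoid: drop one node per side
        return nodes                     # remaining middle-node features

# Usage: 7 context nodes and 3 layers leave 7 - 2*3 = 1 middle node.
feats = [torch.randn(2, 16, 32, 32) for _ in range(7)]
mid = TrapezoidLSTM(channels=16, num_layers=3)(feats)
print(len(mid), mid[0].shape)            # 1 torch.Size([2, 16, 32, 32])
```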
Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
Y.-C. Tseng’s research is co-sponsored by the Industrial Technology Research Institute (ITRI), the National Science and Technology Council, and the Ministry of Science and Technology (MOST). This work is also financially supported by the “Center for Open Intelligent Connectivity” under the “Higher Education Sprout Project” of National Yang Ming Chiao Tung University (NYCU) and the Ministry of Education (MOE), Taiwan.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chiang, TH., Lin, YT., Lin, J.CH. et al. Trapezoid-structured LSTM with segregated gates and bridge joints for video frame inpainting. Vis Comput 40, 1069–1082 (2024). https://doi.org/10.1007/s00371-023-02832-y