Abstract
This work considers the video frame inpainting problem, in which several preceding and following frames are given and the goal is to predict the middle frames. The state-of-the-art solution applies bidirectional long short-term memory (LSTM) networks, which suffer from a spatial-temporal mismatch problem. In this paper, we propose a trapezoid-structured LSTM architecture, called T-LSTM-sbm, for video frame inpainting with three designs: (i) segregated spatial-temporal gates, (ii) bridge joints, and (iii) multi-kernel LSTM. To prevent the spatial-temporal mismatch problem, the trapezoid structure reduces the number of LSTM nodes by two after each layer as features pass through the multi-layered LSTM nodes, which makes the model converge to the inpainted results more effectively. The segregated spatial and temporal gates learn better spatial and temporal features through individual gates. To relieve the information loss incurred as the trapezoidal layers converge, bridge joints between layers preserve useful information. The multiple kernels in the LSTM enable the extraction of multi-scale information flows. T-LSTM-sbm is shown to outperform the state-of-the-art solutions in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on three common datasets: KTH Action, HMDB-51, and UCF-101.
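To make the trapezoid idea concrete, the following is a minimal sketch, assuming PyTorch, of only the layer-wise node-shrinking rule described in the abstract: each layer scans its nodes with a convolutional LSTM cell and then drops the two boundary nodes, so N context nodes converge toward the middle (inpainted) positions. The `ConvLSTMCell` definition, the averaging-free merge rule, and all sizes are illustrative assumptions, not the authors' implementation; the segregated gates, bridge joints, and multi-kernel designs of T-LSTM-sbm are omitted here.

```python
# Illustrative sketch only -- NOT the authors' T-LSTM-sbm implementation.
# Module names, sizes, and the boundary-dropping rule are assumptions.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A plain convolutional LSTM cell (simplified, single kernel)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=pad)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class TrapezoidLSTM(nn.Module):
    """Each layer scans its nodes with one ConvLSTM cell, then sheds the
    two boundary nodes, so N nodes shrink to N - 2 per layer until only
    the middle (inpainted) positions remain."""
    def __init__(self, channels, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ConvLSTMCell(channels) for _ in range(num_layers))

    def forward(self, nodes):            # nodes: list of (B, C, H, W) features
        for cell in self.layers:
            h = torch.zeros_like(nodes[0])
            c = torch.zeros_like(nodes[0])
            outs = []
            for x in nodes:              # scan left-to-right over the nodes
                h, c = cell(x, h, c)
                outs.append(h)
            nodes = outs[1:-1]           # trapezoid: drop one node per side
        return nodes                     # remaining middle-node features

# Usage: 7 context nodes and 3 layers leave 7 - 2*3 = 1 middle node.
feats = [torch.randn(2, 16, 32, 32) for _ in range(7)]
mid = TrapezoidLSTM(channels=16, num_layers=3)(feats)
print(len(mid), mid[0].shape)            # 1 torch.Size([2, 16, 32, 32])
```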
Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
Y.-C. Tseng’s research is co-sponsored by the Industrial Technology Research Institute (ITRI), the National Science and Technology Council, and the Ministry of Science and Technology (MOST). This work is also financially supported by the “Center for Open Intelligent Connectivity” under the “Higher Education Sprout Project” of National Yang Ming Chiao Tung University (NYCU) and the Ministry of Education (MOE), Taiwan.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chiang, TH., Lin, YT., Lin, J.CH. et al. Trapezoid-structured LSTM with segregated gates and bridge joints for video frame inpainting. Vis Comput 40, 1069–1082 (2024). https://doi.org/10.1007/s00371-023-02832-y