DOI: 10.1145/3503161.3547978

Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

Published: 10 October 2022

Abstract

Temporal consistency is the key challenge in video depth estimation. Previous works rely on additional optical flow or camera poses, which are time-consuming to compute. By contrast, we derive consistency from less information. Since videos inherently contain heavy temporal redundancy, a missing frame can be recovered from its neighbors. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer that predicts the depth of masked frames from their neighboring frames. By reconstructing masked temporal features, the FMNet learns intrinsic inter-frame correlations, which leads to consistency. Experimental results demonstrate that, compared with prior art, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation.
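
To make the idea concrete, below is a minimal PyTorch-style sketch of masked frames modeling as described in the abstract. It is not the authors' FMNet implementation: the pooled per-frame feature tokens, layer sizes, and 40% mask ratio are illustrative assumptions, and the linear depth head stands in for a full spatial decoder.

```python
# Sketch of masked frames modeling: hide some frames' features and force the
# network to predict them (and their depth) from neighboring frames.
# Hyperparameters and tokenization are illustrative assumptions only.
import torch
import torch.nn as nn

class MaskedFrameDepthSketch(nn.Module):
    def __init__(self, feat_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable token substituted for the features of masked frames.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        # Temporal transformer over per-frame feature tokens; the actual
        # FMNet is a spatial-temporal transformer over spatial feature maps.
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.depth_head = nn.Linear(feat_dim, 1)  # stand-in for a depth decoder

    def forward(self, frame_feats, mask):
        # frame_feats: (B, T, C) pooled per-frame features from any backbone.
        # mask:        (B, T) boolean, True where a frame's features are hidden.
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(frame_feats),
                             frame_feats)
        tokens = self.transformer(tokens)
        # Depth for a masked frame must be inferred from its neighbors,
        # which is what encourages temporally consistent predictions.
        return self.depth_head(tokens)

# Toy usage: mask roughly 40% of the frames in each training clip
# (the ratio is a guess, not the paper's setting).
feats = torch.randn(2, 8, 256)                 # B=2 clips, T=8 frames, C=256
mask = torch.rand(2, 8) < 0.4
depth = MaskedFrameDepthSketch()(feats, mask)  # -> (2, 8, 1)
```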

Supplementary Material

MP4 File (MM22-fp945.mp4)
In this video, we introduce our work, Less is More: Consistent Video Depth Estimation with Masked Frames Modeling. The paper was accepted at ACM Multimedia 2022.



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN: 9781450392037
    DOI: 10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022


    Author Tags

    1. depth estimation
    2. masked frames modeling
    3. temporal consistency

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (last 12 months): 121
    • Downloads (last 6 weeks): 8
    Reflects downloads up to 14 Nov 2024


    Cited By

    • (2024) ViTA: Video Transformer Adaptor for Robust Video Depth Estimation. IEEE Transactions on Multimedia, 26, 3302-3316. DOI: 10.1109/TMM.2023.3309559. Online publication date: 1-Jan-2024.
    • (2024) FutureDepth: Learning to Predict the Future Improves Video Depth Estimation. Computer Vision – ECCV 2024, 440-458. DOI: 10.1007/978-3-031-72652-1_26. Online publication date: 30-Oct-2024.
    • (2023) Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection. Proceedings of the 31st ACM International Conference on Multimedia, 2381-2390. DOI: 10.1145/3581783.3611945. Online publication date: 26-Oct-2023.
    • (2023) Diffusion-Augmented Depth Prediction with Sparse Annotations. Proceedings of the 31st ACM International Conference on Multimedia, 2865-2876. DOI: 10.1145/3581783.3611807. Online publication date: 26-Oct-2023.
    • (2023) Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 17615-17624. DOI: 10.1109/ICCV51070.2023.01619. Online publication date: 1-Oct-2023.
    • (2023) Neural Video Depth Stabilizer. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 9432-9442. DOI: 10.1109/ICCV51070.2023.00868. Online publication date: 1-Oct-2023.
    • (2023) MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 8720-8730. DOI: 10.1109/ICCV51070.2023.00804. Online publication date: 1-Oct-2023.
    • (2023) Temporally Consistent Online Depth Estimation Using Point-Based Fusion. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9119-9129. DOI: 10.1109/CVPR52729.2023.00880. Online publication date: Jun-2023.
    • (2023) 3D Cinemagraphy from a Single Image. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4595-4605. DOI: 10.1109/CVPR52729.2023.00446. Online publication date: Jun-2023.
