DOI: 10.1145/3664647.3681505

LiteGfm: A Lightweight Self-supervised Monocular Depth Estimation Framework for Artifacts Reduction via Guided Image Filtering

Published: 28 October 2024


Lightweight networks for monocular depth estimation face two significant challenges: preserving detail information and reducing artifacts in the predicted depth maps. To address both, this paper proposes a self-supervised monocular depth estimation framework called LiteGfm. It consists of a DepthNet with an Anti-Artifact Guided (AAG) module and a PoseNet. In the AAG module, guided image filtering with cross-detail masking is first designed to filter the input features of the decoder, preserving comprehensive detail information. Second, a filter kernel generator is proposed that decomposes the Sobel operator along the vertical and horizontal axes to achieve cross-detail masking, which better captures structure and edge features and thereby minimizes artifacts. Furthermore, a boundary-aware loss between the reconstructed and input images is presented to preserve high-frequency details and further reduce artifacts. Extensive experimental results demonstrate that LiteGfm, with under 1.9M parameters, achieves better performance than state-of-the-art methods.
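
As background for the Sobel decomposition mentioned in the abstract, the following minimal sketch shows the standard fact that each 3x3 Sobel kernel factors into a 1-D smoothing tap and a 1-D difference tap, so it can be applied as two passes along the vertical and horizontal axes. This is not the paper's filter kernel generator or its cross-detail masking, whose details are not given here; the function names (`correlate2d_valid`, `sobel_x_separable`, `sobel_y_separable`) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Standard 3x3 Sobel kernels (textbook definition, not taken from the paper).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

# 1-D factors: a smoothing tap and a central-difference tap.
SMOOTH = np.array([1.0, 2.0, 1.0])
DIFF = np.array([-1.0, 0.0, 1.0])


def correlate2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, written out for clarity."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def sobel_x_separable(image):
    """Horizontal gradient: vertical smoothing pass, then horizontal difference pass."""
    smoothed = correlate2d_valid(image, SMOOTH[:, None])  # 3x1 column kernel
    return correlate2d_valid(smoothed, DIFF[None, :])     # 1x3 row kernel


def sobel_y_separable(image):
    """Vertical gradient: horizontal smoothing pass, then vertical difference pass."""
    smoothed = correlate2d_valid(image, SMOOTH[None, :])  # 1x3 row kernel
    return correlate2d_valid(smoothed, DIFF[:, None])     # 3x1 column kernel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((8, 10))
    # The two-pass result matches the full 3x3 kernel.
    print(np.allclose(sobel_x_separable(img), correlate2d_valid(img, SOBEL_X)))
    print(np.allclose(sobel_y_separable(img), correlate2d_valid(img, SOBEL_Y)))
```

The two 1-D passes produce the same output as the full 3x3 kernel while exposing the vertical and horizontal components separately, which is the property a per-axis decomposition relies on.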




      Information & Contributors


      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].



      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. guided image filter
      2. lightweight network
      3. monocular depth estimation


      • Research-article



      MM '24
      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

      MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


      Bibliometrics & Citations


      Article Metrics

      • Total Citations: 0
      • Total Downloads: 110
      • Downloads (Last 12 months): 110
      • Downloads (Last 6 weeks): 58
      Reflects downloads up to 27 Feb 2025
