research-article

LiteGfm: A Lightweight Self-supervised Monocular Depth Estimation Framework for Artifacts Reduction via Guided Image Filtering

Authors:

Tianhao Gu

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 8903 - 8912

https://doi.org/10.1145/3664647.3681505

Published: 28 October 2024

Abstract

Lightweight networks for monocular depth estimation face two significant challenges: preserving detail information and reducing artifacts in the predicted depth maps. To address both, this paper proposes LiteGfm, a self-supervised monocular depth estimation framework that consists of a DepthNet with an Anti-Artifact Guided (AAG) module and a PoseNet. In the AAG module, Guided Image Filtering with cross-detail masking is first designed to filter the input features of the decoder, preserving comprehensive detail information. Second, a filter kernel generator is proposed that decomposes the Sobel operator along the vertical and horizontal axes to achieve cross-detail masking, which better captures structure and edge features and thereby minimizes artifacts. Furthermore, a boundary-aware loss between the reconstructed and input images is presented to preserve high-frequency details and further reduce artifacts. Extensive experimental results demonstrate that LiteGfm, with fewer than 1.9M parameters, outperforms state-of-the-art methods.
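The abstract mentions decomposing the Sobel operator along the vertical and horizontal axes. As background only, the sketch below illustrates the standard separability of the 3x3 Sobel kernels: each is the outer product of a 1-D smoothing filter and a 1-D difference filter, so it can be applied as two 1-D passes. This is not the paper's filter kernel generator or its cross-detail masking; the helper names (`correlate_valid`, `sobel_x_separable`) and the random test image are assumptions made for illustration.

```python
import numpy as np

# 1-D factors of the 3x3 Sobel operator (standard textbook form).
SMOOTH = np.array([1.0, 2.0, 1.0])   # smoothing perpendicular to the gradient axis
DIFF = np.array([-1.0, 0.0, 1.0])    # central difference along the gradient axis

# Horizontal-gradient kernel: smooth vertically, differentiate horizontally.
SOBEL_X = np.outer(SMOOTH, DIFF)
# Vertical-gradient kernel: differentiate vertically, smooth horizontally.
SOBEL_Y = np.outer(DIFF, SMOOTH)


def correlate_valid(image, kernel):
    """Plain 2-D cross-correlation over the 'valid' region (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def sobel_x_separable(image):
    """Apply SOBEL_X as a vertical 1-D pass followed by a horizontal 1-D pass."""
    vertical = correlate_valid(image, SMOOTH.reshape(3, 1))
    return correlate_valid(vertical, DIFF.reshape(1, 3))


# Hypothetical input: a small random image stands in for a feature map.
rng = np.random.default_rng(0)
image = rng.random((8, 10))

full = correlate_valid(image, SOBEL_X)      # one 3x3 pass
separable = sobel_x_separable(image)        # two 1-D passes
print(np.allclose(full, separable))
```

The two 1-D passes use six multiplications per output pixel instead of nine, which is one reason axis-wise decompositions are attractive in lightweight models. How LiteGfm uses the decomposed kernels for masking is described in the full paper, not here.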

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        1. Scene understanding

Information & Contributors

Information

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs:
Jianfei Cai (Monash University, Australia)
Mohan Kankanhalli (NUS, Singapore)
Balakrishnan Prabhakaran (UT Dallas, USA)
Susanne Boll (University of Oldenburg, Germany)

Program Chairs:
Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia)
Liang Zheng (Australian National University, Australia)
Vivek K. Singh (Rutgers University, USA)
Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands)
Lexing Xie (Australian National University, Australia)
Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Qingdao Postdoctoral Applied Foundation
Postdoctoral Innovation Project of Shandong Province

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
