
Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation

Published: 23 September 2024

Abstract

Predicting depth maps from monocular images has achieved impressive performance in recent years. However, most depth estimation methods are trained with paired image-depth data or multi-view images (e.g., stereo pairs and monocular sequences), which suffer from expensive annotation costs and poor transferability. Although unsupervised domain adaptation methods have been introduced to mitigate the reliance on annotated data, few works focus on unsupervised cross-scenario indoor monocular depth estimation. In this article, we study the generalization of depth estimation models across different indoor scenarios in an adversarial domain adaptation paradigm. Concretely, a domain discriminator is designed to distinguish representations from the source and target domains, while the feature extractor aims to confuse the domain discriminator by capturing domain-invariant features. Further, we reconstruct depth maps from the latent representations under the supervision of labeled source data. As a result, the features learned by the feature extractor are both domain-invariant and incur low source risk, and the depth estimator can handle the domain shift between the source and target domains. We conduct cross-scenario and cross-dataset experiments on the ScanNet and NYU-Depth-v2 datasets to verify the effectiveness of our method, achieving impressive performance.
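To make the training paradigm described above concrete, the following is a minimal PyTorch-style sketch of adversarial feature-level adaptation for depth estimation: a discriminator learns to separate source and target features, while the extractor and depth decoder are trained with supervised source depth plus an adversarial term. The toy architectures, the binary cross-entropy adversarial loss, the L1 depth loss, and the weight `lam` are illustrative assumptions for exposition, not the authors' implementation.

```python
# A minimal sketch of adversarial feature-level domain adaptation for
# indoor monocular depth estimation, following the paradigm in the
# abstract. Architectures, losses, and hyperparameters are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """Toy shared encoder; maps images to latent features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class DepthDecoder(nn.Module):
    """Reconstructs a depth map from latent features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

class DomainDiscriminator(nn.Module):
    """Classifies latent features as source (1) or target (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, z):
        return self.net(z)

extractor, decoder, disc = FeatureExtractor(), DepthDecoder(), DomainDiscriminator()
opt_task = torch.optim.Adam(
    list(extractor.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(src_img, src_depth, tgt_img, lam=0.1):
    # Step 1: train the discriminator to separate source and target features.
    with torch.no_grad():  # do not backprop into the extractor here
        z_src, z_tgt = extractor(src_img), extractor(tgt_img)
    d_loss = (bce(disc(z_src), torch.ones(src_img.size(0), 1)) +
              bce(disc(z_tgt), torch.zeros(tgt_img.size(0), 1)))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Step 2: train extractor + decoder. The supervised depth loss on labeled
    # source data keeps the source risk low, while the adversarial term pushes
    # target features toward the source distribution (domain invariance).
    z_src, z_tgt = extractor(src_img), extractor(tgt_img)
    depth_loss = F.l1_loss(decoder(z_src), src_depth)
    adv_loss = bce(disc(z_tgt), torch.ones(tgt_img.size(0), 1))
    task_loss = depth_loss + lam * adv_loss
    opt_task.zero_grad(); task_loss.backward(); opt_task.step()
    return d_loss.item(), task_loss.item()
```

With hypothetical batches src_img of shape (B, 3, H, W), src_depth of shape (B, 1, H, W), and unlabeled tgt_img of the same image shape (H and W divisible by 4), calling train_step alternates the two updates each iteration; once the discriminator can no longer tell the domains apart, the decoder trained under source supervision should transfer to the target scenario.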



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 9
September 2024, 780 pages
EISSN: 1551-6865
DOI: 10.1145/3613681
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 23 September 2024
Online AM: 13 June 2024
Accepted: 25 May 2024
Revised: 18 May 2024
Received: 22 October 2023
Published in TOMM Volume 20, Issue 9


Author Tags

  1. Indoor depth estimation
  2. unsupervised learning
  3. transfer learning
  4. domain adaptation

Qualifiers

  • Research-article

Funding Sources

  • National Key R & D Program of China
  • Guangdong Basic and Applied Basic Research Foundation
  • “Pioneer” and “Leading Goose” R & D Program of Zhejiang
  • Zhejiang Province Key Research and Development Program of China

Article Metrics

  • Downloads (last 12 months): 226
  • Downloads (last 6 weeks): 27
Reflects downloads up to 12 Nov 2024



Cited By

  • (2024) Subjective and Objective Quality-of-Experience Assessment for 3D Talking Heads. In Proceedings of the 32nd ACM International Conference on Multimedia, 6033–6042. DOI: 10.1145/3664647.3680964. Online publication date: 28 Oct 2024.
  • (2024) MV-BART: Multi-view BART for Multi-modal Sarcasm Detection. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3602–3611. DOI: 10.1145/3627673.3679570. Online publication date: 21 Oct 2024.
