Article

Degradation-Guided Multi-Modal Fusion Network for Depth Map Super-Resolution

by Lu Han 1,*, Xinghu Wang 1, Fuhui Zhou 1 and Diansheng Wu 2

1 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2 Wiscom System Co., Ltd., Nanjing 211100, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 4020; https://doi.org/10.3390/electronics13204020
Submission received: 3 September 2024 / Revised: 23 September 2024 / Accepted: 24 September 2024 / Published: 12 October 2024
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)

Abstract

Depth map super-resolution (DSR) is a technique aimed at restoring high-resolution (HR) depth maps from low-resolution (LR) depth maps. In this process, color images are commonly used as guidance to enhance the restoration procedure. However, the intricate degradation of LR depth poses a challenge, and previous image-guided DSR approaches, which implicitly model the degradation in the spatial domain, often fall short of producing satisfactory results. To address this challenge, we propose a novel approach called the Degradation-Guided Multi-modal Fusion Network (DMFNet). DMFNet explicitly characterizes the degradation and incorporates multi-modal fusion in both spatial and frequency domains to improve the depth quality. Specifically, we first introduce the deep degradation regularization loss function, which enables the model to learn the explicit degradation from the LR depth maps. Simultaneously, DMFNet converts the color images and depth maps into spectrum representations to provide comprehensive multi-domain guidance. Consequently, we present the multi-modal fusion block to restore the depth maps by leveraging both the RGB-D spectrum representations and the depth degradation. Extensive experiments demonstrate that DMFNet achieves state-of-the-art (SoTA) performance on four benchmarks, namely the NYU-v2, Middlebury, Lu, and RGB-D-D datasets.

1. Introduction

Depth map super-resolution (DSR) [1,2,3,4,5,6,7,8] is a fundamental technique in computer vision that aims to predict high-resolution (HR) depth maps from their low-resolution (LR) counterparts. Accurate depth perception plays a crucial role in various applications, such as 3D reconstruction [9,10,11,12,13,14,15,16], virtual reality [17,18,19,20,21,22], and autonomous driving [23,24,25,26,27,28,29,30]. In the DSR process, color images are commonly used as guidance to enhance the restoration procedure, exploiting the correlation between color and depth information. However, the intricate degradation of LR depth maps poses a significant challenge, and previous image-guided DSR approaches [3,4,6,7,18,31], which implicitly model the degradation in the spatial domain, often fall short of producing satisfactory results. For instance, the depth predictions from DKN [3] in Figure 1d and DCTNet [6] in Figure 1e appear blurry, failing to accurately reflect the true depth values as depicted in the ground truth depth map.
To address these challenges and improve the quality of depth maps, we present a novel approach called the Degradation-Guided Multi-modal Fusion Network (DMFNet). The key idea behind DMFNet is to combine explicit modeling of the degradation with comprehensive multi-modal fusion. Specifically, we first introduce the degradation learning branch as a key component. This branch enables the model to explicitly estimate the degradation kernel from the LR depth input and then uses this kernel to filter the predicted HR depth output, establishing a degradation loss that compares the degraded HR prediction with the upsampled LR depth input. The degradation learning branch captures the complex degradation patterns and provides crucial information for the subsequent restoration process. Meanwhile, we propose the multi-modal fusion block to facilitate multi-domain guidance. This block transforms color images and depth maps into spectrum representations, encompassing amplitudes and phases. It then performs a difference operation on the RGB-D amplitudes and phases, which helps transmit the high-frequency components of color images (such as boundaries) to the depth maps. Finally, the block uses the depth degradation features from the degradation learning branch to deeply fuse these RGB-D features in the spatial domain, contributing to further improvement. In short, the multi-modal fusion block offers a complementary frequency-domain perspective on the RGB-D data and strengthens the fusion of the depth features with the degradation representations.
Owing to the designs of the Deep Degradation Regularization Module (DDRM) and the Multi-modal Fusion Block (MFB), the proposed DMFNet outperforms many existing approaches. For example, in Figure 1f, DMFNet restores more accurate depth, with sharper and clearer objects that closely resemble those in the ground truth depth map.
In summary, our main contributions are as follows:
  • We propose a novel degradation-guided framework to enhance the depth recovery, where a deep degradation regularization loss is introduced to explicitly model the intricate degradation of LR depth.
  • We design a multi-modal fusion block to facilitate the multi-domain and multi-modal interaction via spectrum transform and continuous difference operation.
  • Extensive experimental results indicate that the proposed DMFNet achieves outstanding performance on four DSR benchmark datasets, reaching the state of the art.
The remainder of this paper is organized as follows. Section 2 provides an overview of related work, including DSR methods and degradation learning approaches. Section 3 presents the DMFNet in detail. Section 4 describes the experimental results and compares them with existing approaches. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Depth Map Super-Resolution

Image-guided DSR methods have gained significant attention due to the rich structure information available in RGB images [10,11,33]. These methods leverage the structure of RGB images to enhance the resolution of depth maps. For example, Shi et al. [34] introduced a symmetric uncertainty method that selects RGB information to effectively recover high-resolution (HR) depth while avoiding texture-related artifacts. Kim et al. [3] designed joint image filtering techniques to adaptively determine the neighboring pixels and their weights for each pixel. Deng et al. [35] proposed a multi-modal convolutional sparse coding approach to separate common and private features among different modalities. Similarly, Zhao et al. [6] developed a discrete cosine network that extracts both shared and specific multi-modal information using a semi-decoupled feature extraction module.
Some methods have also introduced multi-task learning frameworks to leverage complementary information. For instance, Yan et al. [5] introduced an auxiliary depth completion branch to propagate dense depth correlation into the DSR branch. Tang et al. [36] transmitted RGB information to a space close to the depth space through depth estimation, facilitating RGB-D fusion for DSR. Additionally, Sun et al. [19] employed cross-task knowledge distillation to exchange correlations between DSR and depth estimation branches. Recently, Yuan et al. [18] proposed a recursive structure attention method for gradually estimating high-frequency structures, while Yuan et al. [13] designed a structure flow-guided network to learn edge-focused guidance features for depth structure enhancement. Graph regularization [37] and anisotropic diffusion [7] have also been applied to enhance the recovery of depth structures. In contrast to these approaches that mainly focus on the spatial domain, our method pays more attention to the frequency domain by utilizing the high-frequency components of RGB to guide depth structure.

2.2. Frequency Learning

The inherent characteristics of the frequency domain have been widely recognized for their ability to represent structure, leading to the development of various related methods [38,39,40,41,42]. In the frequency domain, Zhou et al. [38,39] integrated spatial and spectrum features for multi-spectrum pan-sharpening. Jiang et al. [40] designed a focal frequency loss to narrow the frequency domain gap between real and generated images. Mao et al. [41] introduced a frequency selective network to adaptively learn kernel-level features for image deblurring. Lin et al. [42] utilized frequency-enhanced variational autoencoders to restore high-frequency components lost during image compression. Inspired by them, we employ the spectrum of RGB to fully guide depth structure in the frequency domain.

2.3. Degradation Learning

Color image super-resolution (SR) aims to enhance the resolution of low-resolution color images. Several approaches have been proposed that focus on learning degradation representations to better understand and model the degradation process [43,44,45,46,47,48,49,50]. For example, Zhang et al. [43] introduced a deep back-projection network that learns an end-to-end mapping between low-resolution and high-resolution images by iteratively refining the SR results. Zhang et al. [44] further improved the performance by incorporating an information loss network that encourages the network to preserve the structural information of the images. Ahn et al. [45] proposed a fast and lightweight SR network that utilizes a degradation network to estimate the degradation process and a reconstruction network to generate high-resolution results. Kim et al. [46] introduced a deep recursive residual network that learns the degradation process through an iterative refinement scheme. Liu et al. [47] designed a deep information-preserving network that exploits both global and local information for better restoration of high-frequency details. Tai et al. [48] presented an image super-resolution via a deep recursive residual network that learns the degradation process using stacked reconstruction units. Recently, Huang et al. [49] proposed a deep edge-guided network that leverages edge information to guide the SR process, resulting in enhanced edge details in the reconstructed images. Liu et al. [50] introduced a deep attention-aware network that incorporates attention mechanisms to selectively enhance informative image regions for improved SR performance.
These methods demonstrate the effectiveness of learning degradation representations for color image super-resolution, enabling the models to better understand the underlying degradation process and improve the quality of the reconstructed high-resolution images.

3. Method

3.1. Network Architecture

Existing mainstream DSR methods usually incorporate color images to guide the depth recovery. In addition to such guidance, we introduce a degradation learning branch that explicitly models the depth degradation from the LR depth. In the depth restoration branch, the HR depth is then progressively restored through multiple stages, guided by both the color images and the depth degradation.
Consider $I_{rgb} \in \mathbb{R}^{sh \times sw \times 3}$ representing the color image, $D_{lr} \in \mathbb{R}^{h \times w \times 1}$ denoting the LR depth map, and $D_{hr} \in \mathbb{R}^{sh \times sw \times 1}$ as the estimated HR depth map. $D_{deg} \in \mathbb{R}^{sh \times sw \times 1}$ and $D_{gt} \in \mathbb{R}^{sh \times sw \times 1}$ denote the degraded depth map and the ground truth (GT) depth map, respectively. $D_{up} \in \mathbb{R}^{sh \times sw \times 1}$ is the bicubic interpolation of $D_{lr}$ to the target resolution. Here, $h$ and $w$ correspond to the dimensions of the LR depth map, and $s$ signifies the scaling factor (such as $\times 4$, $\times 8$, or $\times 16$) for upsampling. Figure 2 shows the overall architecture of our DMFNet, which consists of a degradation learning branch (blue part) and a depth restoration branch (orange part).
In the degradation learning branch, $D_{up}$ is first fed into the residual block [43] to predict the depth degradation feature $F_{deg}^{1}$, based on which the next degradation feature $F_{deg}^{2}$ is generated via a $3 \times 3$ convolution. Then, we utilize a Multilayer Perceptron (MLP) to estimate the degradation kernel $K$. This kernel is applied to filter the HR depth map $D_{hr}$, yielding the degraded depth $D_{deg}$. Please see Equation (13) for more details about the degradation loss.
In the depth restoration branch, $I_{rgb}$ is encoded by two consecutive $3 \times 3$ convolutions, producing the feature $F_{rgb}^{1}$, and another two $3 \times 3$ convolutions then produce $F_{rgb}^{2}$. $D_{lr}$ is upsampled by bicubic interpolation and mapped by a $3 \times 3$ convolution, yielding the feature $F_{d}^{1}$. The first MFB takes $F_{d}^{1}$, $F_{rgb}^{1}$, and $F_{deg}^{1}$ as input. Next, a $3 \times 3$ convolution is employed to produce the feature $F_{d}^{2}$, and the second MFB takes $F_{d}^{2}$, $F_{rgb}^{2}$, and $F_{deg}^{2}$ as input. Finally, a $3 \times 3$ convolution is applied after the second MFB to predict the HR depth output $D_{hr}$. Please see Figure 3 for more details about the MFB.
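For readers who prefer code, the following PyTorch sketch mirrors our reading of the branch wiring described above and shown in Figure 2. It is a minimal illustration rather than the released implementation: the channel width c, the ReLU activations, and the way the DDRM and MFB sub-modules are passed in are assumptions.

```python
# Minimal wiring sketch of the two-branch layout (not the authors' code):
# the DDRM and MFB internals are supplied as sub-modules and sketched in
# Sections 3.2 and 3.3 below. Shapes: rgb (B,3,sH,sW), d_lr (B,1,H,W).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMFNetSketch(nn.Module):
    def __init__(self, ddrm, mfb1, mfb2, c=64, scale=4):
        super().__init__()
        self.scale = scale
        self.ddrm = ddrm                       # degradation learning branch (blue part)
        self.rgb_enc1 = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(c, c, 3, padding=1))   # -> F_rgb^1
        self.rgb_enc2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(c, c, 3, padding=1))   # -> F_rgb^2
        self.d_in = nn.Conv2d(1, c, 3, padding=1)                      # -> F_d^1
        self.mfb1, self.mfb2 = mfb1, mfb2
        self.d_mid = nn.Conv2d(c, c, 3, padding=1)                     # -> F_d^2
        self.d_out = nn.Conv2d(c, 1, 3, padding=1)                     # -> D_hr

    def forward(self, rgb, d_lr):
        d_up = F.interpolate(d_lr, scale_factor=self.scale, mode="bicubic")
        f_deg1, f_deg2, kernel = self.ddrm(d_up)       # degradation features and kernel K
        f_rgb1 = self.rgb_enc1(rgb)
        f_rgb2 = self.rgb_enc2(f_rgb1)
        f_d1 = self.mfb1(self.d_in(d_up), f_rgb1, f_deg1)   # first multi-modal fusion
        f_d2 = self.mfb2(self.d_mid(f_d1), f_rgb2, f_deg2)  # second multi-modal fusion
        d_hr = self.d_out(f_d2)
        return d_hr, kernel
```

Instantiating this class with the DDRM and MFB sketches given after Sections 3.2 and 3.3 yields a runnable, if simplified, end-to-end model.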

3.2. Depth Degradation Learning

The integration of degradation learning can significantly enhance the performance of super-resolution tasks. Numerous notable studies, such as [43,46,47,48,49,50,51,52], have focused on modeling the degradation of color images in the context of color image super-resolution. Building upon the insights from these research endeavors, we have incorporated degradation learning into the domain of DSR. The depth degradation can be formulated as
$$D_{lr} = (D_{gt} \otimes K)\downarrow_{s} + n, \qquad (1)$$
where $\otimes$ indicates the convolution filtering operation, $K$ is the degradation kernel, $\downarrow_{s}$ denotes the downsampling operation with scale factor $s$, and $n$ usually refers to additive white Gaussian noise. In general, due to their low resolution, LR depth maps are usually also noisy, especially for real-world data. As a result, following previous research [51,53,54], we only model the first term of the degradation process and ignore the noise term.
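As a concrete illustration of Equation (1) with the noise term dropped, the snippet below blurs a ground-truth depth map with a kernel K and then downsamples it by the scale factor s. The isotropic Gaussian kernel and its 7×7 size are illustrative assumptions, since the actual K is unknown and is what the degradation branch learns to estimate.

```python
# Hedged sketch of D_lr = (D_gt ⊗ K)↓s with an assumed Gaussian K.
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=7, sigma=1.5):
    ax = torch.arange(ksize) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

def degrade(d_gt, kernel, scale=4):
    # d_gt: (B, 1, sH, sW) ground-truth depth
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(d_gt, (pad,) * 4, mode="replicate"), kernel)  # D_gt ⊗ K
    return blurred[:, :, ::scale, ::scale]                                 # ↓s

d_lr = degrade(torch.rand(1, 1, 256, 256), gaussian_kernel(), scale=4)
print(d_lr.shape)  # torch.Size([1, 1, 64, 64])
```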
As illustrated in the blue part of Figure 2, given the input $D_{up}$, a residual block [43] $f_{res}$ is applied to produce the degradation feature $F_{deg}^{1}$. Then, we use a $3 \times 3$ convolution $f_{c3}$ to yield the second degradation feature $F_{deg}^{2}$. This step is defined as
$$F_{deg}^{1} = f_{res}(D_{up}), \quad F_{deg}^{2} = f_{c3}(F_{deg}^{1}). \qquad (2)$$
In fact, the degradation learning branch can estimate multiple degradation features
$$F_{deg}^{i} = f_{c3}(F_{deg}^{i-1}). \qquad (3)$$
Afterwards, an MLP $f_{m}$ is employed to estimate the degradation kernel, yielding
$$K = f_{m}(F_{deg}^{i}). \qquad (4)$$
Finally, given the HR depth prediction $D_{hr}$, we degrade it using the degradation kernel
$$D_{deg} = D_{hr} \otimes K. \qquad (5)$$
We calculate the error between $D_{deg}$ and $D_{up}$ in Equation (13). This supervisory signal enables explicit degradation learning.
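A hedged sketch of how the degradation learning branch could be realised is given below. Beyond what the text specifies (residual block, 3×3 convolution, MLP, and filtering of the HR depth with K), it assumes a 1-to-c channel-lifting convolution, global average pooling before the MLP, one softmax-normalised k×k kernel per image, and replicate padding; the authors' implementation may differ in these details.

```python
# DDRM sketch: degradation features F_deg^1, F_deg^2 and a per-image kernel K.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class DDRMSketch(nn.Module):
    def __init__(self, c=64, ksize=7):
        super().__init__()
        self.ksize = ksize
        self.head = nn.Conv2d(1, c, 3, padding=1)          # assumed channel lifting
        self.res = ResidualBlock(c)                        # f_res -> F_deg^1
        self.next_conv = nn.Conv2d(c, c, 3, padding=1)     # f_c3  -> F_deg^2
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                 nn.Linear(c, ksize * ksize))  # f_m -> K

    def forward(self, d_up):
        f_deg1 = self.res(self.head(d_up))
        f_deg2 = self.next_conv(f_deg1)
        pooled = f_deg2.mean(dim=(2, 3))                   # global average pooling (assumed)
        kernel = F.softmax(self.mlp(pooled), dim=-1)       # positive weights summing to 1
        return f_deg1, f_deg2, kernel.view(-1, 1, self.ksize, self.ksize)

def apply_degradation(d_hr, kernel):
    # D_deg = D_hr ⊗ K, one predicted kernel per sample (grouped-conv trick)
    b, _, h, w = d_hr.shape
    pad = kernel.shape[-1] // 2
    x = F.pad(d_hr.view(1, b, h, w), (pad,) * 4, mode="replicate")
    return F.conv2d(x, kernel, groups=b).view(b, 1, h, w)
```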

3.3. Multi-Modal Fusion Block

As shown in the orange part of Figure 2, the MFB simultaneously fuses the RGB-D features ($F_{rgb}^{i}$ and $F_{d}^{i}$) and the degradation feature ($F_{deg}^{i}$).
Specifically, the color image $I_{rgb}$ is encoded by multiple $3 \times 3$ convolutions. The process can be formulated as
$$F_{rgb}^{1} = f_{c3}(I_{rgb}), \quad F_{rgb}^{i} = f_{c3}(F_{rgb}^{i-1}). \qquad (6)$$
Similarly, the depth input $D_{lr}$ is upsampled to obtain $D_{up}$ and then mapped by a $3 \times 3$ convolution, yielding $F_{d}^{1}$. The $i$th feature $F_{d}^{i}$ is produced from $F_{d}^{i-1}$ via multiple MFBs and convolutions. We define this step as
$$F_{d}^{1} = f_{c3}(D_{up}), \quad F_{d}^{i} = f_{mfb}(f_{c3}(F_{d}^{i-1})), \qquad (7)$$
where $f_{mfb}$ refers to our MFB. Next, we elaborate on the MFB design.
The details of the MFB are illustrated in Figure 3. Firstly, to take better advantage of the high-frequency components (such as sharp boundaries) in color images, we map the RGB-D input features into the frequency domain by the Discrete Fourier Transform (DFT) $f_{dft}$, yielding their amplitudes and phases
$$[A_{d}^{i}, \varphi_{d}^{i}] = f_{dft}(F_{d}^{i}), \quad [A_{rgb}^{i}, \varphi_{rgb}^{i}] = f_{dft}(F_{rgb}^{i}). \qquad (8)$$
Because the degraded LR depth maps inherently lack sharp boundaries, we transfer the high-frequency elements from the color images into the depth maps through amplitude and phase differences, yielding
$$A_{rgb\text{-}d}^{i} = |A_{rgb}^{i} - A_{d}^{i}|, \quad \varphi_{rgb\text{-}d}^{i} = |\varphi_{rgb}^{i} - \varphi_{d}^{i}|. \qquad (9)$$
Then, we concatenate the difference results with the initial $A_{d}^{i}$ and $\varphi_{d}^{i}$ and use $3 \times 3$ convolutions to fuse them, obtaining
$$\tilde{A}_{d}^{i} = f_{c3}(f_{cat}(A_{d}^{i}, A_{rgb\text{-}d}^{i})), \quad \tilde{\varphi}_{d}^{i} = f_{c3}(f_{cat}(\varphi_{d}^{i}, \varphi_{rgb\text{-}d}^{i})), \qquad (10)$$
where $f_{cat}$ refers to concatenation. Next, we apply the Inverse Discrete Fourier Transform (IDFT) $f_{idft}$ to map the features back into the spatial domain, obtaining
$$\tilde{F}_{d}^{i} = f_{idft}(\tilde{A}_{d}^{i}, \tilde{\varphi}_{d}^{i}). \qquad (11)$$
Finally, we enable the network to adaptively adjust the features based on the degradation representation $F_{deg}^{i}$ via
$$O_{d}^{i} = \tilde{F}_{d}^{i} + \tilde{F}_{d}^{i} \cdot F_{deg}^{i}. \qquad (12)$$
That is to say, more weight is applied where the depth map is heavily degraded, and this weight is derived from the degradation representation.
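The MFB pipeline in Equations (8)-(12) can be sketched in PyTorch as follows; the channel width and the use of a single 3×3 convolution per spectral component are assumptions, and torch.fft is used as a stand-in for the f_dft/f_idft operators.

```python
# MFB sketch: DFT -> amplitude/phase difference with RGB -> conv fusion
# -> inverse DFT -> degradation-guided reweighting.
import torch
import torch.nn as nn

class MFBSketch(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.fuse_amp = nn.Conv2d(2 * c, c, 3, padding=1)
        self.fuse_pha = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, f_d, f_rgb, f_deg):
        # frequency-domain representation of both modalities, Eq. (8)
        spec_d, spec_rgb = torch.fft.fft2(f_d), torch.fft.fft2(f_rgb)
        amp_d, pha_d = spec_d.abs(), spec_d.angle()
        amp_rgb, pha_rgb = spec_rgb.abs(), spec_rgb.angle()
        # amplitude/phase differences carry the high-frequency RGB cues, Eq. (9)
        amp_diff = (amp_rgb - amp_d).abs()
        pha_diff = (pha_rgb - pha_d).abs()
        # concatenate with the depth spectrum and fuse, Eq. (10)
        amp = self.fuse_amp(torch.cat([amp_d, amp_diff], dim=1))
        pha = self.fuse_pha(torch.cat([pha_d, pha_diff], dim=1))
        # back to the spatial domain, Eq. (11)
        f_tilde = torch.fft.ifft2(torch.polar(amp, pha)).real
        # degradation-guided reweighting, Eq. (12)
        return f_tilde + f_tilde * f_deg
```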

3.4. Loss Function

In the degradation learning branch, to supervise the explicit modeling of depth degradation, we calculate the error between the degraded depth $D_{deg}$ and the bicubic depth $D_{up}$ via
$$\mathcal{L}_{deg} = \sum_{q \in Q} \left\| D_{up}^{q} - D_{deg}^{q} \right\|_{1}, \qquad (13)$$
where $Q$ is the set of valid pixels of $D_{gt}$ and $q$ represents one pixel of the set. $\|\cdot\|_{1}$ denotes the $L_{1}$ norm.
In the depth restoration branch, we minimize the $L_{1}$ loss between the predicted depth $D_{hr}$ and the GT depth $D_{gt}$, yielding
$$\mathcal{L}_{hr} = \sum_{q \in Q} \left\| D_{gt}^{q} - D_{hr}^{q} \right\|_{1}. \qquad (14)$$
As a result, the total loss function can be written as
$$\mathcal{L}_{t} = \mathcal{L}_{hr} + \alpha \mathcal{L}_{deg}, \qquad (15)$$
where $\alpha$ (empirically set to 0.1) is a coefficient that adjusts the weight of the degradation term.
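A compact version of the total loss in Equations (13)-(15) is shown below. It averages over the valid pixels instead of summing, which only rescales the gradients, and `valid` is an assumed boolean mask marking pixels with ground-truth depth.

```python
# Loss sketch: L1 restoration term plus the degradation regularisation,
# weighted by alpha = 0.1 as stated above.
import torch

def dmfnet_loss(d_hr, d_gt, d_deg, d_up, valid, alpha=0.1):
    l_hr = (d_gt - d_hr).abs()[valid].mean()     # L_hr, Eq. (14)
    l_deg = (d_up - d_deg).abs()[valid].mean()   # L_deg, Eq. (13)
    return l_hr + alpha * l_deg                  # L_t, Eq. (15)
```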

4. Experiment

4.1. Dataset

We conducted a comprehensive evaluation of our experiments using a variety of datasets, including the synthetic NYU-v2 [32], Middlebury [55,56], and Lu [57] datasets, as well as the real-world RGB-D-D [4] dataset. The NYU-v2 dataset comprises synthetic RGB-D pairs, with 1000 pairs in the training set and 449 pairs in the test set. Additionally, we tested our pre-trained model on the Middlebury dataset, which includes 30 pairs, the Lu dataset with 6 pairs, and the RGB-D-D dataset containing 405 pairs.
In synthetic scenarios, the LR depth input is generated by applying bicubic down-sampling to the HR depth ground truth. This approach allows us to simulate the degradation process and evaluate the performance of our model under controlled conditions.
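For reference, the synthetic LR inputs described above can be produced with a single bicubic down-sampling call; the tensor layout and the ×16 example below are illustrative assumptions.

```python
# Generating a synthetic LR depth input by bicubic down-sampling of the HR GT.
import torch
import torch.nn.functional as F

def make_lr(d_gt, scale=16):
    # d_gt: (B, 1, sH, sW) -> (B, 1, H, W)
    return F.interpolate(d_gt, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)

d_lr = make_lr(torch.rand(1, 1, 256, 256), scale=16)
print(d_lr.shape)  # torch.Size([1, 1, 16, 16])
```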
To assess the generalization capability of our method in real-world environments, we applied our DMFNet to the RGB-D-D dataset. This dataset consists of 2215 RGB-D pairs for training and 405 pairs for testing, all captured using the ToF camera of the Huawei P30 Pro. By leveraging this real-world dataset, we aim to demonstrate the robustness and applicability of our model in practical scenarios.

4.2. Metric and Implementation Detail

To evaluate the performance of our model, we utilize the root mean square error (RMSE) in centimeters as the evaluation metric, consistent with the approaches used in previous studies [3,6]. During the training phase, we randomly crop the RGB images and HR depth maps to a size of 256 × 256 pixels.
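A minimal RMSE-in-centimetres helper consistent with this protocol is sketched below; it assumes depth stored in metres and a boolean validity mask, both of which are conventions rather than details stated in the paper.

```python
# RMSE (cm) over valid pixels, assuming depth maps in metres.
import torch

def rmse_cm(pred_m, gt_m, valid):
    diff_cm = (pred_m - gt_m)[valid] * 100.0
    return torch.sqrt((diff_cm ** 2).mean())
```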
Our DMFNet model is trained using the Adam optimizer [58], starting with an initial learning rate of $1 \times 10^{-4}$. The training process is conducted on a single TITAN RTX GPU. We set the hyper-parameters as follows: $\lambda_{1} = \lambda_{2} = 0.5$, $\gamma_{1} = 0.001$, and $\gamma_{2} = 0.002$.
By adhering to these settings, we aim to ensure a robust and efficient training process, enabling our model to achieve high performance in both synthetic and real-world scenarios.

4.3. Comparison with SoTA Methods

In this section, we compare the proposed DMFNet with the existing state-of-the-art approaches under ×4, ×8, and ×16 DSR settings, including DJF [2], DJFR [31], PAC [59], CUNet [35], DSRNet [60], DKN [3], FDKN [3], FDSR [4], DAGF [61], GraphSR [37], DCTNet [6], SUFT [34], SSDNet [62], SGNet [8], and SPFNet [63].
Quantitative Comparison. Table 1 and Table 2 demonstrate that our DMFNet surpasses previous methods on both the synthetic and real-world datasets. In Table 1, DMFNet delivers superior results across the four datasets at the ×4, ×8, and ×16 scales. Notably, compared to the second-best FDSR [4], DMFNet reduces the RMSE by 0.98 cm on NYU-v2 and 0.87 cm on Lu at ×16. Table 2 presents the outcomes on the real-world RGB-D-D dataset. Following the methodology of established techniques such as FDSR [4], DCTNet [6], and SGNet [8], we first apply a model pre-trained on NYU-v2 to real-world RGB-D-D depth prediction without fine-tuning, shown in the first row of Table 2. After retraining and testing on the real-world dataset, as shown in the second row, DMFNet still outperforms its counterparts; for instance, it reduces the RMSE by 0.5 cm compared to the second-best SPFNet [63]. These findings underscore the efficacy of our approach in improving DSR performance. Moreover, the degradation patterns of real-world scenes are usually unknown and unconventional, so DMFNet's strong performance on the real-world RGB-D-D dataset demonstrates its generalization ability and its capacity to model degradation.
Visual Comparison. Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 showcase the visual comparative analysis on both the synthetic and real-world datasets, highlighting the prowess of our depth prediction method. For example, the visual results on the four synthetic datasets indicate that the proposed DMFNet excels in edge precision and error reduction. Notably, in Figure 5, our method distinctly outlines the person’s arm edges, surpassing alternative techniques with minimized errors. Contrary to synthetic data, real-world low-resolution depth often suffers from significant structural distortions. As evidenced in Figure 8, our approach outshines competitors in accurately recovering edges. Specifically, the hand and clothes edges predicted by DMFNet align remarkably well with true depth data. Overall, these visual validations underscore the exceptional efficacy of our method in high-resolution depth restoration, emphasizing its practical relevance and superior performance.
Joint DSR and Denoising. Table 3 clearly illustrates the superior performance of our method in the realm of joint DSR and denoising across the NYU-v2 and Middlebury datasets, outshining competing methodologies. The inherent noise present in depth data gathered from real-world settings presents a significant obstacle to achieving high-quality HR depth restoration. In alignment with prevailing techniques such as those introduced by DKN [3] and DAGF [61], we introduce Gaussian noise with a variance of 25 to the LR depth during input processing. As demonstrated in Table 3, our DMFNet significantly outperforms the next best alternative, reducing the RMSE by 0.35 cm (×8) and 0.58 cm (×16) on the NYU-v2 dataset, and by 0.18 cm (×8) and 0.38 cm (×16) on the Middlebury dataset. Moreover, Figure 9 visually presents the outcomes on the NYU-v2 dataset, showcasing the remarkable accuracy of the depth edge predictions achieved by DMFNet. Notably, the delineation of the lamp and bed ladder exemplifies the clarity and precision that sets our approach apart from its counterparts. These compelling results solidify the position of DMFNet as a method that not only excels in performance but also showcases robust generalization capabilities.

4.4. Ablation Study

Table 4 reports the ablation study of each key component in DMFNet, including the MFB and degradation learning technique. DMFNet-i is the baseline that only contains the depth restoration branch, where the MFB is replaced by the addition operation and the loss is L 1 only. The results presented in Table 4 clearly demonstrate the benefits of incorporating both the MFB and degradation learning into our DMFNet. For example, compared to the baseline DMFNet-i, the MFB reduces the RMSE by 0.16 cm and 0.11 cm on the NYU-v2 and Lu, respectively. When propagating the degradation representation into the MFB, the errors are further reduced, demonstrating its significant effectiveness. Finally, the combination of the MFB with degradation representation and degradation loss contributes 0.38 cm and 0.32 cm improvements in RMSE in total. The visual results presented in Figure 10 further highlight the advantages of our approach. Both the MFB and degradation learning enable the model to generate depth features with more distinctive edges than the baseline. Moreover, when these two parts are used in combination, DMFNet produces a much clearer scene structure, suggesting that the integration of these modules can effectively enhance depth restoration.

5. Conclusions

In this study, we introduced DMFNet as a novel approach for improving DSR performance. By explicitly modeling degradation and implementing comprehensive multi-modal fusion, DMFNet offers significant advancements in the restoration of high-resolution depth maps from low-resolution inputs. Our framework incorporated the deep degradation regularization loss function to capture complex degradation patterns and enhance the restoration process by explicitly estimating and utilizing degradation kernels. Additionally, the multi-modal fusion block facilitates multi-domain guidance by transforming color images and depth maps into spectrum representations, enabling effective fusion of RGB-D and degradation information. Through extensive experiments on benchmark datasets such as NYU-v2, Middlebury, Lu, and RGB-D-D, we demonstrated that DMFNet surpasses existing approaches and achieves state-of-the-art performance in depth map super-resolution tasks. The results showcase DMFNet’s ability to restore sharper and clearer depth maps, closely resembling ground truth data and outperforming competitors.
In summary, our contributions include the proposal of a degradation-guided framework that explicitly models LR depth degradation, the design of a multi-modal fusion block for enhanced multi-domain interactions, and the achievement of outstanding performance in DSR, establishing DMFNet as a leading solution in the field.

Author Contributions

Conceptualization, L.H.; methodology, L.H. and X.W.; software, L.H. and D.W.; validation, L.H. and D.W.; formal analysis, L.H.; investigation, L.H.; resources, L.H.; data curation, L.H.; writing—original draft preparation, L.H.; writing—review and editing, L.H., X.W., F.Z. and D.W.; visualization, L.H.; supervision, X.W. and F.Z.; project administration, L.H.; funding acquisition, L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to privacy.

Acknowledgments

The authors express their gratitude to the anonymous reviewers and the editor.

Conflicts of Interest

Author Diansheng Wu was employed by Wiscom System Co., Ltd., Nanjing, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hui, T.W.; Loy, C.C.; Tang, X. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 353–369. [Google Scholar]
  2. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep joint image filtering. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 154–169. [Google Scholar]
  3. Kim, B.; Ponce, J.; Ham, B. Deformable kernel networks for joint image filtering. Int. J. Comput. Vis. 2021, 129, 579–600. [Google Scholar] [CrossRef]
  4. He, L.; Zhu, H.; Li, F.; Bai, H.; Cong, R.; Zhang, C.; Lin, C.; Liu, M.; Zhao, Y. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 9229–9238. [Google Scholar]
  5. Yan, Z.; Wang, K.; Li, X.; Zhang, Z.; Li, G.; Li, J.; Yang, J. Learning complementary correlations for depth super-resolution with incomplete data in real world. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5616–5626. [Google Scholar] [CrossRef] [PubMed]
  6. Zhao, Z.; Zhang, J.; Xu, S.; Lin, Z.; Pfister, H. Discrete cosine transform network for guided depth map super-resolution. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 5697–5707. [Google Scholar]
  7. Metzger, N.; Daudt, R.C.; Schindler, K. Guided Depth Super-Resolution by Deep Anisotropic Diffusion. In Proceedings of the CVPR, Vancouver, BC, Canada, 17–24 June 2023; pp. 18237–18246. [Google Scholar]
  8. Wang, Z.; Yan, Z.; Yang, J. Sgnet: Structure guided network via gradient-frequency awareness for depth map super-resolution. In Proceedings of the AAAI, Vancouver, BC, Canada, 20–27 February 2024; pp. 5823–5831. [Google Scholar]
  9. Im, S.; Ha, H.; Choe, G.; Jeon, H.G.; Joo, K.; Kweon, I.S. Accurate 3d reconstruction from small motion clip for rolling shutter cameras. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 775–787. [Google Scholar] [CrossRef] [PubMed]
  10. Song, X.; Dai, Y.; Zhou, D.; Liu, L.; Li, W.; Li, H.; Yang, R. Channel attention based iterative residual learning for depth map super-resolution. In Proceedings of the CVPR, Seattle, WA, USA, 14–19 June 2020; pp. 5631–5640. [Google Scholar]
  11. Yang, Y.; Cao, Q.; Zhang, J.; Tao, D. CODON: On orchestrating cross-domain attentions for depth super-resolution. Int. J. Comput. Vis. 2022, 130, 267–284. [Google Scholar] [CrossRef]
  12. Yan, Z.; Li, X.; Wang, K.; Zhang, Z.; Li, J.; Yang, J. Multi-modal masked pre-training for monocular panoramic depth completion. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 378–395. [Google Scholar]
  13. Yuan, J.; Jiang, H.; Li, X.; Qian, J.; Li, J.; Yang, J. Structure Flow-Guided Network for Real Depth Super-Resolution. arXiv 2023, arXiv:2301.13416. [Google Scholar] [CrossRef]
  14. Yan, Z.; Wang, K.; Li, X.; Zhang, Z.; Li, J.; Yang, J. Desnet: Decomposed scale-consistent network for unsupervised depth completion. In Proceedings of the AAAI, Singapore, 17–19 July 2023; pp. 3109–3117. [Google Scholar]
  15. Yan, Z.; Li, X.; Wang, K.; Chen, S.; Li, J.; Yang, J. Distortion and uncertainty aware loss for panoramic depth completion. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 39099–39109. [Google Scholar]
  16. Siemonsma, S.; Bell, T. N-DEPTH: Neural Depth Encoding for Compression-Resilient 3D Video Streaming. Electronics 2024, 13, 2557. [Google Scholar] [CrossRef]
  17. Li, L.; Li, X.; Yang, S.; Ding, S.; Jolfaei, A.; Zheng, X. Unsupervised-learning-based continuous depth and motion estimation with monocular endoscopy for virtual reality minimally invasive surgery. IEEE Trans. Ind. Inform. 2020, 17, 3920–3928. [Google Scholar] [CrossRef]
  18. Yuan, J.; Jiang, H.; Li, X.; Qian, J.; Li, J.; Yang, J. Recurrent Structure Attention Guidance for Depth Super-Resolution. arXiv 2023, arXiv:2301.13419. [Google Scholar] [CrossRef]
  19. Sun, B.; Ye, X.; Li, B.; Li, H.; Wang, Z.; Xu, R. Learning scene structure guidance via cross-task knowledge transfer for single depth super-resolution. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021; pp. 7792–7801. [Google Scholar]
  20. Min, H.; Cao, J.; Zhou, T.; Meng, Q. IPSA: A Multi-View Perception Model for Information Propagation in Online Social Networks. Big Data Min. Anal. 2024. [Google Scholar]
  21. Zhou, M.; Yan, K.; Pan, J.; Ren, W.; Xie, Q.; Cao, X. Memory-augmented deep unfolding network for guided image super-resolution. Int. J. Comput. Vis. 2023, 131, 215–242. [Google Scholar] [CrossRef]
  22. Qin, S.; Xiao, J.; Ge, J. Dip-NeRF: Depth-Based Anti-Aliased Neural Radiance Fields. Electronics 2024, 13, 1527. [Google Scholar] [CrossRef]
  23. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the CVPR, Seattle, WA, USA, 14–19 June 2020; pp. 11621–11631. [Google Scholar]
  24. Yan, Z.; Wang, K.; Li, X.; Zhang, Z.; Li, J.; Yang, J. RigNet: Repetitive image guided network for depth completion. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 214–230. [Google Scholar]
  25. Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the CVPR, Vancouver, BC, Canada, 17–24 June 2023; pp. 18537–18546. [Google Scholar]
  26. Yan, X.; Xu, S.; Zhang, Y.; Li, B. ELEvent: An Abnormal Event Detection System in Elevator Cars. In Proceedings of the CSCWD, Tianjin, China, 8–10 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 675–680. [Google Scholar]
  27. Liu, Z.; Wang, Q. Edge-Enhanced Dual-Stream Perception Network for Monocular Depth Estimation. Electronics 2024, 13, 1652. [Google Scholar] [CrossRef]
  28. Zeng, J.; Zhu, Q. NSVDNet: Normalized Spatial-Variant Diffusion Network for Robust Image-Guided Depth Completion. Electronics 2024, 13, 2418. [Google Scholar] [CrossRef]
  29. Tang, J.; Tian, F.P.; An, B.; Li, J.; Tan, P. Bilateral Propagation Network for Depth Completion. In Proceedings of the CVPR, Seattle WA, USA, 17–21 June 2024; pp. 9763–9772. [Google Scholar]
  30. Yan, Z.; Lin, Y.; Wang, K.; Zheng, Y.; Wang, Y.; Zhang, Z.; Li, J.; Yang, J. Tri-Perspective View Decomposition for Geometry-Aware Depth Completion. In Proceedings of the CVPR, Seattle WA, USA, 17–21 June 2024; pp. 4874–4884. [Google Scholar]
  31. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1909–1923. [Google Scholar] [CrossRef] [PubMed]
  32. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the ECCV, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760. [Google Scholar]
  33. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Chen, Z.; Ji, X. High-resolution depth maps imaging via attention-based hierarchical multi-modal fusion. IEEE Trans. Image Process. 2021, 31, 648–663. [Google Scholar] [CrossRef]
  34. Shi, W.; Ye, M.; Du, B. Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution. In Proceedings of the ACMMM, Lisbon, Portugal, 10–14 October 2022; pp. 3867–3876. [Google Scholar]
  35. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3333–3348. [Google Scholar] [CrossRef] [PubMed]
  36. Tang, Q.; Cong, R.; Sheng, R.; He, L.; Zhang, D.; Zhao, Y.; Kwong, S. Bridgenet: A joint learning network of depth map super-resolution and monocular depth estimation. In Proceedings of the ACM MM, Virtual Event, 20–24 October 2021; pp. 2148–2157. [Google Scholar]
  37. De Lutio, R.; Becker, A.; D’Aronco, S.; Russo, S.; Wegner, J.D.; Schindler, K. Learning graph regularisation for guided super-resolution. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 1979–1988. [Google Scholar]
  38. Zhou, M.; Huang, J.; Li, C.; Yu, H.; Yan, K.; Zheng, N.; Zhao, F. Adaptively learning low-high frequency information integration for pan-sharpening. In Proceedings of the ACM MM, Lisboa, Portugal, 10–14 October 2022; pp. 3375–3384. [Google Scholar]
  39. Zhou, M.; Huang, J.; Yan, K.; Yu, H.; Fu, X.; Liu, A.; Wei, X.; Zhao, F. Spatial-frequency domain information integration for pan-sharpening. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022; pp. 274–291. [Google Scholar]
  40. Jiang, L.; Dai, B.; Wu, W.; Loy, C.C. Focal frequency loss for image reconstruction and synthesis. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 13919–13929. [Google Scholar]
  41. Mao, X.; Liu, Y.; Liu, F.; Li, Q.; Shen, W.; Wang, Y. Intriguing findings of frequency selection for image deblurring. In Proceedings of the AAAI, Singapore, 17–19 July 2023; pp. 1905–1913. [Google Scholar]
  42. Lin, X.; Li, Y.; Hsiao, J.; Ho, C.; Kong, Y. Catch Missing Details: Image Reconstruction with Frequency Augmented Variational Autoencoder. In Proceedings of the CVPR, Vancouver, BC, Canada, 17–24 June 2023; pp. 1736–1745. [Google Scholar]
  43. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  44. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 2472–2481. [Google Scholar]
  45. Ahn, N.; Kang, B.; So Kweon, I. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar]
  46. Kim, J.; Kwon Lee, J.; Mu Lee, K. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 1637–1645. [Google Scholar]
  47. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y. Deep information-preserving network for image super-resolution. IEEE Trans. Image Process. 2020, 29, 1031–1042. [Google Scholar]
  48. Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 2790–2798. [Google Scholar]
  49. Huang, Y.; He, R.; Sun, Z.; Tan, T. Deep edge-guided network for single image super-resolution. In Proceedings of the CVPR, Seattle, WA, USA, 14–19 June 2020; pp. 1378–1387. [Google Scholar]
  50. Liu, X.; Liu, W.; Li, M.; Zeng, N.; Liu, Y. Deep attention-aware network for color image super-resolution. In Proceedings of the ICIP, Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 464–468. [Google Scholar]
  51. Wang, L.; Wang, Y.; Dong, X.; Xu, Q.; Yang, J.; An, W.; Guo, Y. Unsupervised degradation representation learning for blind super-resolution. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 10581–10590. [Google Scholar]
  52. Liang, J.; Zeng, H.; Zhang, L. Efficient and degradation-adaptive network for real-world image super-resolution. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 574–591. [Google Scholar]
  53. Zhang, K.; Zuo, W.; Zhang, L. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3262–3271. [Google Scholar]
  54. Gu, J.; Lu, H.; Zuo, W.; Dong, C. Blind super-resolution with iterative kernel correction. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 1604–1613. [Google Scholar]
  55. Hirschmuller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. In Proceedings of the CVPR, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. [Google Scholar]
  56. Scharstein, D.; Pal, C. Learning conditional random fields for stereo. In Proceedings of the CVPR, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. [Google Scholar]
  57. Lu, S.; Ren, X.; Liu, F. Depth enhancement via low-rank matrix completion. In Proceedings of the CVPR, Columbus, OH, USA, 23–28 June 2014; pp. 3390–3397. [Google Scholar]
  58. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  59. Su, H.; Jampani, V.; Sun, D.; Gallo, O.; Learned-Miller, E.; Kautz, J. Pixel-adaptive convolutional neural networks. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 11166–11175. [Google Scholar]
  60. Guo, C.; Li, C.; Guo, J.; Cong, R.; Fu, H.; Han, P. Hierarchical features driven residual learning for depth map super-resolution. IEEE Trans. Image Process. 2018, 28, 2545–2557. [Google Scholar] [CrossRef]
  61. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Ji, X. Deep attentional guided image filtering. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 12236–12250. [Google Scholar] [CrossRef]
  62. Zhao, Z.; Zhang, J.; Gu, X.; Tan, C.; Xu, S.; Zhang, Y.; Timofte, R.; Van Gool, L. Spherical space feature decomposition for guided depth map super-resolution. arXiv 2023, arXiv:2303.08942. [Google Scholar]
  63. Wang, Z.; Yan, Z.; Yang, M.H.; Pan, J.; Yang, J.; Tai, Y.; Gao, G. Scene Prior Filtering for Depth Map Super-Resolution. arXiv 2024, arXiv:2402.13876. [Google Scholar]
Figure 1. Visual comparison example on the NYU-v2 dataset [32]. (a) Color image input, (b) low-resolution depth input, (c) ground truth (GT) depth, (d) DKN [3], (e) DCTNet [6], and (f) our proposed DMFNet. The visualization and error comparison demonstrate the superior performance of our DMFNet in restoring clear and accurate depth results.
Figure 2. An overview of the proposed DMFNet, which consists of the degradation learning branch and the depth restoration branch. The former employs the Deep Degradation Regularization Module (DDRM) to gradually learn explicit degradation from the LR depth, while the latter restores fine-grained depth via the Multi-modal Fusion Block (MFB) and the degradation constraint.
Figure 3. Scheme of the proposed multi-modal fusion block (MFB).
Figure 4. Visual results on the synthetic NYU-v2 dataset (×16).
Figure 5. Visual results on the synthetic RGB-D-D dataset (×16).
Figure 6. Visual results on the synthetic Lu dataset (×16).
Figure 7. Visual results on the synthetic Middlebury dataset (×16).
Figure 8. Visual results on the real-world RGB-D-D dataset.
Figure 9. Denoising visual results on the synthetic NYU-v2 dataset.
Figure 10. Visual comparison of the intermediate depth features on the RGB-D-D dataset (×4).
Table 1. Quantitative evaluation with state-of-the-art approaches on the four synthetic datasets (RMSE in cm). The best and second best results are marked in bold and blue, respectively. #P refers to parameters.

| Method | NYU-v2 (×4/×8/×16) | RGB-D-D (×4/×8/×16) | Lu (×4/×8/×16) | Middlebury (×4/×8/×16) | #P (M) |
| DJF [2] | 2.80 / 5.33 / 9.46 | 3.41 / 5.57 / 8.15 | 1.65 / 3.96 / 6.75 | 1.68 / 3.24 / 5.62 | 0.08 |
| DJFR [31] | 2.38 / 4.94 / 9.18 | 3.35 / 5.57 / 7.99 | 1.15 / 3.57 / 6.77 | 1.32 / 3.19 / 5.57 | 0.08 |
| PAC [59] | 1.89 / 3.33 / 6.78 | 1.25 / 1.98 / 3.49 | 1.20 / 2.33 / 5.19 | 1.32 / 2.62 / 4.58 | - |
| CUNet [35] | 1.92 / 3.70 / 6.78 | 1.18 / 1.95 / 3.45 | 0.91 / 2.23 / 4.99 | 1.10 / 2.17 / 4.33 | 0.21 |
| DSRNet [60] | 3.00 / 5.16 / 8.41 | - / - / - | 1.77 / 3.10 / 5.11 | 1.77 / 3.05 / 4.96 | 45.49 |
| DKN [3] | 1.62 / 3.26 / 6.51 | 1.30 / 1.96 / 3.42 | 0.96 / 2.16 / 5.11 | 1.23 / 2.12 / 4.24 | 1.16 |
| FDKN [3] | 1.86 / 3.58 / 6.96 | 1.18 / 1.91 / 3.41 | 0.82 / 2.10 / 5.05 | 1.08 / 2.17 / 4.50 | 0.69 |
| FDSR [4] | 1.61 / 3.18 / 5.86 | 1.16 / 1.82 / 3.06 | 1.29 / 2.19 / 5.00 | 1.13 / 2.08 / 4.39 | 0.60 |
| DAGF [61] | 1.36 / 2.87 / 6.06 | - / - / - | 0.83 / 1.93 / 4.80 | 1.15 / 1.80 / 3.70 | 2.44 |
| GraphSR [37] | 1.79 / 3.17 / 6.02 | 1.30 / 1.83 / 3.12 | 0.92 / 2.05 / 5.15 | 1.11 / 2.12 / 4.43 | 32.53 |
| DMFNet | 1.17 / 2.43 / 4.88 | 1.16 / 1.75 / 2.62 | 0.91 / 1.77 / 3.93 | 1.07 / 1.74 / 3.15 | 0.64 |
Table 2. Quantitative evaluation on the real-world RGB-D-D dataset (RMSE in cm). The bold indicates the best result while the blue refers to the second best result.

| Train | FDSR [4] | DCTNet [6] | SUFT [34] | SSDNet [62] | SGNet [8] | SPFNet [63] | DMFNet |
| NYU-v2 | 7.50 | 7.37 | 7.22 | 7.32 | 7.22 | 7.23 | 7.15 |
| RGB-D-D | 5.49 | 5.43 | 5.41 | 5.38 | 5.32 | 4.63 | 4.13 |
Table 3. Comparison of joint DSR and denoising on the NYU-v2 and Middlebury datasets (RMSE in cm). The bold indicates the best result while the blue refers to the second best result.

| Scale | DJF [2] | DJFR [31] | DSRNet [60] | PAC [59] | DKN [3] | DAGF [61] | SPFNet [63] | DMFNet |
NYU-v2
| ×4 | 3.74 | 4.01 | 4.36 | 4.23 | 3.39 | 3.25 | 3.45 | 2.92 |
| ×8 | 5.95 | 6.21 | 6.31 | 6.24 | 5.24 | 5.01 | 5.15 | 4.63 |
| ×16 | 9.61 | 9.90 | 9.75 | 9.54 | 8.41 | 7.54 | 7.94 | 7.12 |
Middlebury
| ×4 | 1.80 | 1.86 | 1.84 | 1.81 | 1.76 | 1.72 | 1.67 | 1.64 |
| ×8 | 2.99 | 3.07 | 2.99 | 2.94 | 2.68 | 2.61 | 2.61 | 2.48 |
| ×16 | 5.16 | 5.27 | 4.70 | 5.08 | 4.55 | 4.24 | 4.24 | 3.90 |
Table 4. Ablation study of DMFNet on the NYU-v2 and Lu datasets (×16, RMSE in cm).

| DMFNet | MFB | Degradation Representation | Degradation Loss $\mathcal{L}_{deg}$ | NYU-v2 | Lu |
| i | | | | 5.26 (±0.00) | 4.25 (±0.00) |
| ii | ✓ | | | 5.10 (−0.16) | 4.14 (−0.11) |
| iii | ✓ | ✓ | | 5.04 (−0.22) | 4.07 (−0.18) |
| iv | ✓ | ✓ | ✓ | 4.88 (−0.38) | 3.93 (−0.32) |
