1 Introduction

Depth information is pivotal in many applications, from digital entertainment to virtual and augmented reality [21]. It is the backbone for digital object and environment modeling [8, 42] and cost-effective motion capture solutions [18].

Pose estimation derived from depth data finds utility in diverse fields such as physiotherapy [5, 17], video surveillance [34, 63], and human–computer interaction [46]. Depth data also aids autonomous navigation [15] and enhances security measures through facial recognition [43].

Consumer depth devices, often employing low-cost LiDAR, structured light, or time-of-flight technologies, are instrumental in these applications. Among these, the Microsoft Kinect v2 stands out for its balance of quality, availability, and affordability. However, consumer-grade sensors like the Kinect v2 still grapple with noisy and incomplete data.

Efforts to address these quality issues span traditional smoothing techniques to data-driven machine learning algorithms. Many adopt supervised learning with neural networks, training models on noisy-clean data pairs \(({\hat{x}}, y)\) to minimize empirical risk.

However, acquiring clean training data is non-trivial. Recent attention has thus shifted towards self-supervised techniques, such as Noise2Noise [27], which leverages noisy-noisy data pairs \(({\hat{x}}, {\hat{y}})\) for training, minimizing the cost function \(\theta ^* = {\textrm{argmin}}_\theta \sum _{i=1}^N {{{\mathcal {L}}}} \left( f_\theta \left( {\hat{x}}_i \right) , {\hat{y}}_i \right)\), where the network \(f_\theta\) is parameterized by \(\theta\).

Despite their efficacy in various domains, self-supervised methods for depth data restoration remain underexplored, largely due to the intricate noise patterns in consumer-grade sensors.

Our paper introduces SelfReDepth (SReD), a novel self-supervised, real-time depth data restoration technique optimized for the Kinect v2. SelfReDepth employs a convolutional autoencoder architecture inspired by U-Net, specifically designed to process sequential depth frames efficiently. This design choice directly addresses the need to maintain temporal coherence in dynamic scenes, a gap often left unaddressed by traditional single-frame denoising approaches. Furthermore, SelfReDepth incorporates RGB data into the depth restoration process, enhancing the accuracy of inpainting missing pixels by providing contextual color information that depth data alone lacks. Our contributions are fourfold: (1) we employ a convolutional autoencoder with an architecture akin to U-Net [47] to process sequential frames; (2) our method achieves real-time performance and temporal coherence by adopting a video-centric approach; (3) we incorporate RGB data to guide an inpainting algorithm during training, enhancing the model’s ability to complete missing depth pixels; (4) our approach maintains a 30 fps real-time rate while outperforming state-of-the-art techniques.

2 Background and related work

In recent years, depth-sensing technology has emerged as a pivotal tool in various applications, from gaming to augmented reality and robotics. The promise of capturing the third dimension, depth, has opened up new horizons in computer vision, augmented reality, and human–computer interaction. Next, we introduce some concepts and methodologies related to the present work.


Denoising vs. inpainting: The distinction between denoising and inpainting must be stressed, as both terms are used throughout this work and constitute important stages of the proposed methodology. Denoising and inpainting are two core image processing problems. As the name suggests, denoising removes noise from an observed noisy image, while inpainting aims to estimate missing image pixels. Both are inverse problems: the common goal is to infer an underlying image from incomplete/imperfect observations. Formally, in both problems the observed image \(\textbf{Y}\in {\mathbb {R}}^{M'\times N'}\) is modeled as \(\textbf{Y}= {{{\mathcal {F}}}}(\textbf{X}) + \eta\), where \(\textbf{X}\in {\mathbb {R}}^{M\times N}\) is the unknown (original) image and \(\eta\) is the observed noise. The difference between denoising and inpainting emerges from the mapping \({{{\mathcal {F}}}}: {\mathbb {R}}^{M\times N} \mapsto {\mathbb {R}}^{M'\times N'}\), a linear degradation operator that could represent a convolution or a masking process. Concretely, in denoising \({{{\mathcal {F}}}}\) is an identity projector, i.e., \({{{\mathcal {F}}}}=\textbf{H}\) with \(\textbf{H}=\textbf{I}\) (the identity matrix). In inpainting, \({{{\mathcal {F}}}}\) is a selection operator: in practice, \(\textbf{H}\) contains only a subset of the rows of \(\textbf{I}\), accounting for the loss of pixels.
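
To make the distinction concrete, the following minimal NumPy sketch (ours, for illustration only) instantiates both degradation operators on a toy depth image: the identity plus noise for denoising, and a pixel-selection mask for inpainting. Sizes and noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 4.0, size=(64, 64))      # unknown "clean" depth image X (metres)
eta = rng.normal(0.0, 0.05, size=X.shape)     # observation noise eta

# Denoising: F is the identity operator (H = I), so Y = X + eta
Y_denoise = X + eta

# Inpainting: F is a selection operator; a boolean mask plays the role of keeping
# only a subset of the rows of I, i.e. dropping some pixels entirely.
mask = rng.uniform(size=X.shape) > 0.1        # True where the pixel was observed
Y_inpaint = np.where(mask, X + eta, 0.0)      # missing pixels appear as depth holes (0)

print("fraction of observed pixels:", mask.mean())
```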

Having formally defined the two concepts, we stress that both are used in restoration problems, as proposed in this work. It follows that this paper offers two major contributions to restoration, concretely: (i) a new denoising method. In contrast to Noise2Noise [27], which applies denoising to traditional images, we extend the framework to depth images, which requires a new learning strategy to handle depth information (see top branch in Fig. 2 and Sect. 3.2); and (ii) a new inpainting approach, in which we integrate a two-stage pipeline comprising an RGB-Depth registration stage and a fast marching method stage (see bottom branch in Fig. 2).

Fig. 1 Multi-frame-to-frame (MF2F) [13] architecture with distinct strategies for the training and inference steps


Problems with low-quality depth: Despite all the progress made in depth-sensing hardware for consumer devices, depth cameras (such as the Kinect v2) still suffer from many of the same problems as their predecessors, namely noisy measurements and depth holes [38].

These depth holes, typical of time-of-flight (ToF) devices, have multiple causes [20, 50], including: (1) measuring regions that are outside the distance range of the sensor, (2) highly reflective objects in the scene, (3) measurements near the edges of the camera’s field-of-view (FoV). Smaller holes, on the other hand, can appear in one of two forms: (1) isolated points caused by physical and lighting interferences on the sensor, (2) thin outlines around objects due to the scattering of infrared rays at shallow angles and sharp edges.

Besides missing depth values, the measurement inaccuracy in the successfully captured points is a concerning issue, leading to noisy depth maps. The noise produced by the Kinect v2 is significantly less severe than in the previous generation; nevertheless, it is still very much present. Experimental analysis [26, 54, 61] has shown a direct correlation between the noise observed in Kinect v2’s depth maps and various physical factors, including distance, angle, material, color, and warm-up time, to name a few. Furthermore, there is a general consensus [1, 12, 25] that this noise can be described as the sum of two different sources: (1) random noise, associated with pixel-based local distortions caused by physical factors like color and others mentioned above, (2) systematic bias, associated with the wiggling error that radially increases as measurements get closer to the edges of the sensor’s FoV.


Deterministic denoisers, or manual denoising algorithms that do not rely on machine learning, were the first noise reduction techniques developed for depth data [14]. These can generally be divided into three main categories: (i) filter-based denoisers, (ii) outlier removal techniques, and (iii) calibration methods.

Filter-based denoisers work by applying smoothing and sharpening filters, such as bilateral filters [37, 62], joint bilateral filters [9, 10], anisotropic filters [35] and zero block filters [35], to leverage spatial pixel neighborhoods through sliding pixel windows (i.e. kernels). The preservation of edge sharpness is particularly difficult to achieve using filters; thus, some denoisers introduce specialized techniques, such as RGB-D alignment [10] and contextual image partitioning [9].

Other works, instead, focus exclusively on removing incorrect or low-quality depth points rather than correcting them. This can be seen, for instance, in Ref. [12] for cleaning hand depth scans and in Ref. [57] for cleaning body scans, later merged to form a complete body point cloud. Finally, some works tackle the denoising problem from a calibration standpoint, focusing on alleviating systematic errors affecting consumer-grade sensors by fitting planes or splines to the raw measurements [25, 26] or generating specialized noise correction maps [1].


Self-supervised denoisers: Noise2Noise [27] pioneered self-supervised image denoising, showing that a denoising model trained with only noisy data can achieve quality results on par with supervised learning strategies. The shift in the learning method from supervision to self-supervision resides primarily in the training data. Specifically, Noise2Noise [27] uses input-target pairs of the form \(\left( {\hat{x}},{\hat{y}}\right) =\left( x+n_1,x+n_2\right)\), where x is the base signal (the undamaged data that we want to uncover) and \(n_1\) and \(n_2\) are two independent noise instances following the same statistical distribution. From the above, the Noise2Noise strategy differs from the noise-clean data pairs \(\left( {\hat{x}},y\right)\) used in supervised learning and has the advantage of not requiring clean target images.
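
As a minimal illustration of this training scheme (not the original Noise2Noise code), the sketch below builds a noisy pair \(\left( x+n_1,x+n_2\right)\) from the same underlying signal and performs a single gradient step on a placeholder TensorFlow model with an \(L_1\) loss; the tiny network, Gaussian noise, and batch shapes are assumptions made only for the example.

```python
import numpy as np
import tensorflow as tf

def make_noisy_pair(x, sigma=0.05):
    """Return two independent noisy realisations (x + n1, x + n2) of the same signal x."""
    n1 = np.random.normal(0.0, sigma, x.shape).astype("float32")
    n2 = np.random.normal(0.0, sigma, x.shape).astype("float32")
    return x + n1, x + n2

# Placeholder denoiser: a tiny fully convolutional network standing in for the real model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(None, None, 1)),
    tf.keras.layers.Conv2D(1, 3, padding="same"),
])
optimizer = tf.keras.optimizers.Adam(1e-4)

x = np.random.uniform(0.0, 1.0, (4, 64, 64, 1)).astype("float32")  # stand-in clean batch
x_hat, y_hat = make_noisy_pair(x)

with tf.GradientTape() as tape:
    pred = model(x_hat, training=True)
    loss = tf.reduce_mean(tf.abs(pred - y_hat))   # L1 loss between prediction and noisy target
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```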

Nonetheless, Noise2Noise [27] has some data limitations, encouraging subsequent works to propose further improvements. To address this challenge, Ref. [4] proposes two data permutation techniques to increase the number of noisy training pairs. On the other hand, Noise2Void [23] eliminates the need for quasi-similar input-target pairs (not always easy to obtain) by training the model to predict a central pixel using noisy-void training pairs \(\left( {\hat{x}},-\right)\) and a blind-spot mask to avoid learning the identity. Noise2Self [2] later expanded on this by proposing a more generalized model.

Going further, some works, namely Probabilistic Noise2Void [24], SURE [39], Noisier2Noise [41] and NoiseBreaker [28], managed to improve denoising for specific noise distributions. Meanwhile, GAN2GAN [6] incorporates a generative model, Self2Self [45] introduces training with a single data sample, and GainTuning [40] proposes an ever-adapting model.


Spatio-temporal denoisers are an extension of image denoising in which temporal coherence is also considered to provide visual continuity in the final denoised videos. Likewise, clean data may not be easy to obtain for this task, and thus blind training is of great interest. A simple multi-frame self-supervised strategy for denoising can be extrapolated from Noise2Stack [44], in which a self-supervised approach is proposed to denoise MRI data using adjacent slices from a stack of layered MRI brain scans.

Self-supervised denoising techniques targeting color videos have also been developed, using the multi-frame input concept described in Noise2Stack [44] combined with additional self-supervision techniques. Multi-Frame-to-Frame (MF2F) [13] (see Fig. 1) takes the FastDVDNet [52] supervised video denoising network, composed of cascaded U-Net [47] autoencoders, and applies its own self-supervised loss.

Similarly, UDVD [48] uses a cascaded structure akin to FastDVDNet [52] but performs the network pass 4-fold, each time with the input frames at a different rotation (\(0^{\circ }\), \(90^{\circ }\), \(180^{\circ }\) and \(270^{\circ }\)). The four outputs generated, one for each rotation, are then rotated back to 0\(^{\circ }\) and combined to form the final output.


Depth completion: Alongside inaccurate data points, low-quality depth maps also suffer from missing or invalid data. Depth completion, also known as hole-filling, is a well-known and vastly researched area that falls under the umbrella of image inpainting [32, 59, 60].

When Noise2Noise and similar algorithms are applied to images with missing data without any prior inpainting, the majority of depth holes remain untreated in the final images; even methods that operate on multiple consecutive frames usually lack sufficient data to fill these gaps.

As with depth denoising, depth completion has been approached using both traditional and deep neural methodologies. Traditional techniques typically rely on either filtering algorithms, which classify the holes and apply dedicated filters, such as PDJB, DJBF, and FCRN, or boundary-extending algorithms, based on FMM or the Navier–Stokes equation [3].

The fast marching method (FMM) inpainting, in particular, was originally proposed for color image inpainting [53] and works by progressively shrinking the boundaries of hole regions inwards, until all pixels have been filled, using the equation

$$\begin{aligned} I(p) = \dfrac{\sum \nolimits _{q \in N(p)} {w(p, q) \cdot \left[ I(q) + \nabla I(q) \cdot (p-q)\right] }}{\sum \nolimits _{q \in N(p)} {w(p, q)}} \end{aligned}$$
(1)

where p is the pixel being inpainted, N(p) is the neighborhood of known pixels around p, w(p, q) is a function that determines how much pixel q contributes to the inpainting of p, and I and \(\nabla I\) represent the image and its discrete gradient, respectively. This algorithm was extended to depth completion, introducing improvements such as the use of aligned color as a guiding factor for the weight function and to define the order of computations [33], and the use of a pixel-wise confidence factor [30].
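
For illustration, Eq. (1) translates into the per-pixel estimate below (a simplified sketch of ours: the neighborhood is a fixed 3×3 window, and the fast marching front propagation and boundary bookkeeping are omitted); the weight function w(p, q) is left as a parameter.

```python
import numpy as np

def inpaint_pixel(p, known, image, grad_y, grad_x, weight_fn):
    """Estimate the image value at pixel p from its known neighbours, following Eq. (1).

    p         : (row, col) of the pixel being inpainted
    known     : boolean mask of pixels with valid values
    image     : 2D array with the current (partially filled) values
    grad_y/x  : discrete gradients of the image, e.g. from np.gradient(image)
    weight_fn : callable w(p, q) giving the contribution of neighbour q
    """
    num, den = 0.0, 0.0
    rows, cols = image.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            q = (p[0] + dy, p[1] + dx)
            if q == p or not (0 <= q[0] < rows and 0 <= q[1] < cols) or not known[q]:
                continue
            w = weight_fn(p, q)
            # First-order estimate of I(p), extrapolated from q along the vector (p - q)
            estimate = image[q] + grad_y[q] * (p[0] - q[0]) + grad_x[q] * (p[1] - q[1])
            num += w * estimate
            den += w
    return num / den if den > 0 else 0.0
```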

Fig. 2 Full overview of SelfReDepth’s architecture

Sparse depth maps, generally captured with LiDAR sensors, suffer especially from large patches of missing depth data and have particular time limitations, as they are commonly linked with autonomous driving. Thus, more advanced techniques have been developed, relying on both supervised [29] and self-supervised [11, 16, 36] deep convolutional neural networks assisted by color information to fill large depth gaps.

3 Our approach

Building upon recent advancements in self-supervised data denoising research, the proposed SReD offers a novel approach for denoising and inpainting low-quality depth maps. It leverages the flexibility and adaptability of deep learning models while eliminating the need for reference data - a highly desirable feature also found in deterministic denoisers. SReD was designed with a specific practical use case in mind, incorporating several additional requirements during its design and development. Specifically, our technique aims to: (i) denoise and restore as much of the initial depth maps as possible, (ii) operate with a single RGB-D device, (iii) facilitate direct sensor data streaming, and (iv) strive for temporal coherence and real-time performance.

Naturally, these requirements posed challenges that influenced the architectural decisions. For example, to achieve depth video denoising with temporal coherence, it is logical to design a method that utilizes multiple sequential frames, similar to MF2F [13]. However, given real-time constraints, only frames up to the most recent one are considered. This contrasts with MF2F, which incorporates two subsequent frames at the cost of adding considerable lag. Additionally, an architecture with faster inference is preferable over a more complex one to meet real-time performance criteria.

3.1 SelfReDepth’s architecture

SReD’s architecture, particularly its neural network and learning method, takes inspiration from previous self-supervised denoisers, mainly Noise2Noise [27], MF2F [13] and Noise2Stack [44], adapting their proposed denoising models to the specific case of online denoising in depth map sequences. As depicted in Fig. 2, the architecture differs between its training and inference stages. The model uses a dilated input during the training stage, as proposed in MF2F [13]. The autoencoder is trained with noisy pairs \(({\hat{x}}, {\hat{y}}) = ([d_{t-4},d_{t-2},d_{t}],d_{t-1})\), where \(d_{t-k}\) is the depth frame at time instant \(t-k\), with \(k\in \{0,1,2,4\}\). This technique improves the denoising results during inference and prevents the network from learning the identity by hiding frame \(d_{t-1}\) from the input. Since \(d_{t}\) and \(d_{t-1}\) are consecutive time frames, it is also plausible to assume they are similar in content while having different instances of noise, making them a suitable image pair for the Noise2Noise-style training.
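
A minimal sketch (ours) of how such dilated training samples, and the corresponding non-dilated inference inputs, can be assembled from a sequence of depth frames is shown below; the frame container and channel-stacking order are assumptions for illustration only.

```python
import numpy as np

def make_training_pair(frames, t):
    """Dilated Noise2Noise-style sample: input [d_{t-4}, d_{t-2}, d_t], hidden target d_{t-1}."""
    x_hat = np.stack([frames[t - 4], frames[t - 2], frames[t]], axis=-1)
    y_hat = frames[t - 1]
    return x_hat, y_hat

def make_inference_input(frames, t):
    """At inference the input window is non-dilated: [d_{t-2}, d_{t-1}, d_t]."""
    return np.stack([frames[t - 2], frames[t - 1], frames[t]], axis=-1)

# Dummy sequence of ten 424x512 depth frames (the Kinect v2 depth resolution)
frames = [np.zeros((424, 512), dtype=np.float32) for _ in range(10)]
x_hat, y_hat = make_training_pair(frames, t=6)
print(x_hat.shape, y_hat.shape)   # (424, 512, 3) (424, 512)
```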

Moreover, noisy depth frames frequently have regions persistently composed of depth holes in both the input and target frames, making these regions impossible to “denoise” using a standalone denoising network. As such, during training, the target frame \(d_{t-1}\) is inpainted with an FMM inpainting algorithm guided by the registered color frame \(\textrm{RGB}_{t-1}\), providing a way for the denoiser to learn how to fill depth-holes.

In summary, SReD’s training architecture has two distinct main blocks, depicted in Fig. 2a: (i) a denoising convolutional autoencoder with dilated input; and (ii) a target generation pipeline responsible for creating inpainted targets. During inference (Fig. 2b), the target generation is removed, contributing to faster performance and the denoiser shifts to non-dilated input (i.e., taking the frames \([d_{t-2},d_{t-1},d_{t}]\)), estimating a denoised/inpainted instance of the frame \(d_{t}\).

3.2 Target generation

The denoiser requires a learning strategy to handle depth holes. To achieve this, SReD generates target frames through deterministic inpainting. This deterministic approach consists of two stages: (i) computing a registered RGB image and (ii) using the previous result to apply guided inpainting to the damaged depth frame. The selection of this strategy rests on three primary reasons: (1) The prevalence of depth holes in general consumer depth data is sufficiently low for a deterministic approach to yield acceptable inpainting results. (2) It avoids the need for reference data. (3) From a temporal performance perspective, it only introduces computational time during training.


RGB-D registration: RGB-D devices collect color and depth with physically separate sensors/cameras, often with different resolutions and FoVs. Therefore, aligning the simultaneously captured RGB and depth frames must be done with a registration algorithm, which requires the extrinsic and intrinsic parameters of the device, namely:

  • Focal length \(f_d = \begin{bmatrix} f_{d,x}&f_{d,y} \end{bmatrix}^\top\) and principal point \(c_d = \begin{bmatrix} c_{d,x}&c_{d,y} \end{bmatrix}^\top\) of the depth/IR sensor,

  • Focal length \(f_{rgb} = \begin{bmatrix} f_{rgb,x}&f_{rgb,y} \end{bmatrix}^\top\) and principal point \(c_{rgb} = \begin{bmatrix} c_{rgb,x}&c_{rgb,y} \end{bmatrix}^\top\) of the RGB sensor,

  • Rotation matrix R, which encodes the rotation from the RGB sensor view to the depth/IR sensor,

  • Translation vector T, which translates the RGB sensor’s position to the depth/IR sensor’s position.

Following [64], RGB-D registration is performed through a series of coordinate transformations that map depth values captured from the depth sensor’s point-of-view to color values in the RGB camera’s point-of-view. To achieve this, the depth data, given as a depth map, is first converted to a point format, where each pixel coordinate \(\begin{bmatrix} x_d&y_d \end{bmatrix}^\top\) becomes a 3D point \(X_d = \begin{bmatrix} x_d&y_d&z_d \end{bmatrix}^\top\) with \(z_d = \textrm{depth}(x_d, y_d)\), and then transformed from the Depth Image Coordinate Space to the RGB Image Coordinate Space, \(X_{rgb}\), using the following equalities:

$$\begin{aligned} X'_d= & {} \begin{bmatrix} x'_d \\ y'_d \\ z'_d \end{bmatrix} = \begin{bmatrix} \dfrac{ (x_d - c_{d,x}) \cdot z_d }{ f_{d,x} } \\ \dfrac{ (y_d - c_{d,y}) \cdot z_d }{ f_{d,y} } \\ z_d \end{bmatrix}\end{aligned}$$
(2)
$$\begin{aligned} X'_{rgb}= & {} R^{-1} \cdot (X'_d - T) \end{aligned}$$
(3)
$$\begin{aligned} X_{rgb}= & {} \begin{bmatrix} x_{rgb} \\ y_{rgb} \\ z_{rgb} \end{bmatrix} = \begin{bmatrix} \dfrac{ x'_{rgb} \cdot f_{rgb,x} }{ z'_{rgb} } + c_{rgb,x} \\ \dfrac{ y'_{rgb} \cdot f_{rgb,y} }{ z'_{rgb} } + c_{rgb,y} \\ z'_{rgb} \end{bmatrix} \end{aligned}$$
(4)

Points \(X_{rgb}\) in (4) can then be mapped to a 2D image of size \(W_{rgb} \times H_{rgb}\), forming a registered depth image. This process summarizes the standard registration algorithm. However, target generation requires registered RGB images, so the pixel correspondences produced by Eqs. (2)–(4) are used in the reverse direction, building a 2D \(W_{d} \times H_{d}\) image of color values instead. Of course, doing this still leaves depth holes with no RGB value attributed, undermining part of the purpose of performing RGB-D registration. To overcome this, depth holes are filled with pixel interpolation and blurring to create smooth transitions between the edges of known depth regions.
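
A vectorized sketch of Eqs. (2)–(4) is given below (ours, for illustration); the calibration parameters are placeholders, and both the resampling of color values into the depth grid and the hole interpolation described above are omitted.

```python
import numpy as np

def register_depth_to_rgb(depth, f_d, c_d, f_rgb, c_rgb, R, T):
    """Map valid depth pixels to RGB image coordinates following Eqs. (2)-(4).

    depth          : (H, W) depth map from the IR sensor (zeros mark holes)
    f_d, c_d       : (2,) focal length and principal point of the depth/IR sensor
    f_rgb, c_rgb   : (2,) focal length and principal point of the RGB sensor
    R, T           : (3, 3) rotation and (3,) translation between the two sensors
    Returns the (x_rgb, y_rgb, z_rgb) coordinates of every valid depth pixel.
    """
    H, W = depth.shape
    y_d, x_d = np.mgrid[0:H, 0:W]
    valid = depth > 0
    z_d, x_d, y_d = depth[valid], x_d[valid], y_d[valid]

    # Eq. (2): back-project depth pixels to 3D points in the depth sensor's frame
    X_d = np.stack([(x_d - c_d[0]) * z_d / f_d[0],
                    (y_d - c_d[1]) * z_d / f_d[1],
                    z_d], axis=0)

    # Eq. (3): change of reference frame to the RGB sensor
    X_rgb = np.linalg.inv(R) @ (X_d - T[:, None])

    # Eq. (4): project onto the RGB image plane
    x_rgb = X_rgb[0] * f_rgb[0] / X_rgb[2] + c_rgb[0]
    y_rgb = X_rgb[1] * f_rgb[1] / X_rgb[2] + c_rgb[1]
    return x_rgb, y_rgb, X_rgb[2]
```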


Inpainting: After completing the RGB-D registration, we employ a color-guided fast marching method (FMM) inpainting algorithm to generate the target frames for training the denoising network. Following the original FMM inpainting technique [53], the algorithm starts by delineating the boundaries of all hole regions within the image. Subsequently, it performs inpainting from the outer pixels of these boundaries inwards, ensuring that all hole regions are filled.

Our FMM inpainting technique combines ideas presented in Refs. [30, 33, 53] and introduces novel elements that enable better results in consumer depth maps. Specifically, the pixel weighting function (see Eq. (5)) is different from the original FMM inpainting [53]. Concretely, we prioritize the distance factor \(w_{dst}\) while dropping the factors \(w_{lev}\) and \(w_{dir}\). Additionally, we include two novel weights: \(w_g\), relating to color guidance [33]; and conf, a confidence factor as in Ref. [30]. All these new insights contribute to the following novel functions:

$$\begin{aligned} w(p,q)= & {} w_{dst}^2(p,q) \cdot w_g(p,q) \cdot conf(q)\end{aligned}$$
(5)
$$\begin{aligned} w_{dst}(p,q)= & {} \dfrac{d_0^2}{\Vert p - q \Vert ^2} \end{aligned}$$
(6)
$$\begin{aligned} w_g(p,q)= & {} \exp \left( - \dfrac{\Vert G(p) - G(q) \Vert ^2}{2 \cdot \sigma _g^2} \right) \end{aligned}$$
(7)
$$\begin{aligned} \textrm{conf}(q)= & {} \dfrac{1}{1 + 2 \cdot T_\textrm{out}(q)} \end{aligned}$$
(8)

where \(d_0\) is the minimum inter-pixel distance, usually 1, G denotes the guiding image, and \(\sigma _g\) is its standard deviation. Additionally, T is a distance map that stores the distance of each pixel to the closest initial hole patch boundary, and \(T_\textrm{out}\) is a function that zeroes pixels in the set of initial holes \(\Omega\) and assigns T to the remaining ones.
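
As a direct transcription of Eqs. (5)–(8) (our sketch; the guiding image, \(\sigma _g\), and the distance map \(T_\textrm{out}\) are assumed to be precomputed), the weights can be implemented as:

```python
import numpy as np

def w_dst(p, q, d0=1.0):
    """Eq. (6): inverse squared distance between pixels p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return d0 ** 2 / np.sum((p - q) ** 2)

def w_g(p, q, guide, sigma_g):
    """Eq. (7): similarity of the guiding (registered RGB) image at p and q."""
    diff = np.asarray(guide[p], dtype=float) - np.asarray(guide[q], dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma_g ** 2))

def conf(q, T_out):
    """Eq. (8): confidence factor based on the distance to the initial hole boundary."""
    return 1.0 / (1.0 + 2.0 * T_out[q])

def weight(p, q, guide, sigma_g, T_out, d0=1.0):
    """Eq. (5): combined pixel weight used by the color-guided FMM inpainting."""
    return w_dst(p, q, d0) ** 2 * w_g(p, q, guide, sigma_g) * conf(q, T_out)
```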

Furthermore, as in GFMM [33], the pixel inpainting priority takes into account both the distance to the initial hole boundary, given by T(p), and the guidance value of neighboring pixels, so that homogeneous areas are inpainted before regions more likely to be transition or edge areas. However, the priority function used in SReD’s inpainting for target generation, \(\textrm{Pr}(p)\), introduces a new normalization variable \(T_\textrm{max}\), leading to the final equation:

$$\begin{aligned} \textrm{Pr}(p)= & {} (1 - \lambda ) \cdot \dfrac{T(p)}{T_\textrm{max}} + \lambda \cdot (1 - S_g(p))\end{aligned}$$
(9)
$$\begin{aligned} S_g(p)= & {} \dfrac{1}{\vert N(p) \vert } \cdot \sum \limits _{q \in N(p)} w_g(p,q), \end{aligned}$$
(10)

where \(S_g(p)\) (Eq. (10)) gives the local guide similarity at pixel p, \(\vert N(p) \vert\) denotes the number of known pixels in the neighborhood of p, \(T_{max}\) is the greatest value in distance map T, and \(\lambda\) is a mixing parameter. (Note: lower Pr values denote greater priority.)
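
Eqs. (9) and (10) then translate into the following priority computation (our sketch; the \(w_g\) values are those of Eq. (7) above, and \(\lambda = 0.5\) is only a placeholder, not a value prescribed by the method):

```python
import numpy as np

def priority(p, T, T_max, guide_similarities, lam=0.5):
    """Eq. (9): inpainting priority of pixel p (lower values are processed first).

    T                  : distance map to the initial hole boundary
    T_max              : largest value in T (normalisation)
    guide_similarities : list of w_g(p, q) values over the known neighbours q of p
    lam                : mixing parameter (0.5 is a placeholder)
    """
    # Eq. (10): mean guide similarity between p and its known neighbourhood
    S_g = np.mean(guide_similarities) if len(guide_similarities) > 0 else 0.0
    return (1.0 - lam) * T[p] / T_max + lam * (1.0 - S_g)
```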

3.3 Denoising network

The denoising neural network implemented in SReD adopts a convolutional autoencoder architecture based on the U-Net design [47]. This architecture is primarily influenced by features from MF2F [13], FastDVDnet [52], and Noise2Noise [27]. During inference, the network takes as input three sequential depth frames, specifically \(d_{t-2}, d_{t-1}, d_{t}\). Conversely, during training, the input frames are dilated, namely \(d_{t-4}, d_{t-2}, d_{t}\). This setup enables using an inpainted version of frame \(d_{t-1}\) as the target, thereby preventing the network from learning the identity function. The network employs the mean absolute error (MAE or \(L_1\)) loss function to measure the discrepancy between the network’s prediction and the inpainted target \(d^*_{t-1}\). In noisy regions, this approach replicates the effects of Noise2Noise [27], while for depth holes the network learns to inpaint.

Using a U-Net [47] helps with image denoising because the skip-connections pass higher-frequency details from the encoding stage to the decoding stages via layer concatenation. This propagation allows the network to “flatten” noisy areas while still preserving sharp image features, such as object contours. Regarding the layer structure, the general layout loosely follows the model presented in Noise2Noise [27], differing mainly in the number of channels at each network block and in the downsampling/upsampling layers. Instead of max pooling and 2D upsampling layers, SReD uses 2D convolutions with stride two and transposed 2D convolutions, giving the model more learning flexibility.

Additionally, like in FastDVDnet [52], the network applies a final residual operation between the input frame \(d_{t}\) and the frame resulting from the last convolutional layer in the model \(d_\textrm{last}\), yielding the depth frame prediction \(d_\textrm{pred} = d_t - d_\textrm{last}\).
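
A compact Keras sketch of a network in this spirit is shown below: three stacked depth frames in, one denoised frame out, strided convolutions for downsampling, transposed convolutions for upsampling, skip connections via concatenation, and a final residual subtraction from \(d_t\). This is our simplified approximation rather than the exact SReD model: the filter widths follow the configuration listed in Sect. 4.5, the block composition is reduced, and the input is assumed to be padded to a multiple of \(2^5\) so that the five down/up stages align.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_denoiser(height, width, filters=(32, 48, 48, 64, 128)):
    """Simplified U-Net-style denoiser: input [d_{t-2}, d_{t-1}, d_t], output denoised d_t."""
    inputs = layers.Input(shape=(height, width, 3))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)

    skips = []
    for f in filters:                                          # encoder: strided convolutions
        skips.append(x)
        x = layers.Conv2D(f, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)

    for f, skip in zip(reversed(filters), reversed(skips)):    # decoder: transposed convolutions
        x = layers.Conv2DTranspose(f, 3, strides=2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])                    # skip connection
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)

    d_last = layers.Conv2D(1, 3, padding="same")(x)
    d_t = layers.Lambda(lambda t: t[..., -1:])(inputs)         # most recent input frame d_t
    d_pred = layers.Subtract()([d_t, d_last])                  # residual: d_pred = d_t - d_last
    return tf.keras.Model(inputs, d_pred)

# Kinect v2 depth frames (424x512) would be padded to 448x512 in this sketch
model = build_denoiser(448, 512)
model.compile(optimizer="adam", loss="mae")                    # L1 loss, as described above
```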

4 Evaluation

We conducted a comprehensive evaluation of our method to assess the algorithm’s performance. A quantitative evaluation was performed using a reference-independent noise metric to measure SReD’s depth-restoration capabilities objectively. We also performed several tests using a synthetic dataset that provides usable artificial ground-truth data. The algorithm’s time performance was also assessed to determine its suitability for real-time applications. Furthermore, we evaluated the method’s temporal coherence using a specialized metric. Finally, we compared SReD to other relevant reference-independent restoration algorithms to situate its performance within the broader landscape of available techniques.

Fig. 3 Example scenes from the ground-truth dataset demonstrating (from left to right) RGB color image, real depth map, depth map with synthetic noise added (missing values in black color), results from the total variation method, SelfReDepth, and error map from our approach

4.1 Data and metrics

Identifying an appropriate combination of data and metrics for evaluating SReD proved to be a non-trivial task. Ideally, we would have access to a consumer-grade depth video dataset featuring raw frame sequences and reference depth data, perhaps captured using a high-precision laser sensor. However, such a dataset is not readily available. This very challenge underscores the importance of developing self-supervised depth denoisers like SReD.

We evaluated SReD on a depth video dataset devoid of reference depth. The evaluation also used appropriate reference-independent metrics. We conducted comprehensive tests on the CoRBS dataset [58], explicitly focusing on the Kinect v2 subset. These data include five distinct RGB-D frame sequences capturing a stationary scene with a mobile camera, resulting in an aggregate of approximately 14,000 depth frames. In terms of evaluation metrics, the denoised depth frames underwent quantitative assessment concerning noise through a “non-reference metric for image denoising” [22] (NMID), a robust measure based on structure similarity maps from both homogeneous and highly-structured regions, in the absence of the original clean data.

Additional tests relied on synthetic depth data from the InteriorNet dataset published in Ref. [31], which provides computer-rendered RGB and depth images for various indoor scenes. For evaluating against this synthetic ground-truth data, we employed proper reference metrics for the comparisons: MSE, PSNR, and SSI. Since this dataset does not provide noisy data, we introduced synthetic noise using the Kinect v1 noise model developed by Handa et al. [19], which combines Gaussian noise, bilinear interpolation, and quantization to produce noisier pixels at greater distances and missing depth values at pixels whose corresponding normals are close to perpendicular to the camera’s viewing direction.

To further assess our approach’s feasibility on these data, we evaluated it against the total variation (TV) method [7]. This denoising technique reduces the total magnitude of the image’s intensity gradient while trying to preserve object boundaries. The regularization weight used in the algorithmic implementation from [56], which controls the denoising strength at the expense of fidelity to the original image, was set to 0.4, as this value maximized the mean scores across both datasets in our experiments.
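
For reference, assuming the implementation in [56] behaves like scikit-image’s denoise_tv_chambolle, the kind of call we have in mind looks as follows (the input file name is hypothetical):

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

depth = np.load("noisy_depth.npy").astype(np.float32)   # hypothetical noisy depth map
depth /= depth.max()                                     # normalise to [0, 1] for the filter
denoised = denoise_tv_chambolle(depth, weight=0.4)       # weight trades smoothing vs. fidelity
```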

Furthermore, we assess temporal coherence using straightforward image differences, \(M_\textrm{temp} = {\text {mean}}(I_{t+1} - I_{t})\). We address depth value oscillations over time by analyzing granular noise values on a frame-by-frame basis over contiguous video sequences from the dataset.

Fig. 4 Temporal analysis of the mean depth value differences across frames. The noisy data were generated from a sample video sequence extracted from the InteriorNet synthetic dataset and restored using the SelfReDepth and total variation methods

4.2 Experimental setup

We ran all experiments on a Windows 10 desktop machine with an NVIDIA GeForce RTX 3080 GPU, a Ryzen 7 3700x 8-core CPU, 16 GB of RAM, and an SSD disk drive. We developed and tested SReD using Python 3.10.8 and tensorflow-gpu 2.10, along with CUDA 11.2 and cuDNN 8.1.

4.3 Results

We trained SReD with a batch size of 16 for 200 epochs on the CoRBS [58] dataset, which we thoroughly shuffled and split with validation and test fractions of 0.1 and 0.04, respectively.

The denoised depth maps in Fig. 5 show that the model learned how to attenuate the noise in the original depth maps and fill depth holes. Moreover, on the hardware used for evaluation, the model takes, on average, 9 ms to denoise each depth map. Given that a regular RGB-D sensor, such as the Kinect v2, records data at a frequency of 30 frames per second (33 ms per frame), this evaluation confirms that the model can achieve the desired real-time performance during inference.

Fig. 5 Visual comparison between four image restoration algorithms applied to an example image taken from real data from the CoRBS dataset. From left to right: SelfReDepth, Noise2Noise, Noise2Stack and FMM + BF

We also compared SReD against other approaches, including two deep-learning methods, Noise2Stack [44] and Noise2Noise [27], and two deterministic approaches: the total variation method [7] and a combination that applies FMM inpainting [53] followed by bilateral filtering [55] denoising. We chose Noise2Noise to evaluate how the implemented technique differs from the original self-supervised U-Net [47] denoiser and what benefits were secured by specifically targeting depth denoising. Similarly, we chose Noise2Stack [44] to compare SReD against another spatio-temporal depth denoiser. Lastly, we used the deterministic FMM + BF combination to evaluate how SReD fares against more traditional approaches that perform both denoising and inpainting.

As already mentioned, we used a non-reference noise metric, NMID [22], to quantify the denoising quality, and applied a direct image difference metric to evaluate temporal coherence on contiguous depth videos. The results can be seen in Table 1. From the measured values, we note that SReD attained promising results, rivaling the significantly more computationally expensive deterministic algorithms on the NMID metric and, as expected, achieving the best temporal coherence results on both datasets, since it relies on multiple consecutive frames for its inference process. In addition, the results in Fig. 4 show that the original per-frame noise discrepancies are mostly fixed, yielding temporally consistent values after denoising.

As shown in Fig. 5, Noise2Noise and Noise2Stack can only perform pixel denoising and not depth completion. As for the deterministic algorithm combination, while it was capable of denoising and inpainting the depth maps, it can also be visually seen that both the edge preservation and depth completion results are inferior to those of SReD.

4.4 Discussion

Based on the experiments and metrics, SReD effectively learned to reduce noise in depth maps. However, some image details fail to be recovered, as evidenced by the blurred features of the doll in the CoRBS [58] dataset. In cases where large black depth holes are present, inpainting struggles to reconstruct the missing data effectively due to the absence of depth information, resulting in over-smoothed restorations. While the results are promising, they highlight the need for further work on detail preservation.

Additionally, our method seems to struggle with accurately restoring the depth of objects very close to the camera, as seen in the top row of Fig. 3 and its rightmost error profile. This scenario where the objects are almost touching the camera was not seen in the training data but is very frequent in the synthetic dataset, signaling the need to extend the training set to a wider range of scenarios.

The metrics place SReD favorably against the other four algorithms evaluated. The visual analysis of the denoised data aligns well with the NMID metric values, reinforcing its reliability. Although not optimal, our method also performed favorably on the synthetic ground-truth dataset in terms of MSE, PSNR, and SSI scores. The weaker results on this synthetic dataset could be related to the synthetic noise being added a posteriori with a Kinect v1 noise model, whereas SReD was trained on Kinect v2 noise. This discrepancy, imposed by the lack of a usable Kinect v2 noise model implementation, might explain why more general image restoration approaches not based on deep learning achieved better results on this dataset.

In qualitative terms, upon visual inspection, SReD achieved consistent inpainting and denoising behavior and outperforms both deterministic approaches: the FMM + BF algorithm, particularly when filling missing areas and sharpening object boundaries, and TV, as the latter over-smooths the overall depth image, failing to preserve object details.

4.5 Real-time performance

Finally, the implemented algorithm can produce denoised frames at frequencies higher than 30 frames per second, making SReD suitable for real-time use. Indeed, on the computer hardware used for evaluation, the model requires, on average, 9 ms to denoise each depth map. Given that commercial off-the-shelf RGB-D sensors, such as the Kinect v2, generate data at 30 frames per second (33 ms per frame), our technique can achieve real-time performance at even higher frame rates. The modular design of the U-Net architecture allows for straightforward scalability to accommodate larger image sizes without a significant impact on computational time, thereby maintaining real-time performance, as detailed next.


SelfReDepth complexity: Our approach comprises three main blocks: the inpainting (Eqs. (1) and (5)), registration (Eqs. (2)–(4)) and denoising (U-Net network) procedures. As already stated in Sect. 3.1, our proposal is designed to satisfy real-time requirements, since we can modify the SReD architecture during the inference stage, which is the one that directly impacts the complexity budget. Specifically, we (i) remove the target generation and (ii) shift the denoiser to non-dilated input. Thus, only the U-Net (denoiser) needs to be carefully addressed, since it is the only block that affects the constrained time budget.

Table 1 Benchmark of several metrics for all the tested methods: NMID (higher is better), temporal difference (lower is better), PSNR (higher is better), MSE and SSI (lower is better)

Time complexity: We now detail how the architecture is adapted under a given complexity budget. The choice of a U-Net provides flexibility because its architecture can be adapted under a predefined budget. Network designs exhibit tradeoffs among several components, i.e., depth, number of filters, and filter sizes, from which scalability is accomplished. Of these, depth is the most influential with respect to accuracy; although this is not a straightforward observation, previous work [49, 51] has demonstrated its impact. The total time complexity of all the convolutional layers is given as \({{{\mathcal {O}}}}\Bigl ( \sum _{l=1}^d c_{l-1}\cdot s_{l}^2\cdot f_l\cdot m_{l}^2 \Bigr )\), where d is the depth of the network (i.e., the number of convolutional layers), l indexes the convolutional layer, \(c_{l-1}\) is the number of input channels of the l-th layer, \(f_l\) is the number of filters in the l-th layer (i.e., the width), and \(s_l\) and \(m_l\) are the spatial size of the filter and the spatial size of the output feature map, respectively. The time cost of fully connected and pooling layers is not considered, since these layers take only about 5–10\(\%\) of the computational time. This time complexity is the basis of the network design, from which we consider the tradeoffs between depth d, width \(f_l\), and filter sizes \(s_l\), inferring how the network scales in time.
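
The expression above can be evaluated directly when comparing candidate configurations under a fixed budget; the small helper below is ours, and the layer configurations in the example are arbitrary placeholders.

```python
def conv_cost(c_in, s, f, m):
    """Cost of one convolutional layer: c_{l-1} * s_l^2 * f_l * m_l^2 multiply-accumulates."""
    return c_in * s ** 2 * f * m ** 2

def network_cost(layers):
    """Total cost of a network given as (c_in, filter_size, n_filters, output_size) tuples."""
    return sum(conv_cost(*layer) for layer in layers)

# Example: three layers of a hypothetical encoder operating on shrinking feature maps
print(network_cost([(3, 3, 32, 256), (32, 3, 48, 128), (48, 3, 48, 64)]))
```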

Concretely, we design a model by replacing the layers in our experimental evaluation. This means that when we replace a few layers with some other layers, we must guarantee that the complexity is preserved without changing the remaining layers in the architecture. To design such a replacement, we progressively modify the model and observe the changes in accuracy. Our method addresses the following tradeoffs:

  1. depth d and filter sizes \(s_l\),

  2. depth d and width \(f_l\), and

  3. width \(f_l\) and filter sizes \(s_l\).

We illustrate one of these steps; the remaining ones follow an analogous procedure, only changing the corresponding parameters. As an illustration of tradeoff 1 (depth d versus filter size \(s_l\)), we replace a larger filter, say of size \(s_1\), with a cascade of smaller filters of size \(s_2\). Denoting the layer configuration as above, \(L_{\textrm{conf}} = c_{l-1}\cdot s_{l}^2\cdot f_l\), and considering two filter sizes, e.g. \(s_1=3\) and \(s_2 =2\), with \(c_{l-1}=N\) and \(f_l=N\), we have the following complexities:

$$\begin{aligned} {{{\mathcal {O}}}}_1&= N^2 \cdot s_1^2 \\ {{{\mathcal {O}}}}_2&= 2 \cdot N^2 \cdot s_2^2 \end{aligned}$$

In this replacement, an \(s_1\times s_1\) layer with N input/output channels is replaced by two \(s_2\times s_2\) layers with N input/output channels. After the replacement, the complexity involved in these layers is nearly unchanged, with a ratio of \(2s_2^2 / s_1^2\approx 1\).

With the strategy above, we can “deepen” the network under the same time complexity budget. This allows us to obtain several architectures and pick the one with the best accuracy. Concretely, from the experimental evaluation yielding the times mentioned in Sect. 4.5, we found the best accuracy with the configuration used in our final network, which has 31 layers and 1729 size-three convolutional filters, yielding 1,260,865 trainable parameters.
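
The replacement in tradeoff 1 can be checked numerically with the same per-layer cost as in the previous sketch (the channel count and feature-map size below are placeholders):

```python
def conv_cost(c_in, s, f, m):            # as in the previous sketch
    return c_in * s ** 2 * f * m ** 2

N, m = 64, 128                            # placeholder channel count and feature-map size
single  = conv_cost(N, 3, N, m)           # O_1 = N^2 * 3^2 * m^2
cascade = 2 * conv_cost(N, 2, N, m)       # O_2 = 2 * N^2 * 2^2 * m^2
print(cascade / single)                   # 2 * 2^2 / 3^2 ~= 0.89, roughly the same budget
```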


Scalability: Now, we delve into how the architecture scales with the size of input images. First, let us introduce some basic notation:

  • \({\textrm{conv2D}}_{{\textrm{F,st}}}\): 2D (contraction) convolution with F number of filters and with stride st

  • \({\textrm{conv2D}}^{\top }_{{\textrm{F,st}}}\): 2D (expansion) transpose convolution with F number of filters and with stride st

Our U-Net network includes the following main blocks:

  • First block: 2 x \({\textrm{conv2D}}_{32,1}\)

  • ith down block: 1 x \({\textrm{conv2D}}_{\textrm{F}_\textrm{i},2}\) + 1 x \({\textrm{conv2D}}_{\textrm{F}_\textrm{i},1}\)

  • ith up block: 2 x \({\textrm{conv2D}}_{\textrm{F}_\textrm{i},1}\) + 1 x \({\textrm{conv2D}}_{\textrm{F}_\textrm{i},2}^{\top }\)

  • Last block: 2 x \({\textrm{conv2D}}_{32,1}\) + 1 x \({\textrm{conv2D}}_{1,1}\)

We use five blocks for each down and up stage, thus having \(i\in \{1,\ldots ,5\}\). The number of filters for each block is \(F = \begin{bmatrix} F_0&\dots&F_5 \end{bmatrix} = \begin{bmatrix} 32&32&48&48&64&128 \end{bmatrix}\), where \(F_0\) accounts for the filter in the first and last blocks.

Now, it is straightforward to determine the cost of the convolutions in each block. Assuming an image size of \(W \times H\), we have:

  • First Block: \(2 \cdot F_0 \cdot W \cdot H\)

  • Down Block i: \(2 \cdot F_i \cdot W \cdot H \cdot 2^{-2i}\)

  • Up Block i: \(3 \cdot F_i \cdot W \cdot H \cdot 2^{-2i}\)

  • Last Block: \((2 \cdot F_0 + 1) \cdot W \cdot H\)

  • Total: \(\Bigl (1+4F_0 + 5 \sum _{i=1}^5{F_i \cdot 2^{-2i}}\Bigr ) W \cdot H = 189.625 W \cdot H\)
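
The per-pixel factor can be reproduced from the block costs listed above (a quick check of ours):

```python
F = [32, 32, 48, 48, 64, 128]              # F_0 ... F_5 as listed above

first = 2 * F[0]                                        # first block: 2 * F_0 per pixel
down  = sum(2 * F[i] * 4 ** -i for i in range(1, 6))    # down block i: 2 * F_i * 2^{-2i}
up    = sum(3 * F[i] * 4 ** -i for i in range(1, 6))    # up block i:   3 * F_i * 2^{-2i}
last  = 2 * F[0] + 1                                    # last block

print(first + down + up + last)                         # 189.625, i.e. cost ~ 189.625 * W * H
```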

This means that, for a \(\varDelta\)-increment in the image resolution \((W+\varDelta )(H+\varDelta )\), we have a complexity of \({{{\mathcal {O}}}}(\varDelta ^2)\). So, roughly speaking, a twofold increase in image resolution would entail a fourfold increase in image processing time using the same architecture and memory footprint. Assuming that in the worst case, 80% of the CPU time is spent on running the Neural Network, the processing time per frame would be around 30 ms for an effective frame rate of 30 Hz, which is still reasonable.

5 Limitations

While our technique has proven to be very effective at restoring depth values from noisy RGB-D images, it can be improved in several ways. A notable limitation involves adequately addressing high-frequency temporal noise. While effective for general noise reduction, averaging pixel values across frames falls short in capturing and mitigating these rapid fluctuations. This suggests potential for future refinement. More sophisticated techniques should be capable of discerning and smoothing out high-frequency temporal noise without compromising the dynamic content of the scenes.

6 Conclusions and future work

We introduced SelfReDepth, a self-supervised approach for denoising and completing low-quality depth maps generated by consumer-grade sensors. Our technique advances self-supervised learning in depth data denoising, offering a precise, data-driven architecture that requires no reference data. This flexibility makes SelfReDepth easily adaptable across various environments and applications.

SelfReDepth’s architecture features two main elements: a denoising network and a target generation component. The denoising network is inspired by the original Noise2Noise [27] and MF2F [13] video denoisers and is responsible for learning how to denoise depth data without the need for reference data. Meanwhile, the target generation component fills in the gaps in target depth frames using color-guided FMM inpainting. With this structure, the technique can denoise inaccurate depth values and inpaint missing ones.

We also implemented and assessed SelfReDepth for both denoising efficacy and time performance. Results indicate real-time noise elimination and successful inpainting of depth gaps. Future work will focus on preserving image details compromised by denoising. Training with synthetic data might also improve depth inpainting performance and dampen oscillations.

In future work, we aim to explore controllable image denoising to generate clean sample frames with human perceptual priors and to balance sharpness and smoothness. In most common filter-based denoising approaches, this can be straightforwardly achieved by regulating the filtering strength. However, for deep neural networks (DNN), regulating the final denoising strength requires performing network inference each time, which, of course, hampers real-time user interaction. Further work will address real-time controllable denoising, to be integrated into a video denoising pipeline that provides a fully controllable user interface for editing arbitrary denoising levels in real time with only one-time DNN inference.

SelfReDepth represents a significant advancement in data denoising, tackling noise and depth hole challenges with notable efficiency. The outcomes of our research are encouraging, illustrating the algorithm’s capacity to mitigate these problems. However, the concomitant loss of certain image details in the process highlights areas for potential improvement. This observation underscores the need for additional investigation while pointing to clear pathways for refining future algorithm iterations. Such enhancements aim to improve the balance between our denoising algorithm’s robustness and critical image detail preservation, enhancing its already remarkable efficiency and making it more applicable to very complex scenarios.