Article

Attention-Based Background/Foreground Monocular Depth Prediction Model Using Image Segmentation

Department of Information Engineering and Computer Science, Feng Chia University, Taichung 407102, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(21), 11186; https://doi.org/10.3390/app122111186
Submission received: 14 September 2022 / Revised: 26 October 2022 / Accepted: 2 November 2022 / Published: 4 November 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

While many monocular depth estimation methods have been proposed, determining depth variations in outdoor scenes remains challenging. Accordingly, this paper proposes an image segmentation-based monocular depth estimation model with attention mechanisms that can address outdoor scene variations. The segmentation model segments images into foreground and background regions and individually predicts depth maps. Moreover, attention mechanisms are also adopted to extract meaningful features from complex scenes to improve foreground and background depth map prediction via a multi-scale decoding scheme. From our experimental results, we observed that our proposed model outperformed previous methods by 27.5 % on the KITTI dataset.

1. Introduction

Self-driving cars are expected to become mainstream vehicles in the future. They require distance measurement techniques that combine depth information with object recognition to maintain safe distances from surrounding cars. Current depth estimation techniques can be divided into active and passive sensing [1]. Active sensing uses signal reflection to measure distance, as with LiDAR and infrared sensors. LiDAR obtains depth information from targets using high-accuracy laser scanning, but it is expensive and the equipment is large, which makes it unsuitable for commercial cars. Infrared sensors, which are cheaper, use infrared wavelengths to measure depth, but their effective sensing ranges are short. Meanwhile, passive sensing can estimate depth without emitting any signals; instead, it relies only on RGB information captured from cameras and computer vision techniques. Passive sensing is cheaper than active sensing and is commonly divided into binocular stereo vision and monocular depth estimation. Binocular stereo vision uses stereo matching to perform pixel correspondence and disparity calculations, which incurs considerable computing costs and performs poorly in low-texture scenes. With the development of artificial intelligence techniques, many studies have proposed neural network models that predict depth from RGB images. These methods are based on monocular depth estimation and only require monocular cameras, which are low cost and suitable for commercial cars. They can be further divided into supervised and unsupervised learning methods.
Supervised monocular depth estimation methods can be regarded as regression problems that use ground truth depth maps to train models mapping inputs (RGB images) to outputs (depth maps). Since convolutional neural networks (CNNs) are strong at learning representations, previous studies [2,3,4,5,6,7,8,9,10,11,12,13,14] have adopted CNN architectures to estimate depth from RGB images. The authors of [2] proposed a two-component stacked CNN to capture coarse-scale and fine-scale features to predict depth maps. Some studies [3,6,7,12,13,14] have proposed multi-scale methods to address the prediction difficulties caused by significant variations in foreground and background depth by fusing feature maps of different scales. The authors of [3] proposed a simple and fast multi-scale architecture with three prediction tasks. In addition, some researchers have adopted conditional random fields (CRFs) [4,8,9] to refine depth estimation techniques. The authors of [4] employed two CNNs to extract absolute and relative features that could reveal the absolute depth and relative distance between neighboring regions. Then, a CRF was applied to fine-tune the outputs of the two CNNs to estimate the depth of whole images.
Supervision-based methods require vast quantities of corresponding ground truth depth data for training. In contrast, unsupervised monocular depth estimation methods [15,16,17,18,19,20,21,22,23,24] can train models without any ground truth data. These methods [15,16,17,18,19,20,21,22,23,24] use the geometric constraints between frames as supervisory signals during the training phase. The authors of [16] used an image reconstruction loss between calibrated stereo pairs of color images to improve depth estimation accuracy. Some studies [20,22,23,24] have used ego-motion to predict depth; however, this may suffer from unidentified moving objects, resulting in poor prediction performance in certain areas. The authors of [15] proposed a geometry consistency loss for scale-consistent prediction that could avoid interference from moving objects and occlusion.
Some studies [25,26,27,28,29,30,31,32,33] have implicitly modeled the relationship between semantic segmentation and depth to improve performance. The authors of [27,28] proposed the joint training of a shared network with two different tasks: semantic segmentation and depth estimation. This was found to be helpful for both techniques. The authors of [29] proposed two types of propagation to learn transformations between semantic segmentation and depth feature spaces: cross-task and task-specific propagation.
As more attention mechanisms have been developed, some researchers have attempted to add attention mechanisms to monocular depth estimation methods to improve prediction accuracy. Building on multi-scale CRFs [34], the authors of [35] used a structured attention method to extract multi-scale features and automatically regulate the information flow between different scales. In addition, the authors of [36] proposed a patch-wise attention method to enhance local area prediction accuracy. Patch-wise attention can extract features between neighboring pixels in the channel and spatial dimensions in local areas.
As shown in Figure 1, significant differences in field depth affect depth estimation such that models cannot adequately predict foreground and background depth maps at the same time, especially in outdoor scenes. Previous studies [3,6,7,12,13,14,35] have proposed multi-scale-based methods to solve this problem. However, these methods do not take advantage of image segmentation, which can partition images into several image segments, i.e., image regions or objects, that can provide hints to improve depth estimation. In this paper, we propose an image segmentation-based monocular depth estimation model that incorporates attention mechanisms. We adopted an image segmentation method (namely the cluster method) to segment foreground and background regions. Then, the depth estimation model could individually predict the depth maps of the foreground and background regions while mitigating any interference from significant differences in image depth. Moreover, we attempted to use attention mechanisms to extract meaningful features in the channel and spatial dimensions to improve depth estimation. We conducted several experiments on the KITTI dataset [37] to validate the performance of our proposed model. An ablation study showed the improvements from adding different attention mechanisms. Once we identified the most suitable attention mechanism, we used it to individually predict foreground and background depth maps and stitch together pairs of maps to create final depth maps. The results showed that our method could outperform previous methods by 27.5 % on the KITTI dataset. The main contributions of our work are as follows:
  • The addition of attention mechanisms to extract meaningful features to improve depth estimation;
  • A segmentation technique that can segment foreground regions via a cluster method and use the segmented regions to improve foreground depth estimation;
  • A new architecture that can individually predict foreground and background maps while avoiding effects from significant differences in field depth, especially in outdoor scenes.
The remainder of this paper is organized as follows. Section 2 introduces some related works. Section 3 presents our proposed model. Our ablation study and comparison results are presented in Section 4. Section 5 concludes this paper.

2. Related Works

Early monocular depth estimation methods mainly relied on hand-crafted features. The authors of [38] predicted monocular depth using a linear model and depth cues, such as texture variations, gradients, and color. With the rise of deep learning, several deep learning-based monocular depth estimation methods have been proposed in recent years. Deep learning-based monocular depth estimation methods [39] can be divided into supervised learning and unsupervised learning methods. Some studies have implicitly modeled the relationship between semantic segmentation and depth to improve performance. In addition, some researchers have attempted to apply attention mechanisms to depth prediction neural networks. In the following sections, we introduce four categories of monocular depth prediction methods: supervised learning, unsupervised learning, joint learning, and depth prediction based on attention mechanisms.

2.1. Supervised Learning

The authors of [2] proposed the first CNN-based method that combined two components: a global coarse-scale network and a local fine-scale network. The former conducted global structure scene estimation based on entire images and the latter refined this estimation locally. The authors of [7] developed a deep ordinal regression network (DORN) that adopted dilated convolution to obtain high-resolution depth maps. In addition, ordinal regression was used to solve the slow convergence problem. Some researchers [12,13,14,40] have focused on using multi-scale CNNs to improve detail depth prediction. BTS [13] was developed as a novel DCNN that strengthened the effectiveness of guidance features and upsampled low-resolution depth maps from coarse to fine. The authors of [12] applied implicit constraints on a multi-scale decoder to guide the generation of adaptive depth estimation features, which could achieve better depth estimation results. LAPD [14] was constructed as a multi-scale autoencoder structure that incorporated a Laplacian pyramid into a decoder architecture that adopted edge and feature maps of different scales to enhance global structure and local detail depth estimation. VNL [40] was proposed as a decoder using adaptive merging blocks to fuse features of different scales and used virtual normal directions to build geometric constraints to improve depth prediction performance. The authors of [34,35] proposed multi-scale CRFs to enhance feature fusion and refine depth estimation. The authors of [34] used multi-scale depth maps generated by the inner layer of a CNN and fused them using a CRF framework.

2.2. Unsupervised Learning

The authors of [16] demonstrated an unsupervised deep neural network for single-image depth estimation. Instead of using the rare and resource-consuming method of aligned ground truths, the authors used the collected binocular stereo data and proposed a consistency loss that enhanced the consistency of the predicted depth maps across views. The authors of [17] proposed a monocular depth estimation method that did not require many labeled training samples, mainly using the autoencoder method. To train the model, they used sets of two images with known displacements as stereo pairs; however, due to the limitations of the selected sensor, not every pixel in the images had corresponding ground truth depth values. The authors of [41] proposed a method that could integrate supervised learning and unsupervised learning. Pixels with corresponding ground truth depth values in images were used for supervised learning, while those without corresponding ground truth depth values were used for unsupervised learning. This combination avoided the model predicting the locally optimal solution and improved learning speed. SfMLearner [22] was proposed as an end-to-end unsupervised learning method that could estimate image depth and camera pose from video sequences. It used depth and poses for warping and then used the target and warped images to calculate the loss.

2.3. Joint Learning

SOSD-Net [30] was developed as a joint monocular depth estimation and semantic segmentation neural network that could explore the geometric relationship between monocular semantic segmentation and depth estimation to embed semantic objectness to simultaneously learn geometric cues and scene parsing. The authors of [32] proposed a collaborative deconvolutional neural network that could perform depth estimation and semantic segmentation. It consisted of depth and semantic deconvolutional neural networks. A point-wise bilinear layer integrated feature maps from the two deconvolutional neural networks to fuse the semantic and depth information. Then, the integrated features were used for semantic segmentation and depth estimation by two sibling classification layers. The authors of [31] proposed a constrained monocular self-supervised depth estimation method in which depth prediction could be aligned to its segmentation counterpart. It used segmentation and depth estimation to estimate object boundaries. Then, it adopted the distance between the segmentation and depth edges to measure consistency for optimizing the depth estimation model. DSPNet [25] used a ResNet50 network to extract multi-level feature maps with multi-task learning to achieve a favorable accuracy-to-computation (GFLOPs) trade-off and simultaneously perform object detection, depth estimation, and image segmentation. The TRL framework [33] adopted the historical experiences of features from the previous time steps of two tasks to improve the estimation of the two tasks in the next time step. The authors designed a task-attentional model to propagate the feature streams and correlate the two tasks, in which useful features were enhanced while task-irrelevant features were suppressed. The authors of [26] proposed a single real-time model to perform depth estimation and the semantic understanding of scenes simultaneously. They modified a real-time semantic segmentation network to reduce the number of floating point operations.

2.4. Depth Estimation Based on Attention Mechanisms

Humans can efficiently find regions of interest in complex scenes. Inspired by this behavior, attention mechanisms are used in CNNs to make models imitate the human visual system. Attention mechanisms can be viewed as dynamic weight adjustments to input image features. Attention mechanisms can be divided into three types [42]: spatial attention, channel attention, and mixed attention mechanisms. Spatial attention mechanisms [43] learn the relationships between pixels in images, generate plane attention maps, and generate corresponding weights for different pixels in order to select substantial spatial regions. Channel attention mechanisms [44] model the dependencies on the channels and allocate different weights to different channels. Mixed attention mechanisms [45] take advantage of channel and spatial attention mechanisms to generate channel and spatial attention masks and then use them to select valuable features.
Some researchers have proposed CNN attention mechanisms for DNN-based monocular depth methods [35,36,43,46,47]. PWA [36] was proposed as an approach that could focus on each local area and learn the relationships between neighboring pixels using corresponding spatial and channel attention networks to generate attention maps and detect essential features. CAD [46] was proposed as a channel-wise attention-based depth estimation network, in which the channel-wise-based structure perception module helped the model to enhance the perception of large-scale structures. The detail emphasis module used weight vectors to refine channel-wise feature maps to highlight local features and improve detail depth prediction. VISTA [47] was proposed as a variational structured attention network, which introduced a structured attention mechanism based on a multi-scale CNN that could model the correlation between a spatial-wise attention tensor and a channel-wise attention tensor to refine object depth estimation. SACRF [35] used an attention mechanism to adjust feature maps at different scales based on a multi-scale CRF model.

3. Method

As shown in Figure 2, our proposed method can be divided into five modules: attention-based feature extraction, foreground/background segmentation, foreground depth prediction, background depth prediction, and depth map stitching. Firstly, ResNet101 [48] is applied to extract feature maps from RGB images. Then, attention mechanisms are used to extract meaningful features $f_a^i$ from the feature maps at different scales, where $i = 1$ to 4. Secondly, the features are used to predict background depth maps via a multi-scale depth decoding scheme. Thirdly, to enhance the performance of foreground depth map prediction, RGB images of different scales are segmented into image regions and objects using mask region-based convolutional neural networks (Mask R-CNNs) [49] and foreground regions and objects are extracted, which are denoted as $f_s^i$. Fourthly, $f_a^i$ and $f_s^i$ are then used to predict foreground depth maps. Finally, the depth maps from the background and foreground depth prediction models are stitched together using a stitching scheme to fuse the two maps into the final depth map $d$.
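To make the data flow between the five modules concrete, the following minimal PyTorch sketch wires them together; the submodule interfaces (backbone, attention, decoders, and segmenter) are illustrative assumptions rather than the exact implementation.

```python
import torch.nn as nn

class DepthPipeline(nn.Module):
    """Hypothetical wiring of the five modules in Figure 2; the submodule
    interfaces are assumptions, not the released implementation."""
    def __init__(self, backbone, attention, bg_decoder, fg_decoder, segmenter):
        super().__init__()
        self.backbone = backbone      # ResNet101 multi-scale feature extractor
        self.attention = attention    # attention applied to each scale f^i -> f_a^i
        self.bg_decoder = bg_decoder  # background multi-scale depth decoder
        self.fg_decoder = fg_decoder  # foreground multi-scale depth decoder
        self.segmenter = segmenter    # Mask R-CNN + K-means foreground masking

    def forward(self, rgb):
        feats = self.backbone(rgb)                 # multi-scale features f^i
        f_a = [self.attention(f) for f in feats]   # attended features f_a^i
        d_b = self.bg_decoder(f_a)                 # background depth map d_b
        f_s, fg_mask = self.segmenter(rgb)         # foreground segments and mask
        d_f = self.fg_decoder(f_a, f_s)            # foreground depth map d_f
        # depth map stitching (direct strategy): paste d_f over d_b at foreground pixels
        return fg_mask * d_f + (1.0 - fg_mask) * d_b
```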

3.1. Attention-Based Feature Extraction

The attention mechanisms that are widely used in computer vision are dynamic weight adjustment processes that can find salient regions in complex scenes [43,44,45,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66]. Thus, we adopted attention mechanisms that could extract meaningful features from salient regions to use as clues to predict the depth of complex scenes. Attention mechanisms can be divided into three categories: channel-wise, spatial-wise, and mixed attention mechanisms. Channel-wise attention methods adaptively recalibrate the weight of each channel to focus on meaningful features [44,50,51,52,53,54]. Spatial-wise attention methods focus on finding the relationships between pixels and training planar attention maps with corresponding weights for different regions within images [43,55,56,57,58,59,60]. Mixed attention methods combine the advantages of the other two types of attention mechanism to adaptively fine-tune the weight of each channel and planar map [45,61,62,63,64,65,66]. Therefore, we took advantage of mixed attention mechanisms to extract features. Firstly, ResNet101 [48] is applied to extract multi-scale feature maps $f^i$ (where $i = 1$ to 4) from RGB images, as shown in Figure 3. Then, $f^i$ is sent to the attention mechanism to extract features $f_a^i$ (where $i = 1$ to 4). The details of our attention mechanism feature extraction strategy are shown in the Experiments Section. Finally, the extracted features $f_a^i$ are used to predict foreground and background depth maps.
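As a concrete illustration of the feature extraction step, the following sketch pulls the four intermediate feature maps $f^i$ from a frozen, ImageNet-pre-trained ResNet101 using torchvision (a recent torchvision release is assumed); the chosen layer names and input size are illustrative assumptions.

```python
import torch
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

# Frozen ResNet101 backbone returning four intermediate feature maps f^1..f^4.
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "f1", "layer2": "f2", "layer3": "f3", "layer4": "f4"})
for p in extractor.parameters():          # backbone weights stay fixed during training
    p.requires_grad = False

rgb = torch.randn(1, 3, 352, 704)         # cropped KITTI-sized input (H x W)
feats = extractor(rgb)                    # dict of multi-scale feature maps f^i
print({name: f.shape for name, f in feats.items()})
```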

3.2. Background Depth Prediction

After extracting $f_a^i$, a multi-scale depth decoding scheme is then applied to predict background depth maps [14]. The multi-scale depth decoding scheme has a multi-scale autoencoder structure that uses a multi-scale decoder to restore depth maps from the multi-scale features $f_a^i$ and edge maps $l^i$ (where $i = 1$ to 4), which are extracted by the attention mechanism and a Laplacian pyramid [14], respectively. The multi-scale decoder has five scales. The decoder at each scale consists of a different number of 3 × 3 convolutional layers, and the number of convolutional layers decreases from high to low scales. After $f_a^i$ and $l^i$ pass through the convolutional operations at each scale, the decoder of each scale outputs depth residuals of different scales. The depth residuals are resized using bilinear interpolation and element-wise summation is used to combine the depth residuals of each scale and obtain the background depth map $d_b$. Then, $d_b$ is used to predict the final depth map in the depth map stitching module. To avoid severe errors in the overall depth maps due to scale problems, we adopted a depth loss function based on the scale-invariant error introduced in [2], as shown below:
$$L_d = \frac{1}{N}\sum_{i=0}^{N} y_i^2 - \frac{\lambda_b}{N^2}\left(\sum_{i=0}^{N} y_i\right)^2, \qquad y_i = \log(d_i) - \log(d_i^*),$$
where $N$ is the total number of valid pixels, $d_i$ is the depth value of the $i$th pixel predicted by the background model, $d_i^*$ is the ground truth depth value of the $i$th pixel, and $\lambda_b$ is a coefficient. As with the settings in [14], we set $\lambda_b$ to 0.85.
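For reference, a minimal PyTorch sketch of this scale-invariant loss is given below; masking out invalid pixels (ground truth equal to zero) is an assumption about how the KITTI depth maps are stored.

```python
import torch

def scale_invariant_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant depth loss in the style of Eigen et al. [2];
    lambda_b = 0.85 follows the setting of [14]. Sketch only."""
    valid = target > 0                          # keep pixels with ground truth depth
    y = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    n = y.numel()
    return (y ** 2).sum() / n - lam * y.sum() ** 2 / n ** 2

# usage (shapes are illustrative): loss = scale_invariant_loss(d_b, gt_depth)
```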

3.3. Foreground Segmentation

To determine foreground regions, a Mask R-CNN [49] is utilized to partition the images into image segments (i.e., image regions or objects), which can be used as clues to improve foreground depth prediction. Mask R-CNNs are an extension of Faster R-CNNs [67], which use region proposal networks (RPNs) to propose candidate object bounding boxes and then classify and resize the bounding boxes. To segment the shapes of objects, Mask R-CNNs add mask branches to reshape the bounding boxes. In our model, RGB images are segmented from multi-scale images to extract image segments of different scales. After partitioning the images into image segments, the areas of the segments are used to classify the segments into foreground and background categories using the K-means cluster method. Finally, the foreground segments $f_s^i$ (where $i = 1$ to 4) are used for foreground depth prediction. The details of our foreground and background segment classification strategy are shown in Section 4.
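The segmentation step can be sketched as follows using a pre-trained torchvision Mask R-CNN as a stand-in; the score and mask binarization thresholds are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Off-the-shelf Mask R-CNN used here only to illustrate segment extraction.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def instance_masks(rgb, score_thr=0.5):
    """Return binary instance masks and their pixel areas for one RGB image
    (rgb is a float tensor of shape (3, H, W) with values in [0, 1])."""
    out = model([rgb])[0]
    keep = out["scores"] > score_thr
    masks = out["masks"][keep, 0] > 0.5       # (K, H, W) boolean masks
    areas = masks.flatten(1).sum(dim=1)       # segment areas in pixels
    return masks, areas
```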

3.4. Foreground Depth Prediction

Since depth variations are nonlinear in outdoor scenes, the foreground segments $f_s^i$ are used as clues to improve the depth prediction of object surfaces. The architecture of the foreground depth prediction model is shown in Figure 3. The features $f_a^i$ and $f_s^i$ are concatenated together and a multi-scale depth decoding scheme is used to predict the foreground depth map $d_f$. The foreground depth prediction loss function in [68] can enhance foreground depth prediction by penalizing depth errors on foreground object surfaces. The loss function is shown below:
$$L_{fg} = \lambda_f \times E_{fg} + (1 - \lambda_f) \times E_{bg},$$
where $E_{fg}$ and $E_{bg}$ are the foreground and background depth root mean square errors (RMSEs), respectively, and $\lambda_f$ is a coefficient. As in [68], we set $\lambda_f$ to 0.8.
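A minimal sketch of this foreground-weighted loss is given below; the handling of invalid and empty regions is an assumption.

```python
import torch

def foreground_weighted_rmse(pred, target, fg_mask, lam_f=0.8, eps=1e-6):
    """L_fg = lam_f * E_fg + (1 - lam_f) * E_bg, where E_fg and E_bg are the
    RMSEs over foreground and background pixels (lambda_f = 0.8 follows [68])."""
    valid = target > 0
    fg = valid & fg_mask.bool()
    bg = valid & ~fg_mask.bool()

    def rmse(sel):
        if sel.sum() == 0:                         # guard against empty regions
            return pred.new_tensor(0.0)
        return torch.sqrt(((pred[sel] - target[sel]) ** 2).mean() + eps)

    return lam_f * rmse(fg) + (1.0 - lam_f) * rmse(bg)
```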

3.5. Depth Map Stitching

To estimate the final depth maps, we use two strategies to fuse the foreground and background depth maps $d_f$ and $d_b$, as follows. Strategy 1: intuitively extract the foreground depth from $d_f$ and then overlay it onto the corresponding foreground positions in $d_b$ for the final depth map estimation. Strategy 2: apply a 3 × 3 convolution layer to fuse the background and foreground depth maps for the final depth map estimation. The CNN fusing loss function in [2] can avoid severe errors due to scale problems in the overall depth maps. The loss function is shown below:
$$L_{cf} = \frac{1}{N}\sum_{i=0}^{N} y_i^2 - \frac{\lambda}{N^2}\left(\sum_{i=0}^{N} y_i\right)^2, \qquad y_i = \log(d_i) - \log(d_i^*),$$
where $N$ is the total number of valid pixels, $d_i$ is the depth value of the $i$th pixel fused by the CNN, $d_i^*$ is the ground truth depth value of the $i$th pixel, and $\lambda$ is a coefficient. As in [2], we set $\lambda$ to 0.5. In the Experiments Section, we compare the performance of the above strategies.
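Both stitching strategies can be sketched in a few lines of PyTorch; the mask handling and channel layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

def direct_stitch(d_f, d_b, fg_mask):
    """Strategy 1: overlay the foreground depth d_f onto the background map d_b
    at the foreground positions given by fg_mask."""
    m = fg_mask.float()
    return m * d_f + (1.0 - m) * d_b

class CNNStitch(nn.Module):
    """Strategy 2: fuse the two depth maps with a 3 x 3 convolution."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, d_f, d_b):
        return self.fuse(torch.cat([d_f, d_b], dim=1))
```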

4. Experiments

4.1. Dataset

We evaluated our model on the KITTI dataset. The KITTI dataset contains 61 outdoor scenes that were collected using cameras and laser scanners on moving vehicles. The outdoor scenes often have significant depth variations and contain nearby objects, such as homes, pedestrians, and cars, as well as far away objects, such as the sky. The resolution of the RGB images is 1242 × 375 pixels and each RGB image has a corresponding depth map. We followed [2] to split the dataset into training and testing sets. The training set had 32 scenes with 23,488 images and the testing set had 29 scenes with 697 images.

4.2. Implementation Details

4.2.1. Data Augmentation

Data augmentation is a common way to reduce overfitting in neural network models and improve their performance. We augmented the KITTI dataset via the following steps: we randomly cropped 704 × 352 pixel segments from the original images; we then horizontally flipped the images and randomly rotated them in the range of −3 to 3 degrees; finally, the RGB values were globally multiplied by a random value from 0.9 to 1.1.
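A sketch of this augmentation pipeline is shown below; applying the same geometric transforms to the paired depth map is an assumption about how image–depth alignment is preserved during training.

```python
import random
import torchvision.transforms.functional as TF

def augment(img, depth, crop_hw=(352, 704)):
    """Random 352 x 704 crop, horizontal flip, rotation in [-3, 3] degrees, and a
    global RGB scaling in [0.9, 1.1]. img: (3, H, W), depth: (1, H, W) tensors."""
    _, H, W = img.shape
    ch, cw = crop_hw
    top, left = random.randint(0, H - ch), random.randint(0, W - cw)
    img = TF.crop(img, top, left, ch, cw)
    depth = TF.crop(depth, top, left, ch, cw)
    if random.random() < 0.5:                        # horizontal flip
        img, depth = TF.hflip(img), TF.hflip(depth)
    angle = random.uniform(-3.0, 3.0)                # small random rotation
    img, depth = TF.rotate(img, angle), TF.rotate(depth, angle)
    img = (img * random.uniform(0.9, 1.1)).clamp(0.0, 1.0)  # global color scaling
    return img, depth
```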

4.2.2. Training Settings

We used the PyTorch framework to build and train our proposed model on an Nvidia RTX 3090 GPU with 24 GB of memory. For the foreground depth prediction model, we set the batch size and learning rate to 1 and $10^{-4}$, respectively. For the background depth prediction model, we set the batch size and learning rate to 5 and $10^{-4}$, respectively. The two models were trained for 50 epochs with the Adam optimizer. We chose a ResNet101 network that was pre-trained on ImageNet [69] as the feature extractor. The weights of the ResNet101 were fixed during training.
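The training loop corresponding to these settings can be sketched as follows; the data loader and the single loss call are schematic placeholders for the components described in Section 3.

```python
import torch

def train(model, loader, loss_fn, epochs=50, lr=1e-4, device="cuda"):
    """Training sketch: Adam, lr = 1e-4, 50 epochs, frozen ResNet101 backbone.
    The batch size is set in the loader (1 for foreground, 5 for background)."""
    model.to(device)
    params = [p for p in model.parameters() if p.requires_grad]  # backbone excluded
    optim = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for rgb, gt in loader:
            rgb, gt = rgb.to(device), gt.to(device)
            pred = model(rgb)
            loss = loss_fn(pred, gt)
            optim.zero_grad()
            loss.backward()
            optim.step()
```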

4.2.3. Foreground and Background Region Classification

To improve foreground depth prediction, we needed to use foreground segments as clues. We used a Mask R-CNN to segment the RGB images from the KITTI dataset into image segments. We only considered physical objects, such as cars, pedestrians, and bicycles, and excluded the sky, roads, and mountains. Then, we calculated the areas of the image segments and used K-means clustering to classify the image segments into foreground and background regions. Figure 4 shows that an area of 20,000 pixels was set as the threshold for classifying the image segments from the KITTI dataset into foreground and background regions.
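The area-based K-means split can be sketched as follows; treating the cluster with the larger mean area as foreground mirrors the 20,000-pixel threshold observed above, and the scikit-learn usage is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_fg_bg(areas):
    """Cluster segment areas into two groups and treat the cluster with the
    larger mean area as foreground (nearby objects appear larger in the image).
    Expects at least two segments; returns a boolean mask over segments."""
    a = np.asarray(areas, dtype=np.float64).reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(a)
    fg_cluster = max((0, 1), key=lambda c: a[labels == c].mean())
    return labels == fg_cluster

# e.g. split_fg_bg([1500, 32000, 800, 45000]) -> [False, True, False, True]
```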

4.2.4. Configuration of Attention Mechanisms

To extract meaningful features, we adopted three kinds of attention mechanisms: channel-wise, spatial-wise, and mixed attention mechanisms. Firstly, for the channel-wise attention mechanism, we utilized a squeeze-and-excitation (SE) block [44], which performed global average pooling on the feature maps and used a fully connected layer to output the channel-wise attention maps. In this way, our model could learn global information about the feature maps and extract the features of large-scale structures. Secondly, for the spatial-wise attention mechanism, we utilized an attention module (AM) [43], which performed a convolution operation on the feature maps and used the sigmoid activation function to obtain the planar attention maps. This attention mechanism allowed our model to establish the relationships between pixels in the images. Thirdly, for the mixed attention mechanism, we utilized a convolutional block attention module (CBAM) [45], which sequentially generated channel-wise and spatial-wise attention maps and applied them to the feature maps. This method combined the characteristics of the other two attention mechanisms to adaptively fine-tune the weight of each channel and planar map. In our experiments, we attempted to select the most suitable attention mechanisms for the foreground and background depth prediction models to extract meaningful features.
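Simplified PyTorch versions of the three attention mechanisms are sketched below; the reduction ratio and kernel size are illustrative assumptions, and the modules follow the spirit of SE [44], AM [43], and CBAM [45] rather than reproducing them exactly.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention (squeeze-and-excitation style [44])."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))        # global average pooling -> (N, C)
        return x * w[:, :, None, None]         # reweight each channel

class SpatialAttention(nn.Module):
    """Spatial-wise attention in the spirit of the AM module [43]."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.conv(x)                # planar attention map

class MixedAttention(nn.Module):
    """CBAM-style mixture [45]: channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel = SEBlock(channels)
        self.spatial = SpatialAttention(channels)

    def forward(self, x):
        return self.spatial(self.channel(x))
```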

4.3. Evaluation Metrics

To compare the performance of our proposed model to those from other studies, we used the evaluation metrics proposed in [2], which have been widely applied to evaluate the performance of monocular depth prediction methods. The evaluation metrics are shown below:
$$\text{Abs Rel} = \frac{1}{|T|}\sum_{i \in T} \frac{|d_i - d_i^*|}{d_i^*}$$
$$\text{Sq Rel} = \frac{1}{|T|}\sum_{i \in T} \frac{\|d_i - d_i^*\|^2}{d_i^*}$$
$$\text{RMSE} = \sqrt{\frac{1}{|T|}\sum_{i \in T} \|d_i - d_i^*\|^2}$$
$$\text{RMSE log} = \sqrt{\frac{1}{|T|}\sum_{i \in T} \|\log(d_i) - \log(d_i^*)\|^2}$$
$$\text{Accuracy} = \%\ \text{of}\ d_i\ \text{s.t.}\ \max\!\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) = \delta < \text{threshold}$$
where $d_i$ and $d_i^*$ are the predicted depth value and the ground truth depth value of pixel $i$, respectively, and $T$ is the set of valid ground truth pixels (with $|T|$ its total number).
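For completeness, the metrics can be computed as follows; marking invalid ground truth pixels with zero is an assumption about the KITTI depth maps.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics from [2], computed over valid pixels."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": np.mean(np.abs(p - g) / g),
        "sq_rel": np.mean((p - g) ** 2 / g),
        "rmse": np.sqrt(np.mean((p - g) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```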

4.4. Ablation Experiment

In the ablation study, we tested the performance of each component in our proposed model, i.e., the background/foreground depth prediction models and the depth map stitching strategies, using different settings on the KITTI dataset.

4.4.1. Attention Mechanism Selection

Firstly, we compared the performance of our background depth prediction model with/without using the different attention mechanisms (AM, SE, and CBAM) for feature extraction. Table 1 shows that the background depth prediction model achieved the best performance when using the SE attention mechanism because the SE attention mechanism could extract global features that could enhance the background depth prediction of wide areas, especially in outdoor scenes. Figure 5 shows the depth prediction results from our background model using the different attention mechanisms.
We then evaluated our proposed foreground depth prediction model with/without using the different attention mechanisms for feature extraction. Table 2 shows that the foreground depth prediction model achieved the best performance when using the AM attention mechanism and indicates that the foreground depth prediction model needed some clues to enhance the foreground region depth prediction, especially using the spatial-wise attention mechanism to extract spatial features, which could reflect the 3D details of object surfaces. Figure 6 shows depth prediction results from our foreground model using the different attention mechanisms.

4.4.2. Depth Map Stitching

Finally, we combined the background and foreground depth maps into final depth maps. According to our experimental results, we chose to use the SE and AM attention mechanisms with our background and foreground depth prediction models to predict the background and foreground depth maps, respectively. Then, we compared the performance of two depth map stitching strategies: direct stitching and CNN stitching. The direct stitching strategy extracted the foreground depth regions from the foreground depth maps and overlaid them on the background depth maps. The CNN stitching strategy processed the directly stitched depth maps using convolutional layers with a filter size of 3 × 3 for stitching. Table 3 shows that the direct stitching strategy outperformed the CNN stitching strategy for stitching together the foreground and background depth maps. The CNN stitching strategy performed poorly as the convolutional layers blurred the depth maps. Figure 7 shows the depth prediction results from the different stitching strategies. Finally, we chose the direct stitching strategy to stitch together the foreground and background depth maps in the subsequent experiments.

4.5. Performance Comparison

We selected the model that achieved the best performance in the ablation study and compared it to models from previous studies on the KITTI dataset. The models from [2,12,35] fuse multi-scale features to integrate low-resolution and high-resolution depth maps. LAPD [14] uses a decoder architecture that adopts feature maps of different scales to enhance global structure and local detail depth estimation and combines edge maps with feature maps to improve depth prediction. DORN [7] uses deep ordinal regression and adopts dilated convolution to obtain high-resolution depth maps; ordinal regression can also solve the problem of slow convergence. As not every image in the KITTI dataset contains foreground regions, we applied two dataset settings to evaluate the depth map prediction performance of our model and the selected previous methods.
Firstly, we evaluated the depth prediction of images from the KITTI dataset that contain foreground regions and excluded images without foreground regions. Table 4 shows that our model outperformed the previous methods. Our proposed model outperformed the state-of-the-art LAPD model by 0.3% in δ < 1.25 and 0.009 in Sq Rel. Our proposed model was beneficial for foreground depth prediction because the foreground and background models were separated, avoiding interference between the foreground and background depth predictions. Moreover, we used an attention mechanism to extract meaningful features that could be used as clues to enhance foreground depth prediction. Figure 8 shows that our model could contour the details of foreground objects in the depth maps. For example, as shown in the second column, only our method could contour the windows of the SUV in the depth image. Although our proposed model did not offer significant improvements over the previous methods, only our proposed method could depict the details of foreground objects while mitigating interference from significant differences in image depth.
Secondly, we evaluated our model and the selected previous methods on the entire KITTI dataset, including images that contain foreground regions and those that do not. Table 5 shows the results of the depth map prediction. Our method achieved the best performance out of all of the tested models. It surpassed LAPD, which also considers multi-scale feature maps, by 0.1% in δ < 1.25 and 0.051 in RMSE. Even though this was only a minor improvement, it still shows that individually predicting foreground and background depth maps is beneficial to depth prediction. Meanwhile, it is also worth mentioning that our model adopted channel-wise and spatial-wise attention modules to extract features that could highlight global and local information and individually strengthen the background and foreground prediction models.
In addition, we also evaluated the execution times of our model and the previous methods. We randomly selected 30 images from the KITTI dataset to evaluate the execution time of depth prediction. The authors of DORN [7] and SACRF [35] only released their per-image depth prediction results for the KITTI dataset on GitHub, without code; therefore, we could not reproduce these two methods to evaluate their execution times. Table 6 shows the execution time results. Eigen [2] outperformed the other methods. Our method has three main components, a Mask R-CNN and the foreground and background depth prediction modules, which results in more time-consuming but more accurate depth prediction. We could further investigate model reduction strategies and optimize computation costs in the future.

5. Conclusions

This paper proposed an architecture that exploits foreground and background region information to individually predict foreground and background depth maps using attention mechanisms. As depth variations can be wide in outdoor scenes, monocular depth estimation is easily disturbed. Therefore, an image segmentation-based monocular depth estimation model with attention mechanisms was developed to segment images into foreground and background regions and use foreground and background depth prediction models to predict individual depth maps. To enhance depth prediction, we also adopted attention mechanisms to extract meaningful features from salient regions and use them as clues to predict depth in complex scenes. Subsequently, the foreground and background depth maps were stitched together to create the final prediction results.
Several experiments were conducted on the KITTI dataset [37] to demonstrate the improved performance of the proposed method over other methods for depth map estimation. The experimental results showed that our model could more clearly contour the shapes of foreground objects. This showed that individually predicting foreground and background depth maps using different attention modules is beneficial for depth map prediction. The critical methods utilized by our approach included a multi-scale depth decoding scheme and the utilization of different attention mechanisms. Although our approach performed well on the KITTI dataset, there is room for improvement in indoor scenes. Future work will investigate model reduction strategies and improvements in the depth map prediction of indoor scenes.

Author Contributions

Methodology, T.-H.C.; Software, T.-H.C. and M.-H.C.; Validation, T.-H.C. and M.-H.C.; Writing—original draft, T.-H.C.; Writing—review & editing, T.-H.C., M.-H.T. and C.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-sponsored by the National Science and Technology Council (NSTC) under grants 110-2634-F-A49-004- and 110-2221-E-035-032-MY3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, Y.; Jiang, J.; Sun, J.; Bai, L.; Wang, Q. A Survey of Depth Estimation Based on Computer Vision. In Proceedings of the IEEE 5th International Conference on Data Science Cyberspace, Hong Kong, China, 27–29 July 2020; pp. 135–141. [Google Scholar]
  2. Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
  3. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
  4. Hua, Y.; Tian, H. Depth estimation with convolutional conditional random field network. Neurocomputing 2016, 214, 546–554. [Google Scholar] [CrossRef]
  5. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the 4th International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
  6. Hu, J.; Ozay, M.; Zhang, Y.; Okatani, T. Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps With Accurate Object Boundaries. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1043–1051. [Google Scholar]
  7. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
  8. Li, B.; Shen, C.; Dai, Y.; van den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1119–1127. [Google Scholar]
  9. Mousavian, A.; Pirsiavash, H.; Košecká, J. Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks. In Proceedings of the 4th International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 611–619. [Google Scholar]
  10. Lee, J.H.; Heo, M.; Kim, K.R.; Kim, C.S. Single-Image Depth Estimation Based on Fourier Domain Analysis. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 330–339. [Google Scholar]
  11. Cheng, X.; Wang, P.; Yang, R. Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 108–125. [Google Scholar]
  12. Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326. [Google Scholar]
  13. Zuo, Y.; Fang, Y.; Yang, Y.; Shang, X.; Wu, Q. Depth Map Enhancement by Revisiting Multi-Scale Intensity Guidance Within Coarse-to-Fine Stages. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4676–4687. [Google Scholar] [CrossRef]
  14. Song, M.; Lim, S.; Kim, W. Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4381–4393. [Google Scholar] [CrossRef]
  15. Bian, J.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 35–45. [Google Scholar]
  16. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation With Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6602–6611. [Google Scholar]
  17. Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756. [Google Scholar]
  18. Luo, Y.; Ren, J.; Lin, M.; Pang, J.; Sun, W.; Li, H.; Lin, L. Single View Stereo Matching. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 155–163. [Google Scholar]
  19. Watson, J.; Firman, M.; Brostow, G.J.; Turmukhambetov, D. Self-Supervised Monocular Depth Hints. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 2162–2171. [Google Scholar]
  20. Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth Prediction without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. AAAI Conf. Artificial Intell. 2019, 33, 8001–8008. [Google Scholar] [CrossRef] [Green Version]
  21. Poggi, M.; Tosi, F.; Mattoccia, S. Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions. In Proceedings of the 6th International Conference on 3D Vision, Verona, Italy, 5–8 September 2018; pp. 324–333. [Google Scholar]
  22. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619. [Google Scholar]
  23. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging Into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 3827–3837. [Google Scholar]
  24. Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D Packing for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2482–2491. [Google Scholar]
  25. Chen, L.; Yang, Z.; Ma, J.; Luo, Z. Driving Scene Perception Network: Real-Time Joint Detection, Depth Estimation and Semantic Segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1283–1291. [Google Scholar]
  26. Nekrasov, V.; Dharmasiri, T.; Spek, A.; Drummond, T.; Shen, C.; Reid, I. Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 7101–7107. [Google Scholar]
  27. Chen, P.Y.; Liu, A.H.; Liu, Y.C.; Wang, Y.C.F. Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  28. Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2800–2809. [Google Scholar]
  29. Zhang, Z.; Cui, Z.; Xu, C.; Yan, Y.; Sebe, N.; Yang, J. Pattern-Affinitive Propagation Across Depth, Surface Normal and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4101–4110. [Google Scholar]
  30. He, L.; Lu, J.; Wang, G.; Song, S.; Zhou, J. SOSD-Net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing 2021, 440, 251–263. [Google Scholar] [CrossRef]
  31. Zhu, S.; Brazil, G.; Liu, X. The Edge of Depth: Explicit Constraints Between Segmentation and Depth. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  32. Liu, J.; Wang, Y.; Li, Y.; Fu, J.; Li, J.; Lu, H. Collaborative Deconvolutional Neural Networks for Joint Depth Estimation and Semantic Segmentation. IEEE Trans. Neural Netw. Learning Syst. 2018, 29, 5655–5666. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, Z.; Cui, Z.; Xu, C.; Jie, Z.; Li, X.; Yang, J. Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  34. Xu, D.; Ricci, E.; Ouyang, W.; Wang, X.; Sebe, N. Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 161–169. [Google Scholar]
  35. Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3917–3925. [Google Scholar]
  36. Lee, S.; Lee, J.; Kim, B.; Yi, E.; Kim, J. Patch-wise attention network for monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1873–1881. [Google Scholar]
  37. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  38. Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D Scene Structure from a Single Still Image. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 824–840. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1612–1627. [Google Scholar] [CrossRef]
  40. Yin, W.; Liu, Y.; Shen, C.; Yan, Y. Enforcing Geometric Constraints of Virtual Normal for Depth Prediction. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5683–5692. [Google Scholar]
  41. Kuznietsov, Y.; Stückler, J.; Leibe, B. Semi-Supervised Deep Learning for Monocular Depth Map Prediction. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2215–2223. [Google Scholar]
  42. Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T.; Zhang, S.; Martin, R.R.; Cheng, M.; Hu, S. Attention mechanisms in computer vision: A survey. arXiv 2021, arXiv:2111.07624. [Google Scholar] [CrossRef]
  43. Liu, S.; Johns, E.; Davison, A.J. End-To-End Multi-Task Learning With Attention. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar]
  44. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  46. Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation. In Proceedings of the International Conference on 3D Vision, Prague, Czech Republic, 12–15 September 2021; pp. 464–473. [Google Scholar]
  47. Yang, G.; Rota, P.; Alameda-Pineda, X.; Xu, D.; Ding, M.; Ricci, E. Variational Structured Attention Networks for Deep Visual Representation Learning. IEEE Trans. Image Process. 2022. Early Access. [Google Scholar] [CrossRef] [PubMed]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  49. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  50. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539. [Google Scholar]
  51. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context Encoding for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
  52. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  53. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global Second-Order Pooling Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3019–3028. [Google Scholar]
  54. Lee, H.; Kim, H.E.; Nam, H. SRM: A Style-Based Recalibration Module for Convolutional Neural Networks. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1854–1862. [Google Scholar]
  55. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  56. Yuan, Y.; Huang, L.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Ocnet: Object context network for scene parsing. arXiv 2018, arXiv:1809.00916. [Google Scholar]
  57. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  58. Xie, S.; Liu, S.; Chen, Z.; Tu, Z. Attentional ShapeContextNet for Point Cloud Recognition. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4606–4615. [Google Scholar]
  59. Yan, X.; Zheng, C.; Li, Z.; Wang, S.; Cui, S. PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5588–5597. [Google Scholar]
  60. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  61. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306. [Google Scholar]
  62. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  63. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual Attention Network for Image Classification. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6450–6458. [Google Scholar]
  64. Liu, J.J.; Hou, Q.; Cheng, M.M.; Wang, C.; Feng, J. Improving Convolutional Networks With Self-Calibrated Convolutions. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10093–10102. [Google Scholar]
  65. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148. [Google Scholar]
  66. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
  67. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  68. Wang, X.; Yin, W.; Kong, T.; Jiang, Y.; Li, L.; Shen, C. Task-Aware Monocular Depth Estimation for 3D Object Detection. AAAI Conf. Artificial Intell. 2020, 34, 12257–12264. [Google Scholar] [CrossRef]
  69. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
Figure 1. In scenes with significant variations in depth, the multi-scale depth estimation technique SACRF [35] cannot adequately predict individual background/foreground depth maps, whereas our model could improve individual foreground and background depth map prediction using an attention mechanism and image segmentation.
Figure 2. The architecture of our proposed model.
Figure 3. The foreground depth prediction model architecture.
Figure 4. Distribution map of the areas of image segments.
Figure 5. A visualization of the results from our background depth prediction model using the different attention mechanisms: AM, SE, and CBAM.
Figure 6. A visualization of the results from our foreground depth prediction model using the different attention mechanisms: AM, SE, and CBAM.
Figure 7. A visualization of the results from the different stitching strategies: direct stitching and CNN stitching.
Figure 8. A visualization of the comparison results.
Table 1. The evaluation of our background depth prediction model using the different attention mechanisms.

| Attention Mechanism | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| No Attention Mechanism | 0.949 | 0.991 | 0.998 | 0.063 | 0.289 | 2.986 | 0.102 |
| AM | 0.948 | 0.991 | 0.998 | 0.062 | 0.291 | 3.102 | 0.102 |
| SE | 0.951 | 0.992 | 0.998 | 0.062 | 0.278 | 2.977 | 0.101 |
| CBAM | 0.922 | 0.984 | 0.996 | 0.073 | 0.368 | 3.434 | 0.122 |
Table 2. The evaluation of our foreground depth prediction model using the different attention mechanisms.

| Attention Mechanism | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| No Attention Mechanism | 0.947 | 0.991 | 0.996 | 0.069 | 0.105 | 0.860 | 0.097 |
| AM | 0.969 | 0.993 | 0.996 | 0.057 | 0.080 | 0.797 | 0.086 |
| SE | 0.967 | 0.993 | 0.996 | 0.059 | 0.088 | 0.810 | 0.086 |
| CBAM | 0.965 | 0.993 | 0.996 | 0.058 | 0.087 | 0.808 | 0.086 |
Table 3. The evaluation of our depth map stitching strategies.

| Stitching Strategy | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| Direct Stitching | 0.963 | 0.995 | 0.999 | 0.058 | 0.198 | 2.395 | 0.089 |
| CNN Stitching | 0.963 | 0.995 | 0.999 | 0.058 | 0.198 | 2.400 | 0.090 |
Table 4. A comparison between the selected methods on images that contained foreground regions.

| Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| LAPD [14] | 0.968 | 0.995 | 0.999 | 0.056 | 0.182 | 2.281 | 0.087 |
| Eigen [2] | 0.652 | 0.880 | 0.952 | 0.221 | 1.624 | 6.248 | 0.293 |
| DORN [7] | 0.935 | 0.983 | 0.993 | 0.085 | 0.354 | 2.852 | 0.129 |
| VNL [40] | 0.938 | 0.989 | 0.997 | 0.068 | 0.301 | 3.296 | 0.116 |
| SACRF [35] | 0.812 | 0.940 | 0.979 | 0.146 | 1.013 | 4.774 | 0.213 |
| BTS [12] | 0.961 | 0.994 | 0.999 | 0.055 | 0.218 | 2.623 | 0.092 |
| VISTA [47] | 0.964 | 0.995 | 0.999 | 0.055 | 0.190 | 2.357 | 0.089 |
| Ours | 0.969 | 0.996 | 0.999 | 0.054 | 0.173 | 2.256 | 0.085 |
Table 5. The overall comparison between the selected methods on the entire KITTI dataset.

| Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|---|---|
| LAPD [14] | 0.962 | 0.994 | 0.999 | 0.059 | 0.212 | 2.446 | 0.091 |
| Eigen [2] | 0.688 | 0.920 | 0.964 | 0.193 | 1.371 | 5.976 | 0.265 |
| DORN [7] | 0.938 | 0.986 | 0.995 | 0.080 | 0.334 | 2.920 | 0.120 |
| VNL [40] | 0.937 | 0.989 | 0.997 | 0.072 | 0.319 | 3.384 | 0.118 |
| SACRF [35] | 0.827 | 0.951 | 0.984 | 0.132 | 0.896 | 4.717 | 0.195 |
| BTS [12] | 0.956 | 0.993 | 0.998 | 0.059 | 0.241 | 2.756 | 0.096 |
| VISTA [47] | 0.959 | 0.993 | 0.999 | 0.059 | 0.212 | 2.462 | 0.092 |
| Ours | 0.963 | 0.995 | 0.999 | 0.058 | 0.198 | 2.395 | 0.089 |
Table 6. A comparison of the execution times of the selected methods.

| Method | Average Execution Time per Image (s) |
|---|---|
| LAPD [14] | 0.41 |
| Eigen [2] | 0.14 |
| DORN [7] | - |
| VNL [40] | 0.5 |
| SACRF [35] | - |
| BTS [12] | 0.27 |
| VISTA [47] | 0.21 |
| Ours | 2.58 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
