Supervised Object-Specific Distance Estimation from Monocular Images for Autonomous Driving
Figure 1. (a) A sample image from the dataset with the bounding boxes obtained with YOLOv5; (b) its corresponding LIDAR point cloud; (c) the depth map generated by Monodepth2. Ground-truth distances and Monodepth2 object distances are shown in blue boxes.
Figure 2. The CDR model topology. The model was trained with a batch size of 512, a learning rate of 0.0005, and the Adam optimizer. The version of the network shown corresponds to the “base” decoder function, using a top output fully-connected layer with four neurons. The model was implemented in PyTorch.
Figure 3. Experimental results for different CDR model settings. The left-hand plots show the MAE values between the model’s predictions and the ground-truth distance values, while the right-hand plots display the MAE values distance-wise. Each plot features the baseline (Monodepth2) MAE values; weighted and unweighted average MAE values are shown in the top left corners of the distance plots. (a) The impact of different downsampling strategies on the model’s performance; (b) the impact of the loss function choice; (c) the impact of the λ_c value.
Figure 4. Experimental results for different CDR model settings (continuation). (a) The impact of the decoder function choice on the model’s performance; (b) the impact of the β value.
Figure 5. Examples demonstrating the proposed model’s performance on several samples from the validation and testing sets: (a) ground-truth LIDAR point clouds; (b) corresponding Monodepth2-generated depth maps; (c) distances predicted by the CDR model.
Abstract
1. Introduction
- A convolutional neural network model capable of extracting distance information from the dimensions (represented by bounding boxes) and the appearance of road agents while retaining a relatively low number of parameters and a high inference speed;
- A data-preprocessing technique that forces the network to learn object-specific features and further increases its response rate;
- An optics-based method of converting the proposed network’s outputs to metric distance values that takes bounding box uncertainties into account;
- A comprehensive study of the data augmentation and sampling techniques, as well as structural settings, that can improve the model’s performance on a heavily imbalanced dataset.
2. Materials and Methods
2.1. Preprocessing and Analysis of the KITTI Dataset
- (1) Bounding box generation. This step is performed using the YOLOv5 object detection model. As the model proposed here is intended for use on autonomous vehicles, only the road agent object classes are considered, namely “person”, “bicycle”, “car”, “motorcycle”, “bus”, “train”, and “truck” (for simplicity, these classes are referred to in the experimental section as C0, C1, …, C6). An image sample with the corresponding bounding boxes can be seen in Figure 1a.
- (2) Ground-truth preprocessing. The sparse LIDAR point clouds provided as raw ground truth in the KITTI dataset are converted into distance values by averaging the distance values that fall within the object bounding boxes applied to the corresponding point clouds. It is important to note that point-cloud sparsity affects the precision of the averaging procedure; to address this, only the distance values less than or equal to half the mean distance value are considered during ground-truth distance generation. An example of a point cloud with overlaid bounding boxes is shown in Figure 1b.
- (3) The same averaging process is applied to the depth maps generated with Monodepth2; these values are used later for comparison during the proposed model’s performance evaluation. Monodepth2 was trained in mono+stereo mode, which enables conversion from disparity maps to depth maps according to Equation (1). A sketch of the averaging and conversion steps is given after this list.
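Equation (1) is not reproduced in this excerpt; for a model trained in mono+stereo mode, the usual conversion is the stereo relation depth = f · B / disparity. The sketch below illustrates the bounding-box averaging described above under that assumption. The camera constants (`FOCAL_PX`, `BASELINE_M`) and the empty-selection fallback are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

# Hypothetical camera constants (the paper uses the KITTI calibration files instead).
FOCAL_PX = 721.5     # focal length in pixels
BASELINE_M = 0.54    # stereo baseline in metres

def disparity_to_depth(disparity: np.ndarray) -> np.ndarray:
    """Stereo relation assumed for Equation (1): depth = f * B / disparity."""
    return FOCAL_PX * BASELINE_M / np.clip(disparity, 1e-6, None)

def box_distance(distances: np.ndarray) -> float:
    """Average the distance values that fall inside one bounding box.

    `distances` holds the LIDAR (or depth-map) distances projected into the box.
    Following the text, only values less than or equal to half the mean distance
    are kept before averaging, which suppresses the effect of point-cloud sparsity.
    """
    valid = distances[distances > 0]            # drop empty pixels / missing returns
    if valid.size == 0:
        return float("nan")
    kept = valid[valid <= 0.5 * valid.mean()]   # filtering rule as stated in the text
    return float(kept.mean()) if kept.size else float(valid.mean())  # fallback is an assumption
```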
- (1) Random oversampling of the minority classes with subsequent augmentation. This procedure was performed in all experiments; it consists of randomly copying members of the minority classes (until the volume of the respective class is enlarged by a factor of 2) and transforming the copies by randomly applying color jittering, Gaussian blur, and sharpness adjustment (all transformations were imported from the torchvision module of the PyTorch ecosystem [16]; augmentation parameters: ColorJitter—brightness = 0.4, contrast = 0.4, saturation = 0.4, hue = 0.2; GaussianBlur—kernel_size = 3, sigma = (0.1, 2); RandomAdjustSharpness—sharpness_factor = 0.7, p = 0.5; all other parameters are left at their defaults). A sketch of this oversampling and augmentation pipeline is given after this list.
- (2) Random downsampling of the majority class. In the experiments that use this technique, members of the majority class are randomly excluded until the class is reduced to a fixed number of samples (in all experiments, this number was empirically determined to be 7000).
- (3) Near-Miss downsampling [17] of the majority class. Unlike random downsampling, this method is deterministic and relies on a distance measure to decide which members of the majority class to retain. Instead of the widely used Euclidean distance, the average structural similarity (SSIM) [18] is used as the distance measure between members of the majority and minority classes, motivated by its better ability to capture similarities between image samples. Two versions of the algorithm are considered: 7000 car samples with the highest average SSIM towards their ten closest (first version, referred to as NM1) or farthest (NM2) neighbors in the minority classes are retained.
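A minimal sketch of the oversampling and augmentation step, using the torchvision parameters quoted above. The composition order, the 0.5 application probabilities for ColorJitter and GaussianBlur, and the (image, distance) sample structure are assumptions, not taken from the paper.

```python
import random
import torchvision.transforms as T

# Augmentations with the parameters quoted in the text; wrapping ColorJitter and
# GaussianBlur in RandomApply (p = 0.5) is an assumption about "randomly applying" them.
augment = T.Compose([
    T.RandomApply([T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2)], p=0.5),
    T.RandomApply([T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0))], p=0.5),
    T.RandomAdjustSharpness(sharpness_factor=0.7, p=0.5),
])

def oversample_minority(samples, factor=2):
    """Randomly copy minority-class samples until the class is enlarged by `factor`,
    augmenting each copy. `samples` is assumed to be a list of (PIL image, distance) pairs."""
    if not samples:
        return samples
    extra = []
    while len(samples) + len(extra) < factor * len(samples):
        img, dist = random.choice(samples)
        extra.append((augment(img), dist))
    return samples + extra
```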
2.2. Convolutional Object-Specific Depth Regression Model Topology
- (1) Preprocessing block: This part of the model performs image downsizing (in all experiments, the images are downscaled to a lower resolution, allowing the use of a smaller feature extractor and consequently increasing computational speed) and object-centered cropping, i.e., extracting a square fraction of the input image, with a given side length, centered at the midpoint of the bounding box. The side length of the square window is defined as a fraction of the smallest dimension of the input image, rounded up (see Equation (2)).
- (2) Feature extraction block: To extract features from the cropped input images, a pretrained convolutional neural network is used, namely the ConvNeXt architecture [19]. This model has been shown to achieve state-of-the-art results on the ImageNet benchmark and is available in multiple configurations suited to inputs of different sizes. We used the small version as the backbone of the CDR model, taking advantage of the small input dimensions ensured by the cropping performed within the preprocessing block.
- (3) Regression block: This part of the model consists of seven fully-connected layers with ReLU activation functions. Five of these layers are positioned sequentially and gradually reduce the feature vector’s dimensionality. The last two layers operate in parallel: one of them yields one or four values (depending on the decoder block type), while the other is used to predict the standard deviation of the output distance. The latter layer is initialized with Gaussian-distributed random weights (with mean μ and standard deviation σ) and has no bias term; it is used when the network is trained with the Kullback–Leibler divergence loss function, as discussed below.
- (4) Decoding block: This block converts the output activations of the neural network into metric distance values. We considered two versions of the decoder: a “dummy” decoder, in which a single output neuron of the last fully-connected layer yields the predicted distance value directly, and the optics-based decoder function (“base” decoder) defined in Equation (3), where f_x and f_y are the camera’s focal parameters, w and h stand for the bounding box dimensions, and a_1, …, a_4 are the output activations of the neural network. It is evident from the defined distance function that a_1 and a_2 serve as predicted object dimensions in the world reference frame, while a_3 and a_4 serve as correction terms to the bounding box dimensions. This decoding scheme provides the network with explicit information about the nature of the task being solved and therefore increases training efficiency (a sketch of the cropping and decoding computations is given after this list).
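Equations (2) and (3) are not reproduced in this excerpt. The sketch below shows the crop-window computation as described in the text and one plausible form of the optics-based decoder, assuming the pinhole relation distance ≈ focal length × object size / box size; the paper’s exact Equation (3) may combine the two image axes differently. The symbols fx, fy and a_1–a_4 follow the notation used above.

```python
import math
import torch

def crop_window_side(img_h: int, img_w: int, cfactor: float = 0.3) -> int:
    """Equation (2) as described in the text: the crop side length is a fraction
    (cfactor) of the smallest image dimension, rounded up."""
    return math.ceil(cfactor * min(img_h, img_w))

def base_decoder(a: torch.Tensor, w: torch.Tensor, h: torch.Tensor,
                 fx: float, fy: float) -> torch.Tensor:
    """A plausible form of the optics-based decoder (Equation (3)) - NOT the paper's exact formula.

    a    : (N, 4) network outputs; a[:, 0:2] act as predicted object dimensions in the
           world frame, a[:, 2:4] as correction terms to the bounding box dimensions.
    w, h : (N,) bounding box width and height in pixels.
    Under the pinhole model, distance ~ f * real_size / image_size; here the two axes
    are simply averaged, which is an assumption.
    """
    d_w = fx * a[:, 0] / (w + a[:, 2]).clamp(min=1e-6)
    d_h = fy * a[:, 1] / (h + a[:, 3]).clamp(min=1e-6)
    return 0.5 * (d_w + d_h)
```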
2.3. Loss Functions
- Classwise imbalance of the dataset. Despite the balancing procedures performed during the data preprocessing stage, further balancing may be needed during the model’s training.
- Randomness introduced by oversampling, downsampling, and imperfect ground-truth data. The errors accumulated due to these factors may strongly affect the standard deviation of the model’s predictions, reducing its overall robustness. A sketch of one possible weighted, uncertainty-aware loss addressing both factors is given after this list.
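The weighted MSE and WKLD loss functions are not written out in this excerpt. The sketch below combines the effective-number-of-samples class weights of Cui et al. [21], controlled by β, with a heteroscedastic regression term in the spirit of Kendall and Gal [22] that uses the standard deviation predicted by the parallel output layer. This is one plausible reading, not the paper’s exact WKLD formulation, and treating the parallel head as predicting log σ is an assumption made for numerical stability.

```python
import torch

def class_weights(samples_per_class: torch.Tensor, beta: float = 0.99) -> torch.Tensor:
    """Effective-number-of-samples weights (Cui et al.): w_c = (1 - beta) / (1 - beta^n_c),
    normalised so the weights sum to the number of classes."""
    effective = 1.0 - torch.pow(beta, samples_per_class.float())
    w = (1.0 - beta) / effective
    return w / w.sum() * len(samples_per_class)

def weighted_uncertainty_loss(pred: torch.Tensor, log_sigma: torch.Tensor,
                              target: torch.Tensor, weights: torch.Tensor,
                              labels: torch.Tensor) -> torch.Tensor:
    """One plausible weighted, uncertainty-aware regression loss: a per-sample Gaussian
    negative log-likelihood scaled by the class weight of that sample's label."""
    nll = 0.5 * torch.exp(-2.0 * log_sigma) * (pred - target) ** 2 + log_sigma
    return (weights[labels] * nll).mean()
```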
3. Results
3.1. Hyperparameter Optimization
- The number of fully connected layers in the regression block: The dimensionality of the feature vector yielded by the ConvNeXt model (784) has to be reduced to 4 in the output layer. It was observed during hyperparameter optimization that a sequence of five fully connected layers with gradually decreasing output dimensions (from 512 down to 32) resulted in lower MAE values than the other configurations included in the search space (a sketch of this regression head is given after this list);
- Learning rate: A learning rate lower than 0.0005 resulted in ineffective training, while higher values increased the output error;
- Batch size: A batch size of 512 is a compromise between accuracy and computational complexity; increasing it further did not show significant accuracy improvements while requiring much more GPU memory;
- Majority class size after downsampling: A sample count of less than 7000 resulted in underfitting, while a larger sample count led to decreased performance on the minority classes.
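A minimal sketch of the regression block implied by these settings. Only the input width (784), the first hidden width (512), the last hidden width (32), and the two parallel heads are stated in the text; the intermediate widths (256, 128, 64) and the initialization parameters are assumptions.

```python
import torch
import torch.nn as nn

class RegressionBlock(nn.Module):
    """Five FC layers reducing 784 -> 32, followed by two parallel heads:
    four decoder activations and a bias-free standard-deviation output."""
    def __init__(self):
        super().__init__()
        dims = [784, 512, 256, 128, 64, 32]     # 256/128/64 are assumed, not stated
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.head_decoder = nn.Linear(32, 4)                # a_1..a_4 for the "base" decoder
        self.head_sigma = nn.Linear(32, 1, bias=False)      # used with the WKLD loss
        nn.init.normal_(self.head_sigma.weight, mean=0.0, std=0.01)  # assumed init parameters

    def forward(self, features: torch.Tensor):
        x = self.trunk(features)
        return self.head_decoder(x), self.head_sigma(x)
```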
3.2. Experiments
- (1) Downsampling strategy. Random downsampling and Near-Miss downsampling are compared, with all other model settings fixed (MSE loss, “base” decoder). The results are presented in Figure 3a. As can be seen from the class-wise plot, the random downsampling strategy, while performing significantly worse for class C5 (“train”), performs similarly or better for the other classes. The distance-wise plot likewise shows similar performance. In terms of metrics, the weighted MAE is lowest for the random downsampling method, albeit with the highest standard deviation. Considering these facts, the random downsampling strategy is used in the subsequent experiments (a sketch of the weighted and unweighted MAE computation is given after this list).
- (2) Loss function. The weighted MSE loss and the weighted Kullback–Leibler divergence (WKLD) loss are compared, with all other model settings fixed (random downsampling, “base” decoder). While both loss functions perform similarly (see Figure 3b), the WKLD loss yields a slightly lower weighted MAE value with a significantly lower standard deviation, most likely because it is designed to minimize deviations in the model’s predictions.
- (3) Crop factor (cfactor). Two crop factor values are compared, with all other model settings fixed (WKLD loss, random downsampling, “base” decoder). From the plots in Figure 3c, it is evident that the lower crop factor value of 0.3 is optimal for the CDR model.
- (4) Decoder function. The “base” (optics-based) decoder and the “dummy” decoder (single output node) are compared, with all other model settings fixed (WKLD loss, random downsampling). The plots in Figure 4a clearly indicate that the optics-based decoder performs significantly better, validating the hypothesis that it simplifies the training process for the model.
- (5) β value. Several β values for the effective-number-of-samples weighting are compared, with all other model settings fixed (WKLD loss, random downsampling, “base” decoder). The MAE values in Figure 4b suggest that the optimal values are β = 0.9 or β = 0.99, with β = 0.99 performing slightly better in terms of standard deviation.
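The exact definition of the weighted MAE is not given in this excerpt; the sketch below assumes per-class MAE values averaged with weights proportional to class frequency (weighted) and uniformly (unweighted). The paper may define the weighting differently.

```python
import numpy as np

def classwise_mae(pred, target, labels, num_classes=7):
    """Per-class MAE plus weighted (by class frequency) and unweighted averages.
    The weighting scheme is an assumption, not quoted from the paper."""
    pred, target, labels = map(np.asarray, (pred, target, labels))
    maes, counts = [], []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            maes.append(np.abs(pred[mask] - target[mask]).mean())
            counts.append(mask.sum())
    maes, counts = np.array(maes), np.array(counts)
    return {
        "per_class": maes,
        "weighted": float((maes * counts).sum() / counts.sum()),
        "unweighted": float(maes.mean()),
    }
```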
3.3. Response Speed Evaluation
4. Discussion
- Trains are significantly less common in the road-agent setting and can arguably be omitted as a relevant object class. The lack of representation of this class makes the errors of an object-specific model higher than those of the Monodepth2 model, which is trained in a self-supervised manner based on pose estimation and is therefore independent of the sufficient presence of any specific object type in the dataset. This is a fundamental limitation of any object-based approach, and it can be mitigated by further augmenting the data with artificially generated samples.
- The same argument is valid for objects at greater distances. However, the lack of representation of such objects in the dataset is most likely due to the limitations of the object detection model; therefore, this problem can be addressed by adjusting its confidence threshold or by switching to a better-performing alternative.
5. Conclusions
- In Section 3, it was shown that:
  - (1) The best-performing downsampling strategy is random downsampling;
  - (2) The weighted Kullback–Leibler divergence (WKLD) loss function results in lower MAE values and significantly reduces the error standard deviation;
  - (3) Effective β values for the effective-number-of-samples weighting lie within the 0.9–0.99 range, with β = 0.99 performing slightly better;
  - (4) The optics-based (“base”) decoder function definitively outperforms the “dummy” decoder with a single output node;
  - (5) The CDR model in its best-performing configuration yields 15 ± 1% lower average weighted MAE values than the Monodepth2 model;
  - (6) According to the inference speed test, the CDR model runs approximately 10 times faster than Monodepth2.
- Among the model’s limitations, the following are apparent from the experiments:
  - (1) Despite the balancing procedures, the CDR model performs worse than the self-supervised model on the least represented object classes (i.e., “train”). This is evident from the unweighted MAE metric, which is 6% higher on average for the proposed model. This limitation can potentially be mitigated by using synthetic data;
  - (2) The error values of the CDR model increase as the distance grows. This is partly caused by the limitations of the object detector, so improving this part of the pipeline is likely to improve the results;
  - (3) Globally, the model’s error values do not fall substantially below the 2 m mark, which may be a major drawback in scenarios where an autonomous vehicle operates in dense traffic.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
MAE | Mean Absolute Error |
LPG | Local Planar Guidance |
DFE | Dense Feature Extractor |
CFE | Contextual Feature Extractor |
SSIM | Structural Similarity Index Measure |
NM1 | Near-Miss Algorithm (version 1) |
NM2 | Near-Miss Algorithm (version 2) |
RND | Random (downsampling) |
MSE Loss | Mean Squared Error Loss |
KL-Loss | Kullback–Leibler Loss |
WKLD Loss | Weighted Kullback–Leibler Divergence Loss |
cfactor | Crop factor |
References
- Godard, C.; Mac Aodha, O.; Brostow, G. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
- Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2485–2494.
- Zhao, S.; Fu, H.; Gong, M.; Tao, D. Geometry-aware symmetric domain adaptation for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9788–9798.
- Khan, F.; Salahuddin, S.; Javidnia, H. Deep learning-based monocular depth estimation methods—A state-of-the-art review. Sensors 2020, 20, 2272.
- Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637.
- Lee, J.; Han, M.; Ko, D.; Suh, I. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011.
- Chen, Y.; Zhao, H.; Hu, Z.; Peng, J. Attention-based context aggregation network for monocular depth estimation. Int. J. Mach. Learn. Cybern. 2021, 12, 1583–1596.
- Kumar, V.; Hiremath, S.; Bach, M.; Milz, S.; Witt, C.; Pinard, C.; Yogamani, S.; Mäder, P. FisheyeDistanceNet: Self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
- Griffin, B.; Corso, J. Depth from camera motion and object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Vajgl, M.; Hurtik, P.; Nejezchleba, T. Dist-YOLO: Fast object detection with distance estimation. Appl. Sci. 2022, 12, 1354.
- Zhu, J.; Fang, Y. Learning object-specific distance from a monocular image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3839–3848.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
- Mani, I.; Zhang, I. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of the Workshop on Learning from Imbalanced Datasets, ICML, Washington, DC, USA, 21 August 2003.
- Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022.
- He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Cui, Y.; Jia, M.; Lin, T.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- De Sousa Ribeiro, F.; Calivá, F.; Swainson, M.; Gudmundsson, K.; Leontidis, G.; Kollias, S. Deep Bayesian self-training. Neural Comput. Appl. 2020, 32, 600–612.
| Set | C0 | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|---|
| train | 3.15 | 2.77 | 4.47 | 2.09 | 2.37 | 1.58 | 3.03 |
| valid | 1.62 | 1.46 | 3.07 | 0.48 | 0.78 | 0.30 | 1.67 |
| test | 2.06 | 1.54 | 3.15 | 0.70 | 0.90 | 1.18 | 1.86 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).