Real-Time 3D Multi-Object Detection and Localization Based on Deep Learning for Road and Railway Smart Mobility
Figure 1. Illustration of our 3D bounding box detector. A single RGB image is used as the input to our method; shared convolutional features are extracted by the network backbone, Darknet-53. We leverage the proven 2D object detector YOLOv3 to predict the RoIs and object classes, then extract the RoI features using the feature alignment from [14]. The 3D bounding box parameters are predicted by our CNN's parameter-prediction head, and finally the 3D bounding box is drawn on the image.

Figure 2. Illustration of the object azimuth θ and its observed orientation α. The local orientation is retrieved by computing the angle between the X-axis of the camera and the normal to the ray joining the camera and the object center. Since we use left-hand coordinates, the rotation is clockwise. Our method estimates the observed orientation, and θ can be obtained using Equation (2).
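The relation between the observed (local) orientation α and the global azimuth θ sketched in the Figure 2 caption can be illustrated as follows. The function name and the exact sign convention are assumptions for illustration, chosen to match the left-hand, clockwise-rotation setup described above; the paper's Equation (2) defines the authoritative form.

```python
import math

def global_azimuth(alpha: float, x: float, z: float) -> float:
    """Recover the global azimuth theta from the observed orientation
    alpha and the object center (x, z) in camera coordinates.

    The ray angle between the camera's optical axis and the line of
    sight to the object is atan2(x, z); adding it to the local
    orientation yields the global azimuth (sign convention assumed).
    """
    return alpha + math.atan2(x, z)
```

For an object straight ahead (x = 0), the ray angle is zero and the observed orientation equals the azimuth; the correction only matters for off-axis objects.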
Figure 3. In these graphs, we compare the orientation score obtained by our method (with or without ground-truth boxes) on the different dataset classes; we also include the results of the "3D joint monocular" method (which also uses ground-truth boxes). Our method has a lower orientation score when Ground-truth (GT) bounding boxes are not used, because with GT boxes the boxes used for feature alignment at inference are the same as those used during training.

Figure 4. In these graphs, we compare the depth RMSE obtained by our method (with or without ground-truth boxes) on the different dataset classes; we also include the results of the "3D joint monocular" method (which also uses ground-truth boxes). Our method obtains a higher RMSE when GT bounding boxes are not used, for the same reason: with GT boxes, the boxes used for feature alignment at inference match those used during training. Accuracy also drops significantly on smaller classes such as bicycles and people when RoIs are predicted with YOLOv3 instead of taken from the ground truth, because variations in the predicted RoIs have a greater impact on small objects than on larger classes such as cars.

Figure 5. Qualitative results of our method on the KITTI and our GTAV datasets. These images were extracted from the validation split of each dataset. The RoIs used for predicting the 3D bounding box parameters were computed by YOLOv3. Top four rows: GTAV dataset (left column: ground truth; right column: prediction); bottom four rows: KITTI dataset (left column: ground truth; right column: prediction).
Abstract
1. Introduction
- A new method for real-time, multi-class 3D object detection that is trainable end-to-end. We leverage the popular real-time object detector You Only Look Once (YOLO) v3 to predict the regions of interest (RoIs) used in our network. Our method makes the following predictions:
- Object 2D detection;
- Object distance from the camera;
- Object 3D centers projected on the image plane;
- Object 3D dimension and orientation.
With these predictions, our method can draw 3D bounding boxes around the detected objects.
- A new photo-realistic virtual multi-modal dataset for 3D detection in road and railway environments, built using the video game Grand Theft Auto (GTA) V. The GTAV dataset includes images taken from the point of view of both cars and trains.
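The predicted parameters listed above (3D center, dimensions, orientation, and distance) fully determine a 3D bounding box in the image. As an illustration of that geometry, the sketch below builds the eight box corners in camera coordinates and projects them with a pinhole model; the function names, axis conventions, and intrinsics are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def box_corners_3d(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates.

    center: (x, y, z) box center; dims: (h, w, l) height, width,
    length; yaw: rotation about the vertical (Y) axis.
    """
    h, w, l = dims
    # Corner offsets around the box center before rotation.
    xs = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2
    ys = np.array([ h,  h, -h, -h,  h,  h, -h, -h]) / 2
    zs = np.array([ w, -w,  w, -w,  w, -w,  w, -w]) / 2
    # Rotation matrix about the Y axis.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    corners = rot @ np.vstack([xs, ys, zs])
    return corners + np.asarray(center, float).reshape(3, 1)

def project(points, fx, fy, cx, cy):
    """Pinhole projection of 3xN camera-space points to pixel coords."""
    u = fx * points[0] / points[2] + cx
    v = fy * points[1] / points[2] + cy
    return np.vstack([u, v])
```

Drawing the box then amounts to connecting the twelve projected edges; the network only has to regress the handful of parameters fed into `box_corners_3d`.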
2. Related Work
2.1. 3D Object Detection from a Depth Sensor
2.2. 3D Object Detection from Monocular Images
3. Our Realtime 3D Multi-Object Detection and Localization Network
3.1. 3D Bounding Box Estimation
3.2. Parameters to Regress
3.3. Losses
4. Experimental Results
4.1. Training Details
4.2. Evaluation
4.3. Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ADAS | Advanced driver-assistance system |
| GTA | Grand Theft Auto |
| ToF | Time of flight |
| CNN | Convolutional neural network |
| MLP | Multi-layer perceptron |
| PointRGCN | Pipeline based on graph convolutional networks |
| RoIs | Regions of interest |
| RPN | Region proposal network |
| NMS | Non-maximum suppression |
| GCN | Graph convolutional network |
| ARE | Absolute relative error |
| RMSE | Root-mean-square error |
| BMP | Bad matching pixels |
| DS | Dimension score |
| CS | Center score |
| OS | Orientation score |
| YOLO | You only look once |
| LiDAR | Light detection and ranging |
| GT | Ground truth |
References
- Zhao, M.; Mammeri, A.; Boukerche, A. Distance measurement system for smart vehicles. In Proceedings of the 2015 7th International Conference on New Technologies, Mobility and Security (NTMS), Paris, France, 27–29 July 2015; pp. 1–5. [Google Scholar]
- Palacín, J.; Pallejà, T.; Tresanchez, M.; Sanz, R.; Llorens, J.; Ribes-Dasi, M.; Masip, J.; Arno, J.; Escola, A.; Rosell, J.R. Real-time tree-foliage surface estimation using a ground laser scanner. IEEE Trans. Instrum. Meas. 2007, 56, 1377–1383. [Google Scholar] [CrossRef] [Green Version]
- Kang, B.; Kim, S.J.; Lee, S.; Lee, K.; Kim, J.D.; Kim, C.Y. Harmonic distortion free distance estimation in ToF camera. In Three-Dimensional Imaging, Interaction, and Measurement; International Society for Optics and Photonics: Bellingham, WA, USA, 2011; Volume 7864, p. 786403. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv 2017, arXiv:cs.CV/1612.00593. [Google Scholar]
- Zarzar, J.; Giancola, S.; Ghanem, B. PointRGCN: Graph Convolution Networks for 3D Vehicles Detection Refinement. arXiv 2019, arXiv:cs.CV/1911.12236. [Google Scholar]
- Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:cs.GR/1512.03012. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. arXiv 2018, arXiv:cs.CV/1712.02294. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156. [Google Scholar] [CrossRef]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. arXiv 2017, arXiv:cs.CV/1612.00496. [Google Scholar]
- Hu, H.N.; Cai, Q.Z.; Wang, D.; Lin, J.; Sun, M.; Krähenbühl, P.; Darrell, T.; Yu, F. Joint Monocular 3D Vehicle Detection and Tracking. arXiv 2019, arXiv:cs.CV/1811.10742. [Google Scholar]
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image. arXiv 2017, arXiv:cs.CV/1703.07570. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. Available online: https://github.com/facebookresearch/detectron2 (accessed on 29 September 2019).
- Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In European Conference on Computer Vision (ECCV); Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; Volume 9906, pp. 102–118. [Google Scholar]
- Smith, L.N.; Topin, N. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv 2018, arXiv:cs.LG/1708.07120. [Google Scholar]
- Zendel, O.; Murschitz, M.; Zeilinger, M.; Steininger, D.; Abbasi, S.; Beleznai, C. RailSem19: A dataset for semantic rail scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 1221–1229. [Google Scholar]
| Dataset | Method | AP | R | mAP | F1 | Abs Rel | SRE | RMSE (m) | log RMSE | δ₁ | δ₂ | δ₃ | DS | CS | OS | VRAM Usage | Inference Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KITTI | Ours (w GT RoIs) | - | - | - | - | 0.096 | 0.307 | 2.96 | 0.175 | 0.941 | 0.980 | 0.988 | 0.847 | 0.983 | 0.753 | 3.22 GB | 10 ms |
| KITTI | Ours (w/o GT RoIs) | 0.469 | 0.589 | 0.5 | 0.517 | 0.199 | 2.32 | 5.89 | 0.310 | 0.823 | 0.934 | 0.960 | 0.853 | 0.951 | 0.765 | 3.22 GB | 10 ms |
| KITTI | Joint Monocular 3D [11] | - | - | - | - | 0.074 | 0.449 | 2.847 | 0.126 | 0.954 | 0.980 | 0.987 | 0.962 | 0.918 | 0.974 | 5.64 GB | 97 ms |
| GTA | Ours (w GT RoIs) | - | - | - | - | 0.069 | 0.420 | 4.60 | 0.100 | 0.965 | 0.992 | 0.999 | 0.886 | 0.999 | 0.961 | 3.22 GB | 10 ms |
| GTA | Ours (w/o GT RoIs) | 0.533 | 0.763 | 0.632 | 0.61 | 0.207 | 4.92 | 12.3 | 0.319 | 0.781 | 0.897 | 0.942 | 0.853 | 0.846 | 0.845 | 3.22 GB | |
| Dataset | Class | Method | AP | R | mAP | F1 | Abs Rel | SRE | RMSE (m) | log RMSE | δ₁ | δ₂ | δ₃ | DS | CS | OS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KITTI | Car | Ours (w GT RoIs) | - | - | - | - | 0.0957 | 0.312 | 3.05 | 0.18 | 0.941 | 0.980 | 0.988 | 0.872 | 0.981 | 0.781 |
| KITTI | Car | Ours (w/o GT RoIs) | 0.558 | 0.879 | 0.783 | 0.682 | 0.163 | 1.29 | 5.28 | 0.271 | 0.839 | 0.948 | 0.971 | 0.867 | 0.962 | 0.783 |
| KITTI | Person | Ours (w GT RoIs) | - | - | - | - | 0.0979 | 0.254 | 2.01 | 0.154 | 0.935 | 0.983 | 0.992 | 0.705 | 0.993 | 0.584 |
| KITTI | Person | Ours (w/o GT RoIs) | 0.508 | 0.518 | 0.464 | 0.513 | 0.424 | 7.92 | 7.70 | 0.485 | 0.729 | 0.844 | 0.892 | 0.719 | 0.885 | 0.616 |
| KITTI | Cyclist | Ours (w GT RoIs) | - | - | - | - | 0.0979 | 0.359 | 3.57 | 0.150 | 0.940 | 0.979 | 0.988 | 0.809 | 0.997 | 0.738 |
| KITTI | Cyclist | Ours (w/o GT RoIs) | 0.342 | 0.371 | 0.253 | 0.356 | 1.04 | 30.9 | 17.4 | 0.797 | 0.415 | 0.622 | 0.719 | 0.812 | 0.678 | 0.542 |
| GTA | Car | Ours (w GT RoIs) | - | - | - | - | 0.0552 | 0.385 | 4.74 | 0.0761 | 0.990 | 0.999 | 0.999 | 0.860 | 0.999 | 0.993 |
| GTA | Car | Ours (w/o GT RoIs) | 0.477 | 0.915 | 0.778 | 0.627 | 0.129 | 2.28 | 9.59 | 0.217 | 0.873 | 0.938 | 0.965 | 0.812 | 0.894 | 0.905 |
| GTA | Truck | Ours (w GT RoIs) | - | - | - | - | 0.0454 | 0.178 | 3.22 | 0.0576 | 0.994 | 1.00 | 1.00 | 0.871 | 0.999 | 0.989 |
| GTA | Truck | Ours (w/o GT RoIs) | 0.312 | 0.880 | 0.617 | 0.461 | 0.259 | 6.99 | 19.2 | 0.455 | 0.642 | 0.78 | 0.868 | 0.736 | 0.622 | 0.681 |
| GTA | Motorcycle | Ours (w GT RoIs) | - | - | - | - | 0.0623 | 0.266 | 3.84 | 0.0753 | 1.00 | 1.00 | 1.00 | 0.918 | 1.00 | 0.967 |
| GTA | Motorcycle | Ours (w/o GT RoIs) | 0.614 | 0.764 | 0.725 | 0.681 | 0.186 | 3.58 | 13.7 | 0.281 | 0.691 | 0.878 | 0.967 | 0.901 | 0.689 | 0.909 |
| GTA | Person | Ours (w GT RoIs) | - | - | - | - | 0.116 | 0.612 | 4.56 | 0.159 | 0.877 | 0.966 | 1.00 | 0.963 | 0.997 | 0.856 |
| GTA | Person | Ours (w/o GT RoIs) | 0.263 | 0.659 | 0.488 | 0.375 | 0.372 | 10.5 | 15.1 | 0.453 | 0.620 | 0.834 | 0.903 | 0.961 | 0.812 | 0.735 |
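The distance columns in the tables above (Abs Rel, SRE, RMSE (m), log RMSE, plus three accuracy columns that are presumably the standard δ < 1.25ᵏ thresholds used in monocular depth evaluation) follow common depth-metric definitions. A minimal sketch of those definitions, with names chosen here for illustration:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over positive depths (meters).

    Returns (abs_rel, sre, rmse, log_rmse, [delta1, delta2, delta3]),
    where delta_k is the fraction of predictions whose ratio to the
    ground truth is within 1.25**k.
    """
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    abs_rel = np.mean(np.abs(pred - gt) / gt)          # absolute relative error
    sre = np.mean((pred - gt) ** 2 / gt)               # squared relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    log_rmse = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)]
    return abs_rel, sre, rmse, log_rmse, deltas
```

Under these definitions, lower is better for the first four metrics and higher (closer to 1) is better for the δ accuracies, which matches the trends the tables show with and without GT RoIs.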
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mauri, A.; Khemmar, R.; Decoux, B.; Haddad, M.; Boutteau, R. Real-Time 3D Multi-Object Detection and Localization Based on Deep Learning for Road and Railway Smart Mobility. J. Imaging 2021, 7, 145. https://doi.org/10.3390/jimaging7080145
Mauri A, Khemmar R, Decoux B, Haddad M, Boutteau R. Real-Time 3D Multi-Object Detection and Localization Based on Deep Learning for Road and Railway Smart Mobility. Journal of Imaging. 2021; 7(8):145. https://doi.org/10.3390/jimaging7080145
Chicago/Turabian Style: Mauri, Antoine, Redouane Khemmar, Benoit Decoux, Madjid Haddad, and Rémi Boutteau. 2021. "Real-Time 3D Multi-Object Detection and Localization Based on Deep Learning for Road and Railway Smart Mobility" Journal of Imaging 7, no. 8: 145. https://doi.org/10.3390/jimaging7080145