Part-Based Obstacle Detection Using a Multiple Output Neural Network
Figures
Figure 1. The multiple-output neural network architecture.
Figure 2. The layers of the ResNet-50 encoder.
Figure 3. The layers of the proposed U-Net-based decoder.
Figure 4. The CNN layers and the direct connections from the encoder (left side) to the decoder (right side).
Figure 5. Examples of the global semantic segmentation output: the first image is the color input image of the road scene, the second image represents the drivable road area, the third represents the dynamic objects, and the fourth image shows the static objects (sidewalk or fence).
Figure 6. The artificial neural network inferring the obstacle quarters.
Figure 7. Object pixel coding, based on segmented quarters. Some pixels can belong to more than one quarter.
Figure 8. Labeled regions, based on connected pixels of the same quarter code.
Figure 9. Generation of rectangle hypotheses, based on quarters.
Figure 10. Computing the rectangle score, based on quarter matching.
Figure 11. Re-labeling the rectangle hypothesis by the best-scored overlap.
Figure 12. The final result, obtained by labeling the regions with the rectangle labels, then transferring the region labels to the individual pixels.
Figure 13. Fusing the semantic output with the quarters to improve object detection.
Figure 14. The vote-map features of the vanishing point: the left-side features (first image), the right-side features (center), and the multiplication result of the first two images (right).
Figure 15. Extracting the obstacle quarters from the original segmentation mask and 2D obstacle bounding boxes. Image source [36].
Figure 16. Comparison between Mask R-CNN, Yolact, and our method: the first column is the input image, the second column shows the ground-truth instances, the third column the Mask R-CNN results, the fourth column the Yolact results, and the fifth column our results.
Figure 17. The first column is the input image, the second column the quarter prediction (with the CNN output merged), the third column the labeling output, and the fourth column the resulting bounding boxes of the individual obstacles.
Abstract
1. Introduction
- A multiple-head network architecture that produces the geometric parts of objects (which can then be grouped into individual instances), a classification of free and occupied pixels used to refine the results, and voting maps for vanishing-point computation (a minimal sketch of such a multi-head network follows this list);
- A lightweight object-clustering algorithm, based on the results of the multiple-head semantic segmentation network;
- An automated training solution for the multiple-head network, which uses publicly available databases without the need for manual annotation of the object parts.
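To make the first contribution concrete, the following is a minimal PyTorch sketch of a multiple-output network: a shared ResNet-50 encoder feeding one lightweight decoder head per task (global semantic classes, the four obstacle quarters, and vanishing-point vote maps). The class name, head names, channel counts, and the simple decoder (which omits the U-Net-style skip connections of the actual network) are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50  # torchvision >= 0.13


class MultiHeadSegNet(nn.Module):
    """Shared ResNet-50 encoder with one small decoder per output head (illustrative only)."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to the last residual stage (output stride 32, 2048 channels).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.heads = nn.ModuleDict({
            "semantic": self._make_head(2048, 4),   # e.g., road / dynamic / static / background
            "quarters": self._make_head(2048, 4),   # TL, TR, BL, BR obstacle quarters
            "vp_votes": self._make_head(2048, 2),   # left / right vanishing-point vote maps
        })

    @staticmethod
    def _make_head(in_ch, out_ch):
        # Simple decoder: 1x1 bottleneck, 3x3 prediction, upsample back to input resolution.
        return nn.Sequential(
            nn.Conv2d(in_ch, 256, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, out_ch, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        features = self.encoder(x)
        return {name: head(features) for name, head in self.heads.items()}


if __name__ == "__main__":
    net = MultiHeadSegNet()
    outputs = net(torch.randn(1, 3, 256, 512))
    for name, out in outputs.items():
        print(name, tuple(out.shape))
```

Sharing the encoder is what keeps the extra heads cheap: each additional output only adds a small decoder on top of the common 2048-channel feature map, rather than a full second network.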
2. Related Work
2.1. Multi-Task Deep Learning
2.2. Object Detection
3. Solution Description
3.1. Feature Extraction
3.2. Decoder Structure
3.3. Global Semantic Segmentation
3.4. Obstacle Reconstruction Using Part-Based Semantic Segmentation
- Bit 0—top-left quarter;
- Bit 1—top-right quarter;
- Bit 2—bottom-left quarter;
- Bit 3—bottom-right quarter.
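To illustrate the coding above, the following NumPy sketch (with hypothetical array names) packs the four binary quarter masks into a single composite image in which every pixel carries a 4-bit code between 0 and 15; a pixel belonging to several quarters simply has several bits set.

```python
import numpy as np


def composite_code(tl, tr, bl, br):
    """Pack four binary quarter masks (same shape, values 0/1) into a 4-bit code per pixel.
    Bit 0 = top-left, bit 1 = top-right, bit 2 = bottom-left, bit 3 = bottom-right."""
    return (br.astype(np.uint8) << 3) | (bl.astype(np.uint8) << 2) \
         | (tr.astype(np.uint8) << 1) | tl.astype(np.uint8)


# Example: a pixel in both the top-left and top-right quarters gets code 0b0011 = 3.
tl = np.array([[1, 1], [0, 0]], dtype=np.uint8)
tr = np.array([[1, 0], [0, 0]], dtype=np.uint8)
bl = np.array([[0, 0], [1, 0]], dtype=np.uint8)
br = np.array([[0, 0], [0, 1]], dtype=np.uint8)
print(composite_code(tl, tr, bl, br))   # [[3 1]
                                        #  [4 8]]
```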
- The rectangle that overlaps the most pixels of the individual region is selected.
- If the rectangle is dependent on another rectangle (as seen in Figure 11), the label of the main rectangle is transferred to the individual region.
Algorithm 1 Instance segmentation

Input: TL, TR, BL, BR — top-left, top-right, bottom-left, and bottom-right binary images
Output: L — image labeled with the individual object identifiers

1. Identification of connected regions from each of the four binary images:
   TL_Regions = ConnectedLabeling(TL)
   TR_Regions = ConnectedLabeling(TR)
   BL_Regions = ConnectedLabeling(BL)
   BR_Regions = ConnectedLabeling(BR)
2. Creation of the composite type image:
   For each (x, y): Composite(x, y) = BR(x, y) << 3 | BL(x, y) << 2 | TR(x, y) << 1 | TL(x, y)
3. Labeling the composite image:
   Regions = []
   For type = 1 to 15: Regions = Regions ∪ ConnectedLabeling(Composite == type)
4. Generating the hypothetical rectangles:
   TL_Rectangles = ExpandRectangle(TL_Regions)
   TR_Rectangles = ExpandRectangle(TR_Regions)
   BL_Rectangles = ExpandRectangle(BL_Regions)
   BR_Rectangles = ExpandRectangle(BR_Regions)
   R = TL_Rectangles ∪ TR_Rectangles ∪ BL_Rectangles ∪ BR_Rectangles
   S(R) = ComputeRectangleScores(R)
5. Labeling the rectangles:
   For each rectangle Ri: LR(i) = i
   Repeat
       For each i, for each j: if Overlap(Ri, Rj) > 0.5 and S(Ri) > S(Rj), then LR(j) = LR(i)
   Until no change of LR
6. Labeling the regions:
   For each Reg in Regions:
       R = ArgMax_R(Overlap(R, Reg))
       L(Reg) = LR(R)
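The following Python sketch approximates the labeling steps of Algorithm 1, using scipy.ndimage for connected-component labeling. The helper names (quarter_regions, expand_rectangle) and the simplified rectangle score (number of quarter pixels covered by the rectangle) are assumptions made for illustration; the published scoring and the exact rectangle expansion may differ.

```python
import numpy as np
from scipy import ndimage


def quarter_regions(mask):
    """Connected components of one binary quarter mask -> bounding boxes (y0, x0, y1, x1)."""
    labeled, _ = ndimage.label(mask)
    return [(sl[0].start, sl[1].start, sl[0].stop, sl[1].stop)
            for sl in ndimage.find_objects(labeled) if sl is not None]


def expand_rectangle(box, quarter, shape):
    """Expand a quarter bounding box into a full-object rectangle hypothesis by mirroring
    it towards the opposite corner (simplified stand-in for ExpandRectangle)."""
    y0, x0, y1, x1 = box
    h, w = y1 - y0, x1 - x0
    if quarter == "TL":   y1, x1 = y1 + h, x1 + w
    elif quarter == "TR": y1, x0 = y1 + h, x0 - w
    elif quarter == "BL": y0, x1 = y0 - h, x1 + w
    elif quarter == "BR": y0, x0 = y0 - h, x0 - w
    return (max(y0, 0), max(x0, 0), min(y1, shape[0]), min(x1, shape[1]))


def overlap(a, b):
    """Intersection over union of two (y0, x0, y1, x1) rectangles."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)


def label_instances(tl, tr, bl, br):
    """Simplified steps 2-6 of Algorithm 1 on uint8 (0/1) quarter masks."""
    composite = (br << 3) | (bl << 2) | (tr << 1) | tl
    # Step 3: one connected labeling per composite code (1..15).
    regions = []
    for code in range(1, 16):
        labeled, n = ndimage.label(composite == code)
        regions += [labeled == i for i in range(1, n + 1)]
    # Step 4: rectangle hypotheses expanded from every quarter region, with a score.
    rects = [expand_rectangle(b, q, tl.shape)
             for q, m in (("TL", tl), ("TR", tr), ("BL", bl), ("BR", br))
             for b in quarter_regions(m)]
    if not rects:
        return np.zeros(tl.shape, dtype=np.int32)
    union = composite > 0
    scores = [np.count_nonzero(union[r[0]:r[2], r[1]:r[3]]) for r in rects]
    # Step 5: strongly overlapping rectangles inherit the label of the better-scored one.
    labels = list(range(len(rects)))
    changed = True
    while changed:
        changed = False
        for i in range(len(rects)):
            for j in range(len(rects)):
                if i != j and overlap(rects[i], rects[j]) > 0.5 \
                        and scores[i] > scores[j] and labels[j] != labels[i]:
                    labels[j] = labels[i]
                    changed = True
    # Step 6: each region takes the label of the rectangle it overlaps best.
    out = np.zeros(tl.shape, dtype=np.int32)
    for reg in regions:
        ys, xs = np.nonzero(reg)
        box = (ys.min(), xs.min(), ys.max() + 1, xs.max() + 1)
        best = max(range(len(rects)), key=lambda k: overlap(rects[k], box))
        out[reg] = labels[best] + 1
    return out
```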
3.5. Object Refinement
3.6. Vanishing-Point Computation
4. Training the Multi-Output CNN
5. Results and Evaluation
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Zhong, Z.; Li, J.; Cui, W.; Jiang, H. Fully convolutional networks for building and road extraction: Preliminary results. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1591–1594. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot MultiBox detector. arXiv 2016, arXiv:1512.02325. [Google Scholar]
- Held, D.; Thrun, S.; Savarese, S. Learning to Track at 100 FPS with Deep Regression Networks. arXiv 2016, arXiv:1604.01802. [Google Scholar]
- Hu, H.-N.; Cai, Q.-Z.; Wang, D.; Lin, J.; Sun, M.; Krahenbuhl, P.; Darrell, T.; Yu, F. Joint Monocular 3D Vehicle Detection and Tracking. arXiv 2018, arXiv:1811.10742. [Google Scholar]
- Ni, J.; Chen, Y.; Chen, Y.; Zhu, J.; Ali, D.; Cao, W. A Survey on Theories and Applications for Self-Driving Cars Based on Deep Learning Methods. Appl. Sci. 2020, 10, 2749. [Google Scholar] [CrossRef]
- Muresan, M.P.; Giosan, I.; Nedevschi, S. Stabilization and Validation of 3D Object Position Using Multimodal Sensor Fusion and Semantic Segmentation. Sensors 2020, 20, 1110. [Google Scholar] [CrossRef] [Green Version]
- Shahian Jahromi, B.; Tulabandhula, T.; Cetin, S. Real-Time Hybrid Multi-Sensor Fusion Framework for Perception in Autonomous Vehicles. Sensors 2019, 19, 4357. [Google Scholar] [CrossRef] [Green Version]
- Boulay, T. YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving. arXiv 2019, arXiv:1904.05673. [Google Scholar]
- Teichmann, M. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving. arXiv 2016, arXiv:1612.07695. [Google Scholar]
- Sistu, G.; Leang, I.; Yogamani, S. Real-time Joint Object Detection and Semantic Segmentation Network for Automated Driving. arXiv 2019, arXiv:1901.03912. [Google Scholar]
- Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv 2018, arXiv:1705.07115. [Google Scholar]
- Itu, R.; Danescu, R.G. A Self-Calibrating Probabilistic Framework for 3D Environment Perception Using Monocular Vision. Sensors 2020, 20, 1280. [Google Scholar] [CrossRef] [Green Version]
- Nedevschi, S.; Danescu, R.; Frentiu, D.; Marita, T.; Oniga, F.; Pocol, C.; Schmidt, R.; Graf, T. High accuracy stereo vision system for far distance obstacle detection. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 14–17 June 2004; pp. 292–297. [Google Scholar]
- Kumar, G.A.; Lee, J.H.; Hwang, J.; Park, J.; Youn, S.H.; Kwon, S. LiDAR and Camera Fusion Approach for Object Distance Estimation in Self-Driving Vehicles. Symmetry 2020, 12, 324. [Google Scholar] [CrossRef] [Green Version]
- Song, W.; Yang, Y.; Fu, M.; Qiu, F.; Wang, M. Real-Time Obstacles Detection and Status Classification for Collision Warning in a Vehicle Active Safety System. IEEE Trans. Intell. Transp. Syst. 2018, 19, 758–773. [Google Scholar] [CrossRef]
- Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef]
- Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 511–518. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Gao, F.; Wang, C.; Li, C. A Combined Object Detection Method with Application to Pedestrian Detection. IEEE Access 2020, 8, 194457–194465. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. arXiv 2016, arXiv:1612.08242v1. [Google Scholar]
- Redmon, J. Darknet: Open Source Neural Networks in C. Available online: http://pjreddie.com/darknet/ (accessed on 4 May 2022).
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2014, arXiv:1311.2524. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2016, arXiv:1506.01497. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Liu, H. Mask-YOLO: Efficient Instance-level Segmentation Network Based on YOLO-V2. Available online: https://ansleliu.github.io/MaskYOLO.html (accessed on 4 May 2022).
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689. [Google Scholar]
- Ulku, I.; Akagunduz, E. A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images. arXiv 2022, arXiv:1912.10230. [Google Scholar]
- Itu, R.; Borza, D.; Danescu, R. Automatic extrinsic camera parameters calibration using Convolutional Neural Networks. In Proceedings of the 2017 IEEE 13th International Conference on Intelligent Computer Communication and Processing (ICCP 2017), Cluj-Napoca, Romania, 7–9 September 2017; pp. 273–278. [Google Scholar]
- Danescu, R.; Itu, R. Camera Calibration for CNN-based Generic Obstacle Detection. In Proceedings of the 19th EPIA Conference on Artificial Intelligence, Vila Real, Portugal, 3–6 September 2019; pp. 623–636. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. arXiv 2018, arXiv:1805.04687. [Google Scholar]
- Neuhold, G.; Ollmann, T.; Bulò, S.R.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5000–5009. [Google Scholar]
- Sorensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K. Dan. Vidensk. Selsk. 1948, 5, 1–34. [Google Scholar]
- Itu, R.; Danescu, R. MONet—Multiple Output Network for Driver Assistance Systems Based on a Monocular Camera. In Proceedings of the 2020 IEEE 16th International Conference on Intelligent Computer Communication and Processing (ICCP 2020), Cluj-Napoca, Romania, 3–5 September 2020; pp. 325–330. [Google Scholar]
- Itu, R.; Danescu, R. Object detection using part based semantic segmentation. In Proceedings of the 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP 2021), Cluj-Napoca, Romania, 28–30 October 2021; pp. 227–231. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Danescu, R.; Pantilie, C.; Oniga, F.; Nedevschi, S. Particle Grid Tracking System for Stereovision Based Obstacle Perception in Driving Environments. IEEE Intell. Transp. Syst. Mag. 2012, 4, 6–20. [Google Scholar] [CrossRef]
Solution | Stereo Dataset (IoU) | CityScapes Dataset (IoU) | KITTI Dataset (IoU) |
---|---|---|---
MONet V1 with YOLO head [39] | 0.64 | 0.63 | - |
MONet V1 [39] | 0.74 | 0.80 | 0.55 |
Proposed system | 0.74 | 0.83 | 0.70 |
Mask R-CNN [29] | 0.74 | 0.84 | 0.68 |
DarkNet with YOLO V3 [25] | 0.56 | 0.57 | 0.88 |
Yolact [31] | 0.73 | 0.84 | 0.68 |
Solution | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---
Mask R-CNN [29] | 0.91 | 0.93 | 0.24 | 0.34 |
Yolact [31] | 0.87 | 0.67 | 0.17 | 0.24 |
Proposed system | 0.93 | 0.94 | 0.44 | 0.57 |
Solution | Total Number of Parameters (All Output Heads) | CNN Prediction Time (Seconds) | Total Computation Time (Predict + Extract Bounding Boxes) in Seconds |
---|---|---|---
MONet V1, with YOLO obstacle detection head | 42 million | 0.062 | 0.063 |
MONet V1 [39] | 32 million | 0.043 | 0.093 |
MONet V2 [40] | 32 million | 0.043 | 0.061 |
Proposed system | 32 million | 0.043 | 0.057 |
Proposed system, without vanishing point head | 24 million | 0.034 | 0.048 |
Proposed system, without vanishing point head and without semantic head | 16 million | 0.026 | 0.040 |
DarkNet, with YOLO V3 [25] | 41 million | 0.052 | 0.052 |
Mask R-CNN [29] | 64 million | 0.23 | 0.23 |
Yolact [31] | 50 million | 0.041 | 0.044 |
Solution | CityScapes (NormDist) | VP Highway (NormDist) | VP City (NormDist) |
---|---|---|---
MONet V1 [39] | - | 0.050 | 0.026 |
Proposed system | 0.016 | 0.013 | 0.018 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).