FusionRCNN: LiDAR-Camera Fusion for Two-Stage 3D Object Detection
Figure 1. Comparison of our method with previous LiDAR-based two-stage methods. LiDAR-based methods often struggle to correctly determine object categories and tend to produce less confident scores. The confidence score indicates the likelihood that an object (e.g., a vehicle) is present in the box and the accuracy of the bounding box. Our approach, in contrast, effectively integrates point cloud structure with dense image information, allowing it to overcome these challenges.
Figure 2. Overall architecture of FusionRCNN. Given 3D proposals, LiDAR and image features are extracted separately through the RoI feature extractor. The features are then fed into K fusion encoding layers, each comprising self-attention and cross-attention modules. Finally, the point features fused with image information are fed into a decoder, which predicts refined 3D bounding boxes and confidence scores.
Figure 3. Visualizations of attention maps. First row: input images with the predictions of object queries projected onto the images (painted in green); the circumscribed rectangles of the expanded predictions projected onto the images are painted in blue. Second row: intra-modality self-attention maps within the expanded RoI area of the image branch, with high attention weights on parts of the vehicles and the background. Third row: inter-modality cross-attention maps of the image branch, where higher attention weights fall on the vehicles. With the help of the intra-modality and inter-modality attention modules, our fusion strategy dynamically selects relevant image features as supplementary information. The two images are taken from the KITTI dataset.
Figure 4. Qualitative comparison between a LiDAR-based two-stage detector (CT3D) and our FusionRCNN on the Waymo Open Dataset. Ground truth and predictions are painted in green and blue, respectively. Three proposal vehicles marked with red circles are zoomed in and visualized on the 2D images and 3D point clouds. FusionRCNN outperforms CT3D, which uses only LiDAR input, in long-range detection.
Figure 5. Qualitative comparison between a LiDAR-based two-stage detector (CT3D) and our FusionRCNN on the KITTI dataset. Green boxes and blue boxes denote ground truth and predictions, respectively. FusionRCNN performs better than CT3D, which uses only LiDAR input, in long-range detection.
Figure 6. Framework comparison of FusionRCNN and FusionRCNN-L. FusionRCNN-L is a LiDAR-based variant that disables the intra-modality self-attention of the image branch and the inter-modality cross-attention module in the fusion encoder.
Abstract
1. Introduction
- We propose a versatile and efficient two-stage multi-modality 3D detector, FusionRCNN. The detector combines image and point cloud information within regions of interest and can enhance existing one-stage detectors with minor modifications.
- We introduce a novel transformer-based mechanism that attentively fuses pixel and point sets, providing rich contextual and structural information simultaneously (a minimal sketch of one fusion layer follows this list).
- Our method demonstrates superior performance compared with other two-stage approaches on challenging samples with sparse points, on both the KITTI and Waymo Open Dataset benchmarks.
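To make the fusion mechanism concrete, the following is a minimal, illustrative sketch of a single fusion encoding layer written with PyTorch's standard nn.MultiheadAttention. It is not the released implementation: the feature dimension, head count, post-norm layout, and the simplification of applying self-attention only to the point branch (the paper uses intra-modality self-attention in both branches) are all assumptions made for brevity.

```python
import torch
import torch.nn as nn


class FusionEncoderLayer(nn.Module):
    """Hypothetical single fusion encoding layer; K such layers are stacked in the paper."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, point_feat, image_feat):
        # point_feat: (num_roi, num_points, d) per-proposal point features
        # image_feat: (num_roi, h * w, d)      flattened per-proposal image RoI features
        # intra-modality self-attention over the point set
        q = self.norm1(point_feat + self.self_attn(point_feat, point_feat, point_feat)[0])
        # inter-modality cross-attention: point queries attend to image keys/values
        q = self.norm2(q + self.cross_attn(q, image_feat, image_feat)[0])
        return self.norm3(q + self.ffn(q))


# toy usage: 2 proposals, 64 sampled points each, a 7x7 image RoI grid
layer = FusionEncoderLayer()
fused = layer(torch.randn(2, 64, 256), torch.randn(2, 49, 256))  # -> (2, 64, 256)
```

In the full model, K such layers are stacked, and the decoder attends to the resulting image-enhanced point features to regress box residuals and confidence scores.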
2. Related Works
2.1. LiDAR-Based 3D Detection
2.2. Camera-Based 3D Detection
2.3. LiDAR-Camera 3D Detection
2.4. Transformer for Object Detection
3. Method
3.1. RoI Feature Extractor
3.2. Fusion Encoder
3.2.1. Intra-Modality Self-Attention
3.2.2. Inter-Modality Cross-Attention
3.3. Decoder
3.4. Learning Objectives
4. Experiments
4.1. Implementation Details
4.2. Results on Waymo
4.3. Results on KITTI
4.4. Ablation Studies
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
3D | Three dimensional
LiDAR | Light detection and ranging
BEV | Bird’s eye view
RoI | Regions of interest
mAP | Mean average precision
AP | Average precision
mAPH | Mean average precision weighted by heading
IoU | Intersection over union
RPN | Region proposal network
GT | Ground truth
Appendix A
Algorithm A1. Proposed RoI Feature Extractor in FusionRCNN.
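The algorithm listing itself is not reproduced here. For orientation only, below is a minimal sketch of the LiDAR side of such an RoI feature extractor: points falling inside each proposal, enlarged by an expansion ratio k, are gathered and sampled to a fixed size. Box rotation, the per-point embedding, and the image branch are omitted, so this should be read as an assumption-laden illustration rather than Algorithm A1.

```python
import torch


def gather_points_in_boxes(points, boxes3d, k=2.0, num_sample=256):
    """points:  (N, 3) LiDAR points
       boxes3d: (M, 6) proposals as (cx, cy, cz, dx, dy, dz); rotation is ignored here
       k:       box expansion ratio applied before cropping"""
    centers, dims = boxes3d[:, :3], boxes3d[:, 3:] * k        # enlarge each box by k
    lo, hi = centers - dims / 2, centers + dims / 2
    roi_points = []
    for m in range(boxes3d.shape[0]):
        inside = ((points >= lo[m]) & (points <= hi[m])).all(dim=1)
        pts = points[inside]
        if pts.shape[0] == 0:                                 # empty proposal: zero placeholder
            pts = torch.zeros(1, 3)
        idx = torch.randint(pts.shape[0], (num_sample,))      # sample a fixed number of points
        roi_points.append(pts[idx])
    return torch.stack(roi_points)                            # (M, num_sample, 3)
```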
References
- Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019.
- Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019.
- Li, Z.; Wang, F.; Wang, N. Lidar r-cnn: An efficient and universal 3d object detector. In Proceedings of the CVPR, Virtual, 19–25 June 2021.
- Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J. Improving 3d object detection with channel-wise transformer. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021.
- Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2647–2664.
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In Proceedings of the AAAI, Virtual, 2–9 February 2021.
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the CVPR, Washington, DC, USA, 14–19 June 2020.
- Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. arXiv 2021, arXiv:2102.00463.
- Kuras, A.; Brell, M.; Liland, K.H.; Burud, I. Multitemporal Feature-Level Fusion on Hyperspectral and LiDAR Data in the Urban Environment. Remote Sens. 2023, 15, 632.
- Shrestha, B.; Stephen, H.; Ahmad, S. Impervious surfaces mapping at city scale by fusion of radar and optical data through a random forest classifier. Remote Sens. 2021, 13, 3040.
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI, New York, NY, USA, 7–12 February 2020.
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022.
- Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the CVPR, Washington, DC, USA, 14–19 June 2020.
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
- Wang, D.Z.; Posner, I. Voting for voting in online point cloud object detection. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Volume 1, pp. 10–15.
- Song, S.; Xiao, J. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816.
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017.
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017.
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413.
- Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the CVPR, Washington, DC, USA, 14–19 June 2020.
- Fan, L.; Xiong, X.; Wang, F.; Wang, N.; Zhang, Z. Rangedet: In defense of range view for lidar-based 3d object detection. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021.
- Sun, P.; Wang, W.; Chai, Y.; Elsayed, G.; Bewley, A.; Zhang, X.; Sminchisescu, C.; Anguelov, D. Rsn: Range sparse net for efficient, accurate lidar 3d object detection. In Proceedings of the CVPR, Virtual, 19–25 June 2021.
- Wang, T.; Zhu, X.; Pang, J.; Lin, D. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 913–922.
- Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
- Wang, T.; Xinge, Z.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning, PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1475–1485.
- Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2781–2790.
- Wang, Y.; Guizilini, V.C.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Proceedings of the Conference on Robot Learning, PMLR, Cambridge, MA, USA, 16–18 November 2022; pp. 180–191.
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 531–548.
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Graph-DETR3D: Rethinking overlapping regions for multi-view 3D object detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5999–6008.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 194–210.
- Pan, B.; Sun, J.; Leung, H.Y.T.; Andonian, A.; Zhou, B. Cross-view semantic segmentation for sensing surroundings. IEEE Robot. Autom. Lett. 2020, 5, 4867–4873.
- Roddick, T.; Cipolla, R. Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 14–19 June 2020; pp. 11138–11147.
- Roddick, T.; Kendall, A.; Cipolla, R. Orthographic feature transform for monocular 3d object detection. arXiv 2018, arXiv:1811.08188.
- Huang, J.; Huang, G.; Zhu, Z.; Du, D. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790.
- Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J.M. M^2BEV: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv 2022, arXiv:2204.05088.
- Reading, C.; Harakeh, A.; Chae, J.; Waslander, S.L. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8555–8564.
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv 2022, arXiv:2206.10092.
- Huang, J.; Huang, G. Bevdet4d: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054.
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–18.
- Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, Q.; Wang, T.; Zhang, X.; Sun, J. Petrv2: A unified framework for 3d perception from multi-camera images. arXiv 2022, arXiv:2206.01256.
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018.
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the CVPR, Washington, DC, USA, 14–19 June 2020.
- Wang, C.; Ma, C.; Zhu, M.; Yang, X. Pointaugmenting: Cross-modal augmentation for 3d object detection. In Proceedings of the CVPR, Virtual, 19–25 June 2021.
- Meyer, G.P.; Charland, J.; Hegde, D.; Laddha, A.; Vallespi-Gonzalez, C. Sensor fusion for joint 3d object detection and semantic segmentation. In Proceedings of the CVPRW, Long Beach, CA, USA, 16–17 June 2019.
- Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. FusionPainting: Multimodal fusion with adaptive attention for 3d object detection. In Proceedings of the ITSC, Indianapolis, IN, USA, 19–22 September 2021.
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656.
- Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3d: A unified sensor fusion framework for 3d detection. arXiv 2022, arXiv:2203.10642.
- Girshick, R. Fast r-cnn. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14454–14463.
- Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318.
- Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3164–3173.
- Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.X.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing single stride 3d object detector with sparse transformer. In Proceedings of the CVPR, New Orleans, LA, USA, 19–24 June 2022.
- Sun, P.; Tan, M.; Wang, W.; Liu, C.; Xia, F.; Leng, Z.; Anguelov, D. Swformer: Sparse window transformer for 3d object detection in point clouds. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 426–442.
- Dong, S.; Ding, L.; Wang, H.; Xu, T.; Xu, X.; Wang, J.; Bian, Z.; Wang, Y.; Li, J. MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2022.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the CVPR, Los Alamitos, CA, USA, 16–21 June 2012.
- Team, O.D. OpenPCDet: An Open-source Toolbox for 3D Object Detection from Point Clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 12 January 2023).
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the NeurIPS, Lake Tahoe, NV, USA, 3–6 December 2012.
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019.
- Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Proceedings of the CoRL, Auckland, New Zealand, 14–18 December 2020.
- Wang, Y.; Fathi, A.; Kundu, A.; Ross, D.A.; Pantofaru, C.; Funkhouser, T.; Solomon, J. Pillar-based object detection for autonomous driving. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020.
- Mao, J.; Niu, M.; Bai, H.; Liang, X.; Xu, H.; Xu, C. Pyramid r-cnn: Towards better performance and adaptability for 3d object detection. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021.
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the CVPR, Virtual, 19–25 June 2021.
- Qi, C.R.; Zhou, Y.; Najibi, M.; Sun, P.; Vo, K.; Deng, B.; Anguelov, D. Offboard 3d object detection from point cloud sequences. In Proceedings of the CVPR, Virtual, 19–25 June 2021.
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the IROS, Madrid, Spain, 1–5 October 2018.
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020.
- He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the CVPR, Washington, DC, USA, 14–19 June 2020.
Difficulty | Method | Reference | 3D Vehicle Overall | 3D Vehicle 0–30 m | 3D Vehicle 30–50 m | 3D Vehicle 50 m–Inf | BEV Vehicle Overall | BEV Vehicle 0–30 m | BEV Vehicle 30–50 m | BEV Vehicle 50 m–Inf
---|---|---|---|---|---|---|---|---|---|---
LEVEL_1 | SECOND * [16] | Sensors 2018 | 72.46 | 90.30 | 70.52 | 46.93 | 89.42 | 96.58 | 88.76 | 77.55
LEVEL_1 | PointPillar [67] | CVPR 2019 | 56.62 | 81.01 | 51.75 | 27.94 | 75.57 | 92.10 | 74.06 | 55.47
LEVEL_1 | MVF [68] | CoRL 2020 | 62.93 | 86.30 | 60.02 | 36.02 | 80.40 | 93.59 | 79.21 | 63.09
LEVEL_1 | Pillar-OD [69] | arXiv 2020 | 69.80 | 88.53 | 66.50 | 42.93 | 87.11 | 95.78 | 84.87 | 72.12
LEVEL_1 | PV-RCNN [7] | CVPR 2020 | 70.30 | 91.92 | 69.21 | 42.17 | 82.96 | 97.35 | 82.99 | 64.97
LEVEL_1 | Voxel-RCNN [6] | AAAI 2021 | 75.59 | 92.49 | 74.09 | 53.15 | 88.19 | 97.62 | 87.34 | 77.70
LEVEL_1 | LiDAR-RCNN [3] | CVPR 2021 | 76.00 | 92.10 | 74.60 | 54.50 | 90.10 | 97.00 | 89.50 | 78.90
LEVEL_1 | Pyramid R-CNN [70] | ICCV 2021 | 76.30 | 92.67 | 74.91 | 54.54 | - | - | - | -
LEVEL_1 | CT3D [4] | ICCV 2021 | 76.30 | 92.51 | 75.07 | 55.36 | 90.50 | 97.64 | 88.06 | 78.89
LEVEL_1 | FusionRCNN (Ours) | - | 78.91 | 92.38 | 77.82 | 58.81 | 91.94 | 97.12 | 91.22 | 85.22
LEVEL_2 | SECOND * [16] | Sensors 2018 | 64.14 | 89.04 | 64.14 | 35.98 | 82.23 | 95.63 | 83.26 | 64.29
LEVEL_2 | PV-RCNN [7] | CVPR 2020 | 65.36 | 91.58 | 65.13 | 36.46 | 77.45 | 94.64 | 80.39 | 55.39
LEVEL_2 | Voxel-RCNN [6] | AAAI 2021 | 66.59 | 91.74 | 67.89 | 40.80 | 81.07 | 96.99 | 81.37 | 63.26
LEVEL_2 | LiDAR-RCNN [3] | CVPR 2021 | 68.30 | 91.30 | 68.50 | 42.40 | 81.70 | 94.30 | 82.30 | 65.80
LEVEL_2 | CT3D [4] | ICCV 2021 | 69.04 | 91.76 | 68.93 | 42.60 | 81.74 | 97.05 | 82.22 | 64.34
LEVEL_2 | FusionRCNN (Ours) | - | 70.33 | 91.22 | 71.47 | 46.21 | 84.39 | 96.22 | 86.15 | 70.18
Difficulty | Method | Vehicle mAP | Vehicle mAPH | Pedestrian mAP | Pedestrian mAPH | Cyclist mAP | Cyclist mAPH
---|---|---|---|---|---|---|---
LEVEL_1 | SECOND [16] | 70.96 | 70.34 | 65.23 | 54.22 | 57.13 | 55.62
LEVEL_1 | +FusionRCNN | 77.67 (+6.71%) | 77.10 (+6.76%) | 70.63 (+5.40%) | 61.88 (+7.66%) | 67.55 (+10.42%) | 66.17 (+10.55%)
LEVEL_1 | CenterPoint [71] | 72.76 | 72.23 | 74.19 | 67.96 | 71.04 | 69.79
LEVEL_1 | +FusionRCNN | 75.09 (+2.33%) | 74.66 (+2.43%) | 80.84 (+6.65%) | 75.37 (+7.41%) | 71.80 (+0.76%) | 70.79 (+1.00%)
LEVEL_2 | SECOND | 62.58 | 62.02 | 57.22 | 47.49 | 54.97 | 53.53
LEVEL_2 | +FusionRCNN | 68.84 (+6.26%) | 68.32 (+6.30%) | 62.67 (+5.45%) | 54.66 (+7.17%) | 64.67 (+9.70%) | 63.36 (+9.83%)
LEVEL_2 | CenterPoint | 64.91 | 64.42 | 66.03 | 60.34 | 68.49 | 67.28
LEVEL_2 | +FusionRCNN | 66.27 (+1.36%) | 65.88 (+1.46%) | 72.46 (+6.43%) | 67.32 (+6.98%) | 69.14 (+0.65%) | 68.17 (+0.89%)
Method | Modality | Vehicle (Normal) | Vehicle (Strict)
---|---|---|---
PointPillars [67] | L | 72.08 | 36.83
PV-RCNN * [7] | L | 70.47 | 39.16
MVF++ * [72] | L | 74.64 | 43.30
SST [59] | L | 74.22 | 44.08
FusionRCNN (Ours) | LC | 78.91 | 47.02
Method | Modality | Car 3D (Easy) | Car 3D (Mod.) | Car 3D (Hard)
---|---|---|---|---
MV3D [19] | LC | 71.29 | 62.68 | 56.56
ContFuse [51] | LC | - | 73.25 | -
AVOD-FPN [73] | LC | - | 74.44 | -
F-PointNet [46] | LC | 83.76 | 70.92 | 63.65
PI-RCNN [11] | LC | 88.27 | 78.53 | 77.75
3D-CVF at SPA [74] | LC | 89.67 | 79.88 | 78.47
PointPillars [67] | L | 86.62 | 76.06 | 68.91
STD [2] | L | 89.70 | 79.80 | 79.30
PointRCNN [1] | L | 88.88 | 78.63 | 77.38
SA-SSD [75] | L | 90.15 | 79.91 | 78.78
3DSSD [23] | L | 89.71 | 79.45 | 78.67
PV-RCNN [7] | L | 89.35 | 83.69 | 78.70
Voxel-RCNN [6] | L | 89.41 | 84.52 | 78.93
Pyramid R-CNN [70] | L | 89.37 | 84.38 | 78.84
CT3D [4] | L | 89.54 | 86.06 | 78.99
SECOND [16] | L | 88.61 | 78.62 | 77.22
+FusionRCNN (Ours) | LC | 89.90 (+1.29%) | 86.45 (+7.93%) | 79.32 (+2.10%)
Method | Modality | Car 3D (Easy) | Car 3D (Mod.) | Car 3D (Hard)
---|---|---|---|---
MV3D [19] | LC | 74.97 | 63.63 | 54.00
ContFuse [51] | LC | 83.68 | 68.78 | 61.67
AVOD-FPN [73] | LC | 83.07 | 71.76 | 65.73
F-PointNet [46] | LC | 82.19 | 69.79 | 60.59
PI-RCNN [11] | LC | 84.37 | 74.82 | 70.03
3D-CVF at SPA [74] | LC | 89.20 | 80.67 | 77.15
PointPillars [67] | L | 82.58 | 74.31 | 68.99
STD [2] | L | 87.95 | 79.71 | 75.09
PointRCNN [1] | L | 86.96 | 75.64 | 70.70
SA-SSD [75] | L | 88.75 | 79.79 | 74.16
3DSSD [23] | L | 88.36 | 79.57 | 74.55
PV-RCNN [7] | L | 90.25 | 81.43 | 76.82
Voxel-RCNN [6] | L | 90.90 | 81.62 | 77.06
CT3D [4] | L | 87.83 | 81.77 | 77.16
SECOND [16] | L | 83.34 | 72.55 | 65.82
+FusionRCNN (Ours) | LC | 88.12 (+4.78%) | 81.98 (+9.43%) | 77.53 (+11.71%)
Method | Overall | 0–30 m | 30–50 m | 50 m–Inf | Latency (ms)
---|---|---|---|---|---
FusionRCNN-L | 90.25 | 96.58 | 89.24 | 80.61 | 125
FusionRCNN | 91.94 (+1.69%) | 97.12 (+0.54%) | 91.22 (+1.98%) | 85.22 (+4.61%) | 185 (+60)
Method | LEVEL_1 3D AP | LEVEL_1 APH | LEVEL_2 3D AP | LEVEL_2 APH
---|---|---|---|---
SECOND [16] | 72.46 | 71.87 | 64.14 | 63.60
+FusionRCNN | 78.91 (+6.45%) | 78.39 (+6.52%) | 70.65 (+6.51%) | 70.16 (+6.56%)
PointPillar [67] | 72.27 | 71.69 | 63.85 | 63.33
+FusionRCNN | 74.67 (+2.40%) | 74.10 (+2.41%) | 65.96 (+2.11%) | 65.44 (+2.11%)
CenterPoint [71] | 72.08 | 71.53 | 63.55 | 63.06
+FusionRCNN | 77.63 (+5.55%) | 77.16 (+5.63%) | 69.26 (+5.71%) | 68.83 (+5.77%)
Output Size | LEVEL_1 3D AP/APH | LEVEL_2 3D AP/APH
---|---|---
 | 78.88/78.36 | 70.63/70.14
 | 78.82/78.30 | 70.57/70.10
 | 78.91/78.39 | 70.65/70.16
 | 78.87/78.37 | 70.62/70.13
Expansion Ratio | Operation | LEVEL_1 3D AP/APH | LEVEL_2 3D AP/APH
---|---|---|---
k = 1.2 | RoIAlign | 78.47/77.95 | 69.88/69.71
k = 1.2 | RoIPooling | 78.41/77.88 | 69.81/69.63
k = 1.5 | RoIAlign | 78.62/78.09 | 70.11/69.85
k = 1.5 | RoIPooling | 78.63/78.11 | 70.36/69.87
k = 2.0 | RoIAlign | 78.83/78.31 | 70.57/70.11
k = 2.0 | RoIPooling | 78.91/78.39 | 70.65/70.16
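For reference, the two design choices compared above can be expressed in a few lines. The sketch below is an illustration only (assuming torchvision and a single-image feature map, not the released code): it expands each projected proposal about its center by the ratio k and then crops features with either RoIAlign or RoIPooling.

```python
import torch
from torchvision.ops import roi_align, roi_pool


def crop_image_roi(feat, boxes, k=2.0, out_size=7, use_align=False):
    """feat:  (1, C, H, W) image feature map
       boxes: (M, 4) projected proposals as (x1, y1, x2, y2) in feature-map coordinates"""
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    half = (boxes[:, 2:] - boxes[:, :2]) * k / 2            # enlarge by the expansion ratio k
    expanded = torch.cat([centers - half, centers + half], dim=1)
    rois = torch.cat([torch.zeros(len(boxes), 1), expanded], dim=1)  # prepend batch index 0
    op = roi_align if use_align else roi_pool
    return op(feat, rois, output_size=out_size)             # (M, C, out_size, out_size)
```

With k = 2.0 and RoIPooling, this corresponds to the best-performing setting in the table above.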
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as:
Xu, X.; Dong, S.; Xu, T.; Ding, L.; Wang, J.; Jiang, P.; Song, L.; Li, J. FusionRCNN: LiDAR-Camera Fusion for Two-Stage 3D Object Detection. Remote Sens. 2023, 15, 1839. https://doi.org/10.3390/rs15071839