Abstract
Current LiDAR-camera 3D detectors adopt a 3D-2D design pattern. However, this paradigm ignores the dimensional gap between heterogeneous modalities (e.g., coordinate system, data distribution), leading to difficulties in marrying the geometric and semantic information of two modalities. Moreover, conventional 3D convolution neural networks (3D CNNs) backbone leads to limited receptive fields, which discourages the interaction between multi-modal features, especially in capturing long-range object context information. To this end, we propose a Unified Representation Transformer-based multi-modal 3D detector (URFormer) with better representation scheme and cross-modality interaction, which consists of three crucial components. First, we propose Depth-Aware Lift Module (DALM), which exploits depth information in 2D modality and lifts 2D representation into 3D at the pixel level, and naturally unifies inconsistent multi-modal representation. Second, we design a Sparse Transformer (SPTR) to enlarge effective receptive fields and capture long-range object semantic features for better interaction in multi-modal features. Finally, we design Unified Representation Fusion (URFusion) to integrate cross-modality features in a fine-grain manner. Extensive experiments are conducted to demonstrate the effectiveness of our method on KITTI benchmark and show remarkable performance compared to the state-of-the-art methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915 (2017)
Chen, Y., Li, Y., Zhang, X., Sun, J., Jia, J.: Focal sparse convolutional networks for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5428–5437 (2022)
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel R-CNN: towards high performance voxel-based 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1201–1209 (2021)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
He, C., Li, R., Li, S., Zhang, L.: Voxel set transformer: a set-to-set approach to 3D object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8417–8427 (2022)
Hu, M., Wang, S., Li, B., Ning, S., Fan, L., Gong, X.: Penet: towards precise and efficient image guided depth completion. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13656–13662. IEEE (2021)
Huang, T., Liu, Z., Chen, X., Bai, X.: EPNet: enhancing point features with image semantics for 3D object detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_3
Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R.: Multi-task multi-sensor fusion for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7345–7353 (2019)
Mao, J., et al.: Voxel transformer for 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3164–3173 (2021)
Pang, S., Morris, D., Radha, H.: Fast-clocs: fast camera-lidar object candidates fusion for 3D object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 187–196 (2022)
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)
Sheng, H., et al.: Improving 3D object detection with channel-wise transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2743–2752 (2021)
Shi, S., et al.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538 (2020)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–779 (2019)
Shi, S., Wang, Z., Shi, J., Wang, X., Li, H.: From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 43(8), 2647–2664 (2020)
Song, Z., Jia, C., Yang, L., Wei, H., Liu, L.: Graphalign++: an accurate feature alignment by graph matching for multi-modal 3D object detection. IEEE Trans. Circuits Syst. Video Technol. (2023)
Song, Z., Wei, H., Jia, C., Xia, Y., Li, X., Zhang, C.: Vp-net: voxels as points for 3-D object detection. IEEE Trans. Geosci. Remote Sens. 61, 1–12 (2023). https://doi.org/10.1109/TGRS.2023.3271020
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: Pointpainting: sequential fusion for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4604–4612 (2020)
Wang, L., et al.: SAT-GCN: self-attention graph convolutional network-based 3D object detection for autonomous driving. Knowl.-Based Syst. 259, 110080 (2023)
Wang, L., et al.: Multi-modal 3D object detection in autonomous driving: a survey and taxonomy. IEEE Trans. Intell. Veh. 8(7), 3781–3798 . https://doi.org/10.1109/TIV.2023.3264658
Wu, X., et al.: Sparse fuse dense: towards high quality 3D detection with depth completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5418–5427 (2022)
Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
Yang, L., et al.: Bevheight: a robust framework for vision-based roadside 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21611–21620, June 2023
Yin, T., Zhou, X., Krähenbühl, P.: Multimodal virtual point 3D detection. In: Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507 (2021)
Zhang, X., et al.: Ri-fusion: 3D object detection using enhanced point features with range-image fusion for autonomous driving. IEEE Trans. Instrum. Meas. 72, 1–13 (2022)
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (2023YJS019) and the STI 2030-Major Projects under Grant 2021ZD0201404 (Funded by Jun Xie and Zhepeng Wang from Lenovo Research).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, G., Xie, J., Liu, L., Wang, Z., Yang, K., Song, Z. (2024). URFormer: Unified Representation LiDAR-Camera 3D Object Detection with Transformer. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_32
Download citation
DOI: https://doi.org/10.1007/978-981-99-8435-0_32
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8434-3
Online ISBN: 978-981-99-8435-0
eBook Packages: Computer ScienceComputer Science (R0)